Tallefer / prssr

Automatically exported from code.google.com/p/prssr
GNU General Public License v2.0

Relative URLs in items not working properly #37

Closed. GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Sign up to a feed using relative URIs to link to images
2. Set up caching for images
3. Update feed

What is the expected output? What do you see instead?
Expected result is caching the image. The result is "Error downloading file
'/wp-content/...jpg'."

What version of the product are you using? On what operating system?
Version 1.4.1 on Windows Mobile 2003 SE

Please provide any additional information below.
I'm not sure whether the feed's use of relative links is standards
compliant. I tried Google Reader and it works there. Sample URL:
http://www.markenblog.de/feed/

Original issue reported on code.google.com by eljo...@gmail.com on 8 Aug 2008 at 8:02

GoogleCodeExporter commented 9 years ago
No, relative links are not allowed, see
http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.markenblog.de%2Ffeed%2F

So GReader must be taking the server name from the <link> tag...

Original comment by and...@gmail.com on 8 Aug 2008 at 8:22

GoogleCodeExporter commented 9 years ago
Still, the feedvalidator confirms the feed is valid. It states that it's
recommended to use absolute links since some feed readers cannot process
relative ones. I've been looking into the RSS specification, but it does not
clarify matters, except for stating that all RSS files have to be XML 1.0
compliant.

XML 1.0 (http://www.w3.org/TR/REC-xml/#sec-external-ent) itself states that
"relative URIs are relative to the location of the resource within which the
entity declaration occurs". As far as I understand it, this means any and all
relative URIs should be resolved to full URIs with the help of either the
document's location or external information.
Additionally, the (not really relevant) XML Base specification
(http://www.w3.org/TR/xmlbase/#syntax) states that unless an xml:base
attribute is present, relative URIs should be resolved using the XML file's
URL (in this case the feed file). The full sequence is shown in 4.1.
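
To illustrate that rule, here is a minimal sketch in Python (not prssr's
actual code) that resolves a relative image path against the feed's own URL.
The feed URL is the one from this report; the image path is a made-up
stand-in for the elided '/wp-content/...jpg' from the error message.

```python
from urllib.parse import urljoin

# Feed URL taken from the report above.
FEED_URL = "http://www.markenblog.de/feed/"

# Hypothetical path; the real one is elided in the error message.
relative_src = "/wp-content/uploads/example.jpg"

# Without xml:base, the document's own URL serves as the base.
resolved = urljoin(FEED_URL, relative_src)
print(resolved)  # http://www.markenblog.de/wp-content/uploads/example.jpg
```

Because the path starts with a slash, only the scheme and host of the base
survive, so any base on the same host would give the same result here.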

Original comment by eljo...@gmail.com on 8 Aug 2008 at 9:04

GoogleCodeExporter commented 9 years ago
OK, let me give you a counterexample. Let's say we have two blogs, each with
one blog post, and both posts use relative URLs. Now we have an aggregating
feed that includes both blogs, so it contains two articles. The feed itself
lives at one URL, and the two articles live at two other URLs. Thus, the
location of the feed cannot be used as the base, and the only thing we can use
is the <link> tag, which contains the URL of the article.
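
The counterexample can be sketched as follows. This is an illustration only:
every host and path is hypothetical, and the code just compares the two
candidate bases for each item.

```python
from urllib.parse import urljoin

# All hosts and paths below are hypothetical stand-ins for the scenario.
AGGREGATOR_FEED_URL = "http://planet.example/feed/"
ITEMS = [
    {"link": "http://blog-a.example/2008/08/first-post/", "img": "/images/a.jpg"},
    {"link": "http://blog-b.example/archive/second-post.html", "img": "/images/b.jpg"},
]


def resolve_all(items, feed_url):
    """Return (resolved against feed URL, resolved against item link) pairs."""
    return [
        (urljoin(feed_url, item["img"]), urljoin(item["link"], item["img"]))
        for item in items
    ]


for by_feed, by_link in resolve_all(ITEMS, AGGREGATOR_FEED_URL):
    # The feed-based result points at the aggregator's host, which does not
    # serve the image; the link-based result points at the original blog.
    print(by_feed, "vs", by_link)
```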

My point here is that taking the base URL from <link> is more robust and more
likely to work in such insane situations. During development I had to write a
couple of patches that worked around problems in feeds, simply because the
feeds did not follow the standards. I like standards, but people
(implementors) must follow them and not interpret them on their own. Nothing
personal, just a general complaint :-)

Of course, the problem is reproducible, so it will be quite easy to fix it...

Original comment by and...@gmail.com on 10 Aug 2008 at 6:11

GoogleCodeExporter commented 9 years ago
Your example of an aggregator exposes the very weak spot of using the <link>
tag (and the base URL). The (very) well known feedburner feeds use a
redirection URL as the item's <link> target. This means that using the <link>
URL as a base will break those feeds (if they're already "broken", that is,
if they use relative URLs ;). Now, feedburner adds an additional tag,
<feedburner:origLink>, that could be used for a dirty workaround, but so far
the only other reliable URL in those feeds would be the <link> tag of
<channel>. This, however, will not work with items joined from multiple feeds
either. Fortunately I did not encounter feedburner feeds with relative URLs.
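
A rough sketch of that workaround, assuming the feedburner prefix is bound to
the namespace URI shown below (an assumption to verify against a real feed)
and using hypothetical hosts: prefer <feedburner:origLink> over the item's
<link> when picking a base.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the feedburner prefix; check it against a real feed.
FB_NS = "http://rssnamespace.org/feedburner/ext/1.0"

# Hypothetical item: <link> is a redirect, origLink is the real article URL.
ITEM_XML = """<item xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <link>http://redirect.example/~r/blog/1</link>
  <feedburner:origLink>http://blog.example/2008/08/post/</feedburner:origLink>
</item>"""


def item_base(item):
    """Return origLink if present and non-empty, else the item's <link>."""
    orig = item.findtext("{%s}origLink" % FB_NS)
    if orig and orig.strip():
        return orig.strip()
    link = item.findtext("link")
    return link.strip() if link and link.strip() else None


item = ET.fromstring(ITEM_XML)
base = item_base(item)
print(base)
```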

So I guess whatever you choose to implement will break in one case or another,
be it the <link> of the item, the <link> of the channel or the URI of the
feed. Personally, I agree: it would be really nice if every content provider
adhered to the recommendations, since we lack a more thorough RSS
specification. And I also agree that implementing this "feature" is an
unpleasant workaround for a gap in the spec.

BTW: I've found some more feeds that rewrite the <link> target to use an
off-site "counter" service. Fortunately those do not use relative links within
the HTML code. (E.g. slashdot, boingboing.net, both probably using a
customized feedburner service).

Original comment by eljo...@gmail.com on 10 Aug 2008 at 6:31

GoogleCodeExporter commented 9 years ago
One more thing I found is this:
http://cyber.law.harvard.edu/rss/relativeURI.html

Briefly, when xml:base is present, use it. If not, use /rss/channel/link for the
base, which I think is the way to go.
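
As a sketch of that order of precedence (not the actual prssr fix): look for
xml:base first, then fall back to /rss/channel/link. The sample feeds and
hosts are hypothetical, and the helper only checks <channel> and <rss>
rather than implementing full per-element xml:base scoping.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def base_url(root):
    """xml:base on <channel> or <rss> if present, else /rss/channel/link."""
    channel = root.find("channel")
    for element in (channel, root):
        if element is not None and element.get(XML_BASE):
            return element.get(XML_BASE)
    if channel is not None:
        link = channel.findtext("link")
        if link and link.strip():
            return link.strip()
    return None


# Hypothetical feeds: one without xml:base, one with it.
PLAIN = '<rss version="2.0"><channel><link>http://blog.example/</link></channel></rss>'
WITH_BASE = ('<rss version="2.0" xml:base="http://cdn.example/base/">'
             '<channel><link>http://blog.example/</link></channel></rss>')

plain_result = urljoin(base_url(ET.fromstring(PLAIN)), "/wp-content/pic.jpg")
based_result = urljoin(base_url(ET.fromstring(WITH_BASE)), "/wp-content/pic.jpg")
print(plain_result)
print(based_result)
```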

Original comment by and...@gmail.com on 10 Aug 2008 at 7:31

GoogleCodeExporter commented 9 years ago

Original comment by and...@gmail.com on 11 Aug 2008 at 6:40

GoogleCodeExporter commented 9 years ago
Did the fix, but the HTML included in http://www.markenblog.de/feed/ is
confusing libsgml, so it is not parsed correctly. For now this feed will not
work correctly until I fix the libsgml parser.

The problem is that there is no space between attributes, example:
<a href="http://www.ipeg.eu/blog/?p=304"target=_blank">
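
For illustration, a tolerant attribute scanner that does not rely on
whitespace between attributes could look like the sketch below. It is plain
Python with a regular expression, unrelated to libsgml's real API, and the
function name is made up.

```python
import re

# name = "value" | 'value' | bare-value; whitespace between pairs is optional.
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def parse_attributes(tag):
    """Collect attribute name/value pairs from a single start tag."""
    attrs = {}
    for match in ATTR_RE.finditer(tag):
        name = match.group(1).lower()
        value = match.group(2) or match.group(3) or match.group(4) or ""
        attrs[name] = value
    return attrs


# The tag quoted above, including its missing space and stray quote.
tag = '<a href="http://www.ipeg.eu/blog/?p=304"target=_blank">'
attrs = parse_attributes(tag)
print(attrs)
```

The scanner consumes each quoted value as a unit, so the next attribute name
is found even when it follows the closing quote directly.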

Original comment by and...@gmail.com on 11 Aug 2008 at 10:19

GoogleCodeExporter commented 9 years ago
I'm not sure whether this is expected: in r69 it tries to fetch URLs like
http:///wp-content/... - note the triple slash and the lack of the hostname.
So for this feed it's not working.
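
A URL of that shape has an empty host component, which can be detected before
any download is attempted. The check below is a hedged sketch with a made-up
helper name and a hypothetical image path, not code from prssr.

```python
from urllib.parse import urlsplit


def has_host(url):
    """True only for http(s) URLs that actually name a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical path standing in for the elided one in the report.
broken = "http:///wp-content/example.jpg"
fixed = "http://www.markenblog.de/wp-content/example.jpg"

print(has_host(broken), has_host(fixed))  # False True
```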

Original comment by eljo...@gmail.com on 25 Aug 2008 at 5:57

GoogleCodeExporter commented 9 years ago
It is the result of the combination of libsgml + no-space-between-attrs as I
wrote in #7. So the code is OK, but there is another bug in libsgml which has
to be fixed.

Original comment by and...@gmail.com on 25 Aug 2008 at 1:16

GoogleCodeExporter commented 9 years ago
Ah, so a third fallback to the feed's URL is not implemented (yet).
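
Such a fallback chain might be sketched like this; the function name and
parameters are hypothetical, and the image path again stands in for the
elided one. The first candidate that names a host wins, with the feed's own
URL as the last resort.

```python
from urllib.parse import urljoin, urlsplit


def pick_base(xml_base, channel_link, feed_url):
    """Return the first candidate that is a URL with a host, else None."""
    for candidate in (xml_base, channel_link, feed_url):
        if candidate and urlsplit(candidate).netloc:
            return candidate
    return None


# No xml:base, and a channel link that came out empty after parsing.
base = pick_base(None, "", "http://www.markenblog.de/feed/")
resolved = urljoin(base, "/wp-content/example.jpg")
print(resolved)  # http://www.markenblog.de/wp-content/example.jpg
```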

Original comment by eljo...@gmail.com on 25 Aug 2008 at 1:27

GoogleCodeExporter commented 9 years ago
If I'm right, this feed was fixed and there are no longer any missing spaces
between attributes. Thus I do not have to patch libsgml.

Original comment by and...@gmail.com on 13 Sep 2008 at 5:53