dimones / feedparser

Automatically exported from code.google.com/p/feedparser
Other
0 stars 0 forks source link

Feedparser using Enclosure text for GUID #131

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
>>> feedparser.__version__
'4.1'
>>> d = feedparser.parse('http://news.sky.com/sky-news/rss/world-news/
rss.xml')
>>> d.entries[0].id
u'http://news.sky.com/sky-news/content/StaticFile/jpg/2008/Aug/
Week1/15071073.jpg'
>>> d.entries[0].enclosures
[{'width': u'95', 'length': u'123456', 'href': u'http://news.sky.com/sky-
news/content/StaticFile/jpg/2008/Aug/Week1/15071073.jpg', 'type': 
u'letterboximage(95x53)', 'height': u'53'}]

As per the attached grab of Sky News, there is no GUID for any of the 
articles, but feedparser is choosing the enclosure's text to use as a 
GUID, which seems a little unintuitive. Is this correct behaviour? I would 
have expected the id attribute of the entry to be None.

Original issue reported on code.google.com by nickmurd...@gmail.com on 5 Aug 2008 at 10:30

Attachments:

GoogleCodeExporter commented 9 years ago
Attached is minimal example of the problem. It contains two items. The one 
without
GUID gets image url of enclosure as id.

Original comment by floridan...@gmail.com on 4 Jun 2010 at 3:00

Attachments:

GoogleCodeExporter commented 9 years ago
This behavior is expected due to the advent of iTunes podcast feeds, 
apparently. There are many unit tests in feedparser's itunes test directories 
that specifically check to make sure that enclosure URLs are mapped to item IDs 
if no ID is present.

I recommend closing this bug.

Original comment by kurtmckee on 4 Dec 2010 at 3:37

GoogleCodeExporter commented 9 years ago
Marked as WontFix since it's actually a feature not a bug. See the following 
for background information:
http://techbrew.net/articles/200709/rss-the-guid-problem/
http://www.xml.com/pub/a/2004/08/18/pilgrim.html
http://diveintomark.org/archives/2002/09/26/rss_20_template

Original comment by adewale on 4 Dec 2010 at 10:49

GoogleCodeExporter commented 9 years ago
Is there then currently some other way to check whether an actual guid is given 
in the feed? If not, then I'm not sure if this should be closed WontFix. As per 
the link you give, people may want to implement a specific strategies of how to 
identify items in case a guid is missing; those strategies may not be 
compatible with using the enclosure as the first fallback.

In fact, I am writing a podcatcher-like application, and this is precisely such 
a case; since the enclosure to me is very much the piece of attached *data*, I 
specifically want to give publishers the ability to change the enclosure url 
without detecting this as a new item. This is one of the reasons why I am 
currently running a patched version of feedparser.

In addition, considering that Atom feeds allow more than one enclosure, and we 
arbitrarily need to choose the first to use as a guid, this policy of using the 
enclosure as a guid by default seems a bit like a kludge to me.

While I personally believe that feedparser itself should not select the 
enclosure as a fallback at all, and instead require the user to deal with a 
missing guid (really only the case for people writing an actual aggregator in 
the first place), if we do chose to do this, I would at least advocate an 
alternative way to get to the raw guid/atom:id.

Original comment by elsdoer...@gmail.com on 5 Dec 2010 at 6:28

GoogleCodeExporter commented 9 years ago
+1 on elsdoerfer's comment above. It should at least be configurable with an 
argument to parse(), since this clearly isn't desirable behaviour in some cases.

Original comment by nickmurd...@gmail.com on 5 Dec 2010 at 8:22

GoogleCodeExporter commented 9 years ago
@elsdoerfer: Well said. You're correct, feedparser has included 
application-specific logic that should be left to developers. It was 
implemented to mimic how iTunes (the application) handles missing guids [1], 
but as you've pointed out, that decision should be left to developers to decide 
for their particular use cases.

I've attached a patch, and included a list of test cases that are no longer 
necessary after the patch is applied. The changes are also available in a git 
branch at:

https://github.com/kurtmckee/feedparser/tree/issue131

@Adewale: I think elsdoerfer has made an excellent point, and I strongly 
recommend reconsidering the decision to close this report.

[1]: http://www.apple.com/itunes/podcasts/specs.html
"If you omit the guid for an episode, the episode url will be used instead."

Tests to remove:

/illformed/itunes/itunes_enclosure_url_maps_id.xml
/illformed/itunes/itunes_enclosure_url_maps_id_2.xml
/wellformed/itunes/itunes_enclosure_url_maps_id.xml
/wellformed/itunes/itunes_enclosure_url_maps_id_2.xml
/wellformed/itunes/itunes_link_enclosure_maps_id.xml
/wellformed/itunes/itunes_link_enclosure_maps_id_2.xml

Original comment by kurtmckee on 6 Dec 2010 at 10:34

Attachments:

GoogleCodeExporter commented 9 years ago
This is minor, but for fwiw, line 17 in the patch can also be removed, since 
the href variable is no longer referenced.

Original comment by elsdoer...@gmail.com on 7 Dec 2010 at 2:27

GoogleCodeExporter commented 9 years ago
Good catch! Git branch updated, new patch file attached. Tested in Python 2.4 - 
2.7.

Original comment by kurtmckee on 7 Dec 2010 at 8:01

Attachments:

GoogleCodeExporter commented 9 years ago
Apple makes little distinction between iTunes the product, iTunes the store and 
the iTunes extensions for podcasting. They even go so far as to tell people 
that the way to verify their feeds are correct is to point iTunes (the product) 
at the feed.

On the one hand this means that the only correct way to implement the iTunes 
extensions is to rigidly follow whatever iTunes (the product) does. On the 
other hand this means attempting to mimic a closed source piece of software 
rather a specification.

I've applied the patch (in revision 332) because I don't think we should be in 
the business of trying to reverse engineer application functionality.

Although if the spec changes in a way that clears this up, I'm open to 
revisiting this issue.

Original comment by adewale on 12 Dec 2010 at 11:49