HaveF / feedparser

Automatically exported from code.google.com/p/feedparser

Missing item.title in a feed #400

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Was testing against http://admin.taiwan.net.tw/rssB_en.aspx
2. feedparser.parse(url)

>>> f['items'][0].title
u'Tourism Bureau,M.O.T.C. Republic of China(Taiwan)'

The RSS xml:
<item><title>Tourism Bureau Launches New Taiwan Tourism Brand</title>

What is the expected output? What do you see instead?

The title should be correct.

What version of the product are you using? On what operating system?

Please provide any additional information below.

Original issue reported on code.google.com by nora.ols...@gmail.com on 10 May 2013 at 1:54

GoogleCodeExporter commented 9 years ago
Looking at the documentation, it seems to be picking up this element instead:
<dc:title>Tourism Bureau,M.O.T.C. Republic of China(Taiwan)</dc:title>

Original comment by nora.ols...@gmail.com on 10 May 2013 at 1:57

GoogleCodeExporter commented 9 years ago
Unfortunately, this is proper behavior for feedparser; the site is using the 
dc:title tag incorrectly. It is not possible to choose programmatically which 
title is the "correct" one. I recommend contacting the site and informing them 
that they are using dc:title incorrectly. The email address I found after 
browsing the English version of their site is:

tbroc@tbroc.gov.tw

In case Google Code scrubs out that email address, it is:

tbroc at tbroc dot gov dot tw

Original comment by kurtmckee on 10 May 2013 at 9:20

GoogleCodeExporter commented 9 years ago

I kinda suspected that it wasn't correct. Do you have any good reference to 
read up on the Dublin Core?

I found this reference:
http://web.resource.org/rss/1.0/modules/dcterms/

"""The RSS 1.0 <title> element is a sub-property of <dc:title> and as a result 
if <dc:title> is not present it can be assumed that <title> is also the 
<dc:title> for the element."""

Looking at the code, I don't think I can access the underlying XML DOM object. Is that right?

BTW, Google Reader works fine though. 

Original comment by nora.ols...@gmail.com on 10 May 2013 at 4:25

GoogleCodeExporter commented 9 years ago
http://dublincore.org/documents/dces/

That describes the Dublin Core Element Set version 1.1. The dcterms namespace 
has more detailed descriptions about the elements, value ranges, and so forth. 
The dc namespace is generally described by the link above, however.

feedparser uses a SAX parser and it does not internally build a DOM interface.
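
Since no DOM is exposed, one possible workaround is to parse the raw feed text separately and read the plain <title> directly. The following is only a rough sketch using Python's standard xml.etree.ElementTree, not anything feedparser provides. The sample feed, the item_titles helper, and the fallback to dc:title when <title> is missing are all assumptions made for illustration; the real feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core Element Set 1.1 (the "dc" prefix).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample modeled on the feed in this report; not the real feed.
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Tourism Bureau,M.O.T.C. Republic of China(Taiwan)</title>
    <item>
      <title>Tourism Bureau Launches New Taiwan Tourism Brand</title>
      <dc:title>Tourism Bureau,M.O.T.C. Republic of China(Taiwan)</dc:title>
    </item>
    <item>
      <dc:title>Only a Dublin Core title</dc:title>
    </item>
  </channel>
</rss>"""


def item_titles(xml_text):
    """Return one title per <item>, preferring the plain <title> element.

    Assumption: when <title> is absent or empty, fall back to <dc:title>.
    """
    root = ET.fromstring(xml_text)
    titles = []
    for item in root.iter("item"):
        # "title" without a namespace matches only the plain RSS element.
        title = item.findtext("title")
        if not title:
            title = item.findtext("{%s}title" % DC_NS)
        titles.append(title)
    return titles


titles = item_titles(SAMPLE_RSS)
for title in titles:
    print(title)
```

This runs alongside feedparser rather than replacing it: the caller would fetch the feed text once, pass it to feedparser.parse for everything else, and use the helper only for item titles.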

I don't know how Google Reader does their parsing, but when they shut down the 
service on July 1st perhaps they'll open source the parser so that others can 
learn how it works. As it stands, however, I can only conjecture that one of 
the ways it works is by choosing not to support the Dublin Core elements.

Original comment by kurtmckee on 11 May 2013 at 3:10