codelucas / newspaper

newspaper3k: news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.1k stars 2.11k forks

Not able to parse publish date. #234

Open kartikparnami opened 8 years ago

kartikparnami commented 8 years ago

Newspaper is unable to parse the publish date for this URL: http://pratyushsharma.blogspot.in/2016/03/jindagi-mauth-na-ban-jaye-samhalo-yaaron.html

In the page source, the publish date appears in this element:

<abbr class='published' itemprop='datePublished' title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>

Newspaper gets as far as this DOM element by matching it with

{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'}

at line 203 of extractors.py. However, the element has no datetime attribute, so no value is read; the date is stored in the title attribute instead. We need to add

{'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'}

to PUBLISH_DATE_TAGS so that this date is read properly.
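
To illustrate the proposal, here is a minimal standalone sketch of rule-based date lookup. It does not use newspaper's actual code: the rule list below only mirrors the two dicts quoted above, and `find_publish_date` and `_AttrCollector` are hypothetical names made up for this example. It also assumes the attribute value is ISO 8601, which holds for the blog page in question but not necessarily elsewhere.

```python
from datetime import datetime
from html.parser import HTMLParser

# Illustrative rule list modelled on the dicts quoted in this issue.
# This is NOT newspaper's real PUBLISH_DATE_TAGS table.
PUBLISH_DATE_TAGS = [
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'datetime'},
    # Proposed addition: read the date from the title attribute.
    {'attribute': 'itemprop', 'value': 'datePublished', 'content': 'title'},
]


class _AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append(dict(attrs))


def find_publish_date(html, rules=PUBLISH_DATE_TAGS):
    """Return the first date matched by the rules, or None."""
    collector = _AttrCollector()
    collector.feed(html)
    for rule in rules:
        for attrs in collector.elements:
            if attrs.get(rule['attribute']) != rule['value']:
                continue
            raw = attrs.get(rule['content'])
            if not raw:
                continue  # element matched, but the named attribute is missing
            try:
                return datetime.fromisoformat(raw)
            except ValueError:
                continue  # not an ISO 8601 value; try the next candidate
    return None


snippet = ("<abbr class='published' itemprop='datePublished' "
           "title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>")

print(find_publish_date(snippet))                              # both rules
print(find_publish_date(snippet, rules=PUBLISH_DATE_TAGS[:1]))  # datetime rule only
```

With only the `datetime` rule the element matches but yields no value, which is the behaviour described above; the extra `title` rule is what lets the date through in this sketch.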

yprez commented 8 years ago

@kartikparnami good find. Are you sure it's supposed to be "title" and it's not a mistake of this specific blog?

https://schema.org/datePublished says the attribute should be "content", other sources refer to "datetime", but I couldn't find any other examples with "title"...

kartikparnami commented 8 years ago

Well, I don't know how widespread this issue is or whether it's specific to this blog. But I feel the addition simply increases the number of cases we cover. Let me know your thoughts.

yprez commented 8 years ago

Similar to #151

mamoit commented 7 years ago

#402 changes the behaviour to follow the schema of datePublished.

It doesn't solve this problem in particular, but this seems to be an isolated case of out-of-spec metadata.

saqibaliXIQ commented 5 years ago

Newspaper is not able to parse the publish date for articles from many domains, such as marketscreener.com, contagionlive, and many others. Diffbot can, but it's a paid service.