Open kartikparnami opened 8 years ago
@kartikparnami good find. Are you sure it's supposed to be "title" and it's not a mistake of this specific blog?
https://schema.org/datePublished says the attribute should be "content", other sources refer to "datetime", but I couldn't find any other examples with "title"...
Well, I don't know how widespread this issue is or whether it's specific to this blog. But I feel the addition simply increases our coverage of such cases. Let me know your thoughts.
Similar to #151
Doesn't solve this problem in particular, but this seems to be an isolated case of out-of-spec metadata.
I am not able to parse the date for articles from many domains, such as marketscreener.com, contagionlive and many others. Diffbot can, but it is paid.
-> Newspaper is unable to parse the date for the URL: http://pratyushsharma.blogspot.in/2016/03/jindagi-mauth-na-ban-jaye-samhalo-yaaron.html
-> In the page source, the publishing date can be seen in the line:
<abbr class='published' itemprop='datePublished' title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>
-> Newspaper is able to reach this DOM element by matching it with the rule in line 203 in extractors.py.
-> But the content attribute does not match; the date is held in a different attribute, title.
-> We need to add an entry that reads the title attribute to PUBLISH_DATE_TAGS to get this read properly.
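
To illustrate the idea, here is a minimal standalone sketch of a rule table that also looks at the title attribute. It is not Newspaper's actual code: DATE_RULES, DateFinder and find_publish_date are hypothetical names, the rule layout (attribute / value / content) is an assumption modelled on the description above, and the standard-library HTML parser stands in for whatever Newspaper uses internally.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical rule table (an assumption, not copied from extractors.py):
# find an element whose `attribute` equals `value`, then read the date
# from the attribute named by `content`.
DATE_RULES = [
    {"attribute": "itemprop", "value": "datePublished", "content": "content"},
    {"attribute": "itemprop", "value": "datePublished", "content": "datetime"},
    # Proposed addition: also accept the date in `title`.
    {"attribute": "itemprop", "value": "datePublished", "content": "title"},
]


class DateFinder(HTMLParser):
    """Stops at the first element that satisfies a rule and holds a parseable date."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.found = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attr_map = dict(attrs)
        for rule in self.rules:
            if attr_map.get(rule["attribute"]) != rule["value"]:
                continue
            raw = attr_map.get(rule["content"])
            if not raw:
                continue
            try:
                # ISO 8601 with a UTC offset, as in the blog's markup.
                self.found = datetime.fromisoformat(raw)
                return
            except ValueError:
                continue


def find_publish_date(html, rules=DATE_RULES):
    finder = DateFinder(rules)
    finder.feed(html)
    return finder.found


snippet = (
    "<abbr class='published' itemprop='datePublished' "
    "title='2016-03-31T03:55:00-07:00'>3:55 AM</abbr>"
)
print(find_publish_date(snippet))
```

With only the first two rules, the same snippet yields no date, which matches the behaviour reported here; the third rule is what picks up the value in title.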