codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.12k stars, 2.11k forks

ENH: parse schema.org/NewsArticle RDFa, Microdata, or JSONLD #448

Open westurner opened 7 years ago

westurner commented 7 years ago

Schema.org Linked Data:

Parsers:

Tasks:

westurner commented 7 years ago

#251 mentions itemProp="datePublished" (Microdata)

The RDFa for this property could be:
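
For illustration only, here is a minimal sketch of how `datePublished` might be read from either syntax using just the standard library. The sample markup, the `DatePublishedParser` class, and the `extract_date_published` helper are assumptions made for this example; they are not part of newspaper and not the markup from the original comment.

```python
from html.parser import HTMLParser


class DatePublishedParser(HTMLParser):
    """Collect datePublished values from Microdata (itemprop) and RDFa (property)."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (syntax, value) tuples in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so itemProp arrives as itemprop.
        attr = dict(attrs)
        for syntax, name in (("microdata", "itemprop"), ("rdfa", "property")):
            # Attribute values may hold several space-separated tokens.
            tokens = (attr.get(name) or "").split()
            # Accept bare, prefixed ("schema:datePublished") and full-IRI terms.
            terms = [t.rsplit("/", 1)[-1].rsplit(":", 1)[-1] for t in tokens]
            if "datePublished" in terms:
                value = attr.get("content") or attr.get("datetime")
                if value:
                    self.found.append((syntax, value))


def extract_date_published(html):
    """Return every datePublished value found in the given HTML string."""
    parser = DatePublishedParser()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical sample markup: one RDFa block and one Microdata block.
SAMPLE = """
<article vocab="https://schema.org/" typeof="NewsArticle">
  <meta property="datePublished" content="2017-10-05T08:00:00Z">
</article>
<div itemscope itemtype="https://schema.org/NewsArticle">
  <time itemProp="datePublished" datetime="2017-10-06">Oct 6</time>
</div>
"""

dates = extract_date_published(SAMPLE)
print(dates)
```

This only looks at the `content` and `datetime` attributes and ignores element text, nesting, and `itemscope` boundaries, so a real parser would need to do considerably more.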

westurner commented 5 years ago

Extruct is the best tool for accomplishing this, IMHO: https://github.com/scrapinghub/extruct

extruct.extract() https://github.com/scrapinghub/extruct/blob/master/extruct/_extruct.py

https://github.com/RDFLib/rdflib/issues/770#issuecomment-433655142
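
To make the JSON-LD case concrete, the sketch below pulls `NewsArticle` nodes out of `<script type="application/ld+json">` blocks with the standard library alone. It is a rough stand-in for what a dedicated library such as extruct handles far more thoroughly; `JsonLdParser`, `iter_nodes`, `find_news_articles`, and the sample page are hypothetical names and data invented for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of application/ld+json script blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def iter_nodes(doc):
    """Yield every dict in a JSON-LD document, including @graph members."""
    if isinstance(doc, list):
        for item in doc:
            yield from iter_nodes(item)
    elif isinstance(doc, dict):
        yield doc
        yield from iter_nodes(doc.get("@graph", []))


def find_news_articles(html):
    """Return all JSON-LD nodes whose @type includes NewsArticle."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    articles = []
    for doc in parser.documents:
        for node in iter_nodes(doc):
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "NewsArticle" in types:
                articles.append(node)
    return articles


# Hypothetical sample page with a single JSON-LD block.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Example headline", "datePublished": "2017-10-05T08:00:00Z"}
</script>
</head><body></body></html>
"""

articles = find_news_articles(SAMPLE_PAGE)
print(articles[0]["headline"], articles[0]["datePublished"])
```

The sketch matches `@type` by plain string comparison and does not resolve `@context`, so prefixed or full-IRI types would be missed.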

simonm3 commented 5 years ago

Great package, but I'm wondering why schema.org parsing is not included, as most newspaper and media sites seem to use it. Are there any plans to add this?