codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

TIPS FOR FAST IMPROVEMENT #969

Open aleksandar-devedzic opened 1 year ago

aleksandar-devedzic commented 1 year ago

I have extracted some meta tags. You can try to identify the title, text, description, and date by substituting each of the tags listed below into these selectors:

meta[property='{}']
meta[name='{}']
meta[itemprop='{}']
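To make the substitution concrete, here is a minimal sketch of how the three selector templates could be expanded for a list of tag names. The names `SELECTOR_TEMPLATES` and `build_selectors` are made up for illustration and are not part of newspaper3k.

```python
# Hypothetical helper, not newspaper3k code: expand the selector templates.
SELECTOR_TEMPLATES = (
    "meta[property='{}']",
    "meta[name='{}']",
    "meta[itemprop='{}']",
)


def build_selectors(tag_names):
    """Fill every template with every tag name, keeping the tag order."""
    return [
        template.format(name)
        for name in tag_names
        for template in SELECTOR_TEMPLATES
    ]


# Two of the title tags from this issue, used as sample input.
selectors = build_selectors(["og:title", "headline"])
for selector in selectors:
    print(selector)
```

Earlier names in the list are tried first, so the order of the tag list acts as a priority order.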

Meta tags for publication and modification date:

published_date published_time cXenseParse:publishtime pubdate publish_date PublishDate dcterms.created rnews:datePublished article:published_time prism.publicationDate displaydate OriginalPublicationDate og:published_time datePublished article_date_original article.published published_time_telegram sailthru.date datePublished date Date original-publish-date DC.date.issued dc.date DC.Date parsely-pub-date publishtime publication_date uploadDate coverageEndTime publishdate publish-date publishedAtDate dcterms.date publishedDate creationDateTime pub_date updated_time og:updated_time datemodified last-modified Last-Modified DC.date.modified article:modified_time modified_time modifiedDateTime dc.dcterms.modified lastmod

Meta tags for title:

dc.title og:title headline articletitle article-title parsely-title title

Meta tags for description:

description og:description

Meta tags for body:

articleBody articleText
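As a rough illustration of how the tag lists above might be used, the sketch below collects every `<meta>` element that has a `content` attribute and returns the first match per field. It relies only on the standard library `html.parser` rather than newspaper3k's own parser. The class and function names (`MetaCollector`, `first_match`, `extract_meta`), the shortened date list, and the sample HTML are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tag names copied from the lists above; DATE_TAGS is only a small subset.
TITLE_TAGS = ["dc.title", "og:title", "headline", "articletitle",
              "article-title", "parsely-title", "title"]
DATE_TAGS = ["article:published_time", "datePublished", "pubdate", "date"]
DESCRIPTION_TAGS = ["description", "og:description"]


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags keyed by property/name/itemprop."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if not content:
            return
        for key in ("property", "name", "itemprop"):
            name = attributes.get(key)
            if name:
                # Keep the first value seen for each tag name.
                self.found.setdefault(name, content.strip())


def first_match(found, candidates):
    """Return the value of the first candidate tag present, else None."""
    for name in candidates:
        if name in found:
            return found[name]
    return None


def extract_meta(html):
    """Hypothetical helper: pick title, date and description from meta tags."""
    collector = MetaCollector()
    collector.feed(html)
    return {
        "title": first_match(collector.found, TITLE_TAGS),
        "date": first_match(collector.found, DATE_TAGS),
        "description": first_match(collector.found, DESCRIPTION_TAGS),
    }


# Made-up sample page, only for demonstration.
sample = """
<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="Short summary.">
<meta property="article:published_time" content="2023-01-15T08:00:00Z">
</head><body></body></html>
"""
result = extract_meta(sample)
print(result)
```

Matching is exact and case-sensitive on the tag name, which is why the lists above carry variants such as date and Date. The date is returned as the raw string; parsing it into a datetime would be a separate step.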

FYI: it would be good if you could fix, improve, or adapt the code so that it extracts full information from the websites below, since they are among the most popular news websites in the world. By "full information" I mean the title, publication date, and article body.

CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/