codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars 2.11k forks

TIPS FOR IMPROVEMENT #978

Open aleksandar-devedzic opened 10 months ago

aleksandar-devedzic commented 10 months ago

I have extracted some meta tags. You can try to identify the title, text, description, and date by substituting the tag names below into these selectors:

meta[property='{}'] meta[name='{}'] meta[itemprop='{}']

Meta tags for publication and modification date:

published_date published_time cXenseParse:publishtime pubdate publish_date PublishDate dcterms.created rnews:datePublished article:published_time prism.publicationDate displaydate OriginalPublicationDate og:published_time datePublished article_date_original article.published published_time_telegram sailthru.date datePublished date Date original-publish-date DC.date.issued dc.date DC.Date parsely-pub-date publishtime publication_date uploadDate coverageEndTime publishdate publish-date publishedAtDate dcterms.date publishedDate creationDateTime pub_date updated_time og:updated_time datemodified last-modified Last-Modified DC.date.modified article:modified_time modified_time modifiedDateTime dc.dcterms.modified lastmod

Meta tags for title:

dc.title og:title headline articletitle article-title parsely-title title

Meta tags for description:

description og:description

Meta tags for body:

articleBody articleText
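A minimal sketch of the lookup described above, assuming only the Python standard library. It does not use newspaper's own code or CSS selectors; instead it collects every `<meta>` element's `property`, `name`, or `itemprop` value and returns the first candidate that matches. `MetaCollector`, `first_match`, and `extract_metadata` are hypothetical names, and the candidate lists are short subsets of the tags listed above.

```python
from html.parser import HTMLParser

# Short subsets of the tag names listed above, in priority order (illustrative).
DATE_TAGS = ["article:published_time", "datePublished", "pubdate", "date"]
TITLE_TAGS = ["og:title", "dc.title", "headline", "title"]
DESCRIPTION_TAGS = ["og:description", "description"]


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by property, name, or itemprop."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if not content:
            return
        for key in ("property", "name", "itemprop"):
            if attr.get(key):
                # Keep the first value seen for each tag name.
                self.values.setdefault(attr[key], content)


def first_match(values, candidates):
    """Return the value of the first candidate tag present, else None."""
    for name in candidates:
        if name in values:
            return values[name]
    return None


def extract_metadata(html):
    """Hypothetical helper: pull title, description, and date from meta tags."""
    collector = MetaCollector()
    collector.feed(html)
    return {
        "title": first_match(collector.values, TITLE_TAGS),
        "description": first_match(collector.values, DESCRIPTION_TAGS),
        "date": first_match(collector.values, DATE_TAGS),
    }
```

The date is returned as the raw string from the page; parsing it into a datetime is left out because the formats vary by site.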

FYI, it would be good if you could fix, improve, or adapt the code so that it extracts full information from the websites below, since they are among the most popular news websites in the world. By "full information" I mean the title, publication date, and article body.

CNN - https://edition.cnn.com/
BBC News - https://www.bbc.com/news
Reuters - https://www.reuters.com/
The New York Times - https://www.nytimes.com/
The Guardian - https://www.theguardian.com/international
Al Jazeera - https://www.aljazeera.com/
Associated Press (AP) News - https://apnews.com/
NBC News - https://www.nbcnews.com/
Fox News - https://www.foxnews.com/
USA Today - https://www.usatoday.com/
ABC News - https://abcnews.go.com/
CBS News - https://www.cbsnews.com/
The Washington Post - https://www.washingtonpost.com/
Time - https://time.com/
Forbes - https://www.forbes.com/
Bloomberg - https://www.bloomberg.com/
The Wall Street Journal - https://www.wsj.com/
The Huffington Post - https://www.huffpost.com/
The Independent - https://www.independent.co.uk/
The Sydney Morning Herald - https://www.smh.com.au/
The Economist - https://www.economist.com/
The Times of India - https://timesofindia.indiatimes.com/
The Daily Mail - https://www.dailymail.co.uk/home/index.html
The Telegraph - https://www.telegraph.co.uk/
The Sun - https://www.thesun.co.uk/
The Mirror - https://www.mirror.co.uk/
The Daily Beast - https://www.thedailybeast.com/
The Atlantic - https://www.theatlantic.com/
National Geographic - https://www.nationalgeographic.com/
Science Daily - https://www.sciencedaily.com/
The Verge - https://www.theverge.com/
Wired - https://www.wired.com/
TechCrunch - https://techcrunch.com/
Engadget - https://www.engadget.com/
Mashable - https://mashable.com/
Forbes India - https://www.forbesindia.com/
Hindustan Times - https://www.hindustantimes.com/
CNN Business - https://www.cnn.com/business
Financial Times - https://www.ft.com/
CNBC - https://www.cnbc.com/
Business Insider - https://www.businessinsider.com/
Politico - https://www.politico.eu/
The Hill - https://thehill.com/
The Washington Times - https://www.washingtontimes.com/
The Boston Globe - https://www.bostonglobe.com/
The LA Times - https://www.latimes.com/
The Chicago Tribune - https://www.chicagotribune.com/
The Globe and Mail - https://www.theglobeandmail.com/
The Toronto Star - https://www.thestar.com/

AndyTheFactory commented 10 months ago

Hi @aleksandar-devedzic, I forked newspaper3k, and your suggestions are implemented in the next version. The code is currently in the work-0.9.2 branch, so you can pull it from there if you need it now; alternatively, you can wait for the release ;)

Here is my fork: https://github.com/AndyTheFactory/newspaper4k

aleksandar-devedzic commented 10 months ago

Oh, thanks. I hope this helped you. All the best.

2dareis2do commented 6 months ago

I had an issue with BBC articles, where the p tags are nested inside divs. Newspaper4k seems to work perfectly after installing typing-extensions, e.g. pip install typing-extensions

Thanks

aleksandar-devedzic commented 6 months ago

One more improvement (about dates): sometimes the publication date appears in the URL, so you can also check that as a last resort.
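A rough sketch of that URL fallback, assuming dates appear as a YYYY/MM/DD or YYYY-MM-DD segment in the URL. `date_from_url` and the pattern are hypothetical and are not taken from newspaper3k or newspaper4k; real sites use other layouts that this would miss.

```python
import re
from datetime import date

# Matches YYYY/MM/DD or YYYY-MM-DD, not embedded in a longer run of digits.
URL_DATE = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url):
    """Return a date found in the URL, or None if none is present or valid."""
    match = URL_DATE.search(url)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

Because it is only a last resort, the function returns None rather than guessing when the digits do not form a valid date.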

2dareis2do commented 6 months ago

On dates: why does a BBC article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

source https://www.bbc.co.uk/news/uk-england-london-68511760