codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.07k stars 2.11k forks

Is there a list of supported sites? #840

Closed: jtara1 closed this issue 3 years ago

jtara1 commented 4 years ago

Which domains or webpages can be scraped?

johnbumgarner commented 3 years ago

From the code: URLs that match the following items are filtered out and cannot be scraped.

BAD_CHUNKS = ['careers', 'contact', 'about', 'faq', 'terms', 'privacy', 'advert', 'preferences', 'feedback', 'info', 'browse', 'howto', 'account', 'subscribe', 'donate', 'shop', 'admin']

BAD_DOMAINS = ['amazon', 'doubleclick', 'twitter']
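To make the effect of these two lists concrete, here is a minimal sketch of how a URL could be checked against them. It is not the library's own validation code: `looks_filtered` is a hypothetical helper, and the matching rules (whole path segments against BAD_CHUNKS, dot-separated host labels against BAD_DOMAINS) are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Copied from the lists quoted above.
BAD_CHUNKS = ['careers', 'contact', 'about', 'faq', 'terms', 'privacy',
              'advert', 'preferences', 'feedback', 'info', 'browse', 'howto',
              'account', 'subscribe', 'donate', 'shop', 'admin']
BAD_DOMAINS = ['amazon', 'doubleclick', 'twitter']


def looks_filtered(url):
    """Return True if the URL matches one of the quoted block lists.

    Hypothetical helper for illustration; not part of newspaper's API.
    """
    parsed = urlparse(url)
    # Assumption: compare each dot-separated host label against BAD_DOMAINS.
    host_labels = (parsed.hostname or '').lower().split('.')
    if any(label in BAD_DOMAINS for label in host_labels):
        return True
    # Assumption: compare whole path segments against BAD_CHUNKS.
    path_chunks = [chunk.lower() for chunk in parsed.path.split('/') if chunk]
    return any(chunk in BAD_CHUNKS for chunk in path_chunks)
```

Under these assumptions, `looks_filtered('https://example.com/about/team')` is true because of the `about` segment, while an ordinary article path such as `https://example.com/news/2021/story.html` passes.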

jtara1 commented 3 years ago

I'm surprised that I haven't yet found a news site this can't scrape. It's met my needs anyway.

johnbumgarner commented 3 years ago

Yes, Newspaper has a lot of flexibility. I recently started putting together a detailed Newspaper3k usage document, which I share publicly. The document is available here: https://github.com/johnbumgarner/newspaper3_usage_overview. Please let me know if you see anything that is missing or needs more clarification.

ahadafzal commented 3 years ago

You can't scrape JavaScript-rendered pages.
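The limitation arises because the library parses the HTML the server returns and does not execute JavaScript. A rough way to spot such pages is to check whether the static HTML carries scripts but almost no visible text. The sketch below uses only the standard library; `probably_js_rendered` and its 200-character threshold are hypothetical choices for illustration, and this is a heuristic rather than a reliable test.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.script_tags += 1
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def probably_js_rendered(html, min_visible_chars=200):
    """Heuristic: scripts present but very little visible text.

    The threshold is an arbitrary assumption; tune it for your sites.
    """
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

A page that consists of an empty container element plus a script bundle would be flagged, while a page that ships its article text in the initial HTML would not.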

johnbumgarner commented 3 years ago

You can't scrape JavaScript-rendered pages.

Which JavaScript-rendered site are you trying to scrape?