jtara1 closed 3 years ago
From the code: URLs containing these path chunks or domains cannot be scraped.
BAD_CHUNKS = ['careers', 'contact', 'about', 'faq', 'terms', 'privacy', 'advert', 'preferences', 'feedback', 'info', 'browse', 'howto', 'account', 'subscribe', 'donate', 'shop', 'admin']
BAD_DOMAINS = ['amazon', 'doubleclick', 'twitter']
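To illustrate how lists like these could be applied, here is a minimal sketch. It is not Newspaper's actual implementation: the function name `is_filtered` and the exact matching rules (whole path segments, whole hostname labels) are assumptions made for the example.

```python
from urllib.parse import urlparse

BAD_CHUNKS = ['careers', 'contact', 'about', 'faq', 'terms', 'privacy',
              'advert', 'preferences', 'feedback', 'info', 'browse', 'howto',
              'account', 'subscribe', 'donate', 'shop', 'admin']
BAD_DOMAINS = ['amazon', 'doubleclick', 'twitter']


def is_filtered(url: str) -> bool:
    """Hypothetical helper: return True if the URL would be skipped.

    Assumption: a URL is skipped when any hostname label equals an entry
    in BAD_DOMAINS, or any path segment equals an entry in BAD_CHUNKS.
    """
    parsed = urlparse(url)

    # Split "www.amazon.com" into ["www", "amazon", "com"] and compare labels.
    host_labels = (parsed.hostname or "").lower().split(".")
    if any(label in BAD_DOMAINS for label in host_labels):
        return True

    # Split "/about/team" into ["about", "team"], ignoring empty segments.
    path_chunks = [chunk.lower() for chunk in parsed.path.split("/") if chunk]
    return any(chunk in BAD_CHUNKS for chunk in path_chunks)
```

Under these assumptions, `is_filtered("https://example.com/about/team")` is true because `about` is a whole path segment, while a path such as `/information/x` is not filtered because `information` is not an exact match for `info`.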
I'm surprised that I haven't found a news site this can't scrape. It's met my needs anyway.
Yes, Newspaper has a lot of flexibility. I recently started putting together a detailed Newspaper3k usage document that I publicly share. The document is available here: https://github.com/johnbumgarner/newspaper3_usage_overview. Please let me know if you see anything that is missing or needs more clarification.
You can't scrape JavaScript-rendered pages.
Which JavaScript-rendered site are you trying to scrape?
Which domains or webpages can be scraped?