AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
429 stars · 37 forks

After downloading a few hundred articles it mass fails #546

Open AndyTheFactory opened 10 months ago

AndyTheFactory commented 10 months ago

Issue by steeljardas, Thu Dec 30 00:43:41 2021. Originally opened as https://github.com/codelucas/newspaper/issues/927


So I am using newspaper3k to mass-download articles while scraping Google. I noticed that after a couple of hours of downloading hundreds of different articles, it continuously gives me an error on article.parse() because the article was not downloaded. From that point onwards this happens for every single URL until I wait for a little bit; if I restart the scraping after waiting 5-10 minutes, it works again.

What could be the issue?

AndyTheFactory commented 10 months ago

Comment by banagale Thu Dec 30 00:54:57 2021


Probably Google or an intermediary is temporarily banning the IP.

AndyTheFactory commented 10 months ago

Comment by steeljardas Thu Dec 30 02:00:40 2021


> Probably Google or an intermediary is temporarily banning the IP.

Google isn't banning it, because I'm still getting the links from Google; however, newspaper isn't able to download them, or at least not able to parse them, since that's what triggers the errors. (In fact, I usually check for H2 tags before parsing, and it actually manages to get them, but once I try parsing, it triggers errors.)
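For what it's worth, a minimal check that surfaces why parse() has nothing to work with, assuming newspaper3k's download-state attributes (`download_state` and `download_exception_msg` live in newspaper3k's article module; the URL is a placeholder):

```python
from newspaper import Article
from newspaper.article import ArticleDownloadState

url = "https://example.com/some-article"  # placeholder

article = Article(url)
article.download()
if article.download_state == ArticleDownloadState.SUCCESS:
    article.parse()
else:
    # download_exception_msg holds the underlying error text, if any
    print(f"download failed for {url}: {article.download_exception_msg}")
```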

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Thu Dec 30 03:43:40 2021


There could be several problems. Can you share your code?

AndyTheFactory commented 10 months ago

Comment by steeljardas Thu Dec 30 12:04:37 2021


> There could be several problems. Can you share your code?

Here: https://pastebin.com/uAH8Mx2s

It's a bit messy, but essentially it googles the keyword, grabs the 10 links, goes into each of them, and downloads them using newspaper.

Then it uses Beautiful Soup to grab the H2; however, when the issue I mention in the OP happens, it keeps erroring out on article.parse().
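A hypothetical reconstruction of that flow (the query, headers, and link selection are illustrative, not taken from the pastebin):

```python
import requests
from bs4 import BeautifulSoup
from newspaper import Article

query = "example keyword"  # placeholder
resp = requests.get(
    f"https://google.com/search?q={query}",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("http")][:10]

for url in links:
    article = Article(url)
    article.download()
    article.parse()  # the call that starts failing after a few hours
    page = BeautifulSoup(article.html, "html.parser")
    h2s = [h2.get_text(strip=True) for h2 in page.find_all("h2")]
```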

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Thu Dec 30 16:09:18 2021


So based on your code, you are querying Google via search -- https://google.com/search?q={query2}

This methodology will throw errors with both Newspaper3k and BeautifulSoup. I would recommend adding some error handling to your code.

Here is my Stack Overflow answer on error handling with Newspaper3k.

https://stackoverflow.com/questions/69728117/newspaper3k-filter-out-bad-url-while-extracting/69729136#69729136

Take a look at this for handling soup errors:

https://www.tutorialspoint.com/beautiful_soup/beautiful_soup_trouble_shooting.htm

I would also recommend adding a random sleep:

```python
from time import sleep
from random import randint

# this sleep timer helps with some timeout issues
# that were happening when querying
sleep(randint(1, 5))
```

Please let me know if you need any additional support.

P.S. Break your code into functions.
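Putting the error handling and the random sleep together, a minimal sketch (the `fetch_article` function and its structure are illustrative, not from the original pastebin):

```python
from random import randint
from time import sleep

from newspaper import Article
from newspaper.article import ArticleException

def fetch_article(url):
    """Download and parse one URL, returning None on failure."""
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article
    except ArticleException as e:
        print(f"skipping {url}: {e}")
        return None
    finally:
        # random delay between requests, per the sleep suggestion above
        sleep(randint(1, 5))
```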

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Thu Dec 30 16:15:15 2021


Also take a look at my NewsPaper3k Usage Document.

I will look at adding a search example to my NewsHound project, which should be released in the coming weeks. I'm waiting on @banagale to finish his tests before the code is released 😊

AndyTheFactory commented 10 months ago

Comment by steeljardas Thu Dec 30 23:42:47 2021


> So based on your code, you are querying Google via search -- https://google.com/search?q={query2}
>
> This methodology will throw errors with both Newspaper3k and BeautifulSoup. I would recommend adding some error handling to your code. […]

Yeah, I recently changed it to handle the errors that way. I've also been getting this one often:

[WinError 3] The system cannot find the path specified: 'C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources'

This happens after a few hours of nonstop scraping/downloading articles, and from that point onwards every single link gets this error for some reason, until I stop the program and run it again.
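One possible workaround, assuming something on the system (e.g. a temp cleaner) removes newspaper's scratch directory mid-run; the path below simply mirrors the one in the error message:

```python
import os
import tempfile

# Recreate newspaper's scratch directory before each batch of downloads,
# in case it was deleted while the scraper was running.
scraper_dir = os.path.join(tempfile.gettempdir(),
                           ".newspaper_scraper", "article_resources")
os.makedirs(scraper_dir, exist_ok=True)
```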

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Thu Dec 30 23:50:44 2021


This path, C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources, is used for storing content and for garbage collection. I'm going to assume that the resource becomes unavailable for some reason.

Have you tried increasing the size of your temp directory?

AndyTheFactory commented 10 months ago

Comment by steeljardas Thu Dec 30 23:55:31 2021


> This path, C:\Users\STEELH~1\AppData\Local\Temp\.newspaper_scraper\article_resources, is used for storing content and for garbage collection. I'm going to assume that the resource becomes unavailable for some reason.
>
> Have you tried increasing the size of your temp directory?

It shouldn't have a limit aside from the actual SSD capacity (which still has plenty of space left), hence I'm not sure why it's happening.

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Thu Dec 30 23:57:23 2021


Can you post your current code to pastebin so I can look at it again?

AndyTheFactory commented 10 months ago

Comment by steeljardas Fri Dec 31 00:32:50 2021


> Can you post your current code to pastebin so I can look at it again?

It's the same as I posted above, except the `except` clause:

```python
try:
    article.parse()
except (newspaper.article.ArticleException, OSError) as e:
    print(e)
```

Everything else is exactly the same. (Also, you mentioned my scraping Google with the query thing, but I'm doing that with requests, not with newspaper; I use it to grab the website links, and those links are the ones I download with newspaper3k afterwards.)
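A sketch that extends that except clause to recover from the OSError above (the directory-recreation step is an assumption tied to the WinError 3 path, and `safe_parse` is a hypothetical helper):

```python
import os
import tempfile

from newspaper.article import ArticleException

def safe_parse(article, retries=1):
    """Parse an already-downloaded article, retrying once after an OSError."""
    for _ in range(retries + 1):
        try:
            article.parse()
            return True
        except OSError:
            # recreate the scratch directory named in the WinError 3 message
            os.makedirs(os.path.join(tempfile.gettempdir(),
                                     ".newspaper_scraper",
                                     "article_resources"),
                        exist_ok=True)
        except ArticleException as e:
            print(e)
            return False
    return False
```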

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Fri Dec 31 15:44:53 2021


Your code is very hard to read. I would recommend breaking it into at least three functions, which will help both of us troubleshoot. If you open a question on Stack Overflow, I will help you debug the code further.

Also, what is your use case for scraping Google for keywords and extracting content?

AndyTheFactory commented 10 months ago

Comment by tsoukanas Fri Apr 1 14:05:14 2022


No news from you, johnbumgarner, concerning the NewsHound project since then! I would be glad to contribute to the code as soon as you release it. Take care, cheers!

AndyTheFactory commented 10 months ago

Comment by johnbumgarner Tue Apr 19 16:40:16 2022


@tsoukanas the initial BETA release of the project is almost done. I'm currently trying to figure out how to improve the extraction speed, which seems slow. I'm also the only one writing and testing the code, so ironing out the bugs takes time.

BTW, I have already written all the documentation for the BETA release. One feature that I won't be adding is any NLP functionality, unless a reason to add it emerges later.