disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

update parallelism issue: many connection errors #74

Open andreawwenyi opened 4 years ago

andreawwenyi commented 4 years ago

Running python3 execute_spider.py --update repeatedly produces errors like the one below when loading active URLs. When running update on only one site (for ETtoday, the site in the example below, that would be python3 site.py update 102), we don't see this error and the article is updated as expected. We therefore believe the problem is related to the CrawlerProcess used in execute_spider.py.

2020-02-18 01:46:59 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.ettoday.net/news/20191209/1598206.htm>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

Reference:

andreawwenyi commented 4 years ago

This issue is related to #70. ns.py (which uses CrawlerRunner instead of CrawlerProcess) behaves in a similar way. It was previously thought that CrawlerRunner threw twisted.web._newclient.ResponseNeverReceived when a dcard post returned 404. However, after much testing and reading the logs, I've found that many of the dcard posts that result in this error are actually active.
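To make that kind of log reading repeatable, here is a minimal sketch that pulls the failing URLs out of a log and tallies them by host, so they can be re-checked by hand. It assumes the log lines look like the "Error downloading <GET ...>" lines quoted in this thread; the names failed_urls, failures_by_host and sample_log are hypothetical and not part of ZeroScraper.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed log format, based on the lines quoted in this thread:
#   ... ERROR: Error downloading <GET https://example.com/path>
ERROR_LINE = re.compile(r"Error downloading <GET (?P<url>[^>\s]+)>")


def failed_urls(log_text):
    """Return every URL that appears in an 'Error downloading' log line."""
    return [match.group("url") for match in ERROR_LINE.finditer(log_text)]


def failures_by_host(log_text):
    """Count failed downloads per host to see whether errors cluster on a few sites."""
    return Counter(urlsplit(url).netloc for url in failed_urls(log_text))


# Hypothetical sample built from the error lines quoted in this issue.
sample_log = """\
2020-02-18 01:46:59 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.ettoday.net/news/20191209/1598206.htm>
Error downloading <GET http://www.itaiwannews.cn/>
Error downloading <GET http://www.whb.cn/>
"""

if __name__ == "__main__":
    for host, count in failures_by_host(sample_log).most_common():
        print(host, count)
```

The resulting URL list can then be opened manually, or fetched one at a time outside the crawler, to confirm whether the pages are really active.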

pm5 commented 4 years ago

It has also been happening a lot recently on the following sites during discover:

Error downloading <GET http://www.itaiwannews.cn/>
Error downloading <GET http://www.tailian.org.cn/>
Error downloading <GET http://www.whb.cn/>

Since it seems to happen only on a subset of the sites, I would guess the problem is more on the sites' end. I've set delay = 2 for these sites for now; let's see if things improve. Also, I'm not sure why RetryMiddleware is not working here. Or is it?
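For reference, a minimal sketch of what the throttling might look like as plain Scrapy settings. It assumes the per-site delay value ends up as Scrapy's DOWNLOAD_DELAY, which this thread does not confirm; the helper throttled_settings is hypothetical. The setting names are standard Scrapy settings, and Scrapy's default for RETRY_TIMES is 2. As far as I know, the "Error downloading" line is only logged once RetryMiddleware has given up, so the middleware may well be running and simply exhausting its retries.

```python
def throttled_settings(delay_seconds=2.0, retry_times=5):
    """Build a settings dict for a slow or flaky site.

    Hypothetical helper: the returned dict is meant to be used as a spider's
    `custom_settings` or merged into the project settings.
    """
    return {
        # Wait between consecutive requests to the same site.
        "DOWNLOAD_DELAY": delay_seconds,
        # Send one request at a time per domain (Scrapy's default is 8).
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        # Keep RetryMiddleware on and allow more attempts than the default 2.
        "RETRY_ENABLED": True,
        "RETRY_TIMES": retry_times,
    }


if __name__ == "__main__":
    print(throttled_settings())
```

Raising the log level for scrapy.downloadermiddlewares.retry to DEBUG should show whether requests are being retried before the final error.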