disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

ns.py update cannot handle 404 #70

Closed andreawwenyi closed 4 years ago

andreawwenyi commented 4 years ago

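The dcard update spider can handle 404s. However, when `ns.py update` uses Twisted to run all sites together, a site that returns 404 seems to fail directly at the Twisted connection, so the error cannot be handled properly. In addition, some pages that are not 404 also produce the error below.

dcard 404 page: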

2020-02-16 17:55:02 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.dcard.tw/_api/posts/232906257/>
Traceback (most recent call last):
  File "/Users/wyw/.local/share/virtualenvs/NewsScraping-I6zyEuYv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

A page that is not broken:

2020-02-16 18:02:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.coco01.today/post/1173075>
Traceback (most recent call last):
  File "/Users/wyw/.local/share/virtualenvs/NewsScraping-I6zyEuYv/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
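
Both tracebacks end in the same `ResponseNeverReceived` / `ConnectionLost` failure, so no HTTP response (404 or otherwise) ever reaches the spider callback. One possible way to see these cases is an errback that separates connection-level failures from HTTP errors. The following is only a minimal sketch, not ZeroScraper's code: `classify_download_error` and `handle_error` are hypothetical names, and exceptions are matched by class name so the sketch runs without Twisted or Scrapy installed.

```python
# Hypothetical helpers for illustration; not part of ZeroScraper.

# Class names of connection-level failures seen in the tracebacks above.
CONNECTION_ERRORS = {"ResponseNeverReceived", "ResponseFailed", "ConnectionLost"}


def classify_download_error(exc):
    """Return 'http_error', 'connection_error', or 'other' for an exception.

    Matching is done on class names found in the exception's MRO, which is an
    assumption made here so that the sketch has no third-party imports.
    """
    names = {cls.__name__ for cls in type(exc).__mro__}
    if "HttpError" in names:  # e.g. a non-2xx response such as 404
        return "http_error"
    if names & CONNECTION_ERRORS:  # no response was received at all
        return "connection_error"
    return "other"


def handle_error(failure):
    """Errback sketch: summarize a failed request as a small dict.

    `failure` is expected to expose `.value` (the exception) and, when called
    by Scrapy, `.request` (the request that failed).
    """
    request = getattr(failure, "request", None)
    return {
        "url": getattr(request, "url", None),
        "kind": classify_download_error(failure.value),
    }
```

In a spider, a function like `handle_error` could be passed as `errback=` to `scrapy.Request`. When Twisted is importable, `failure.check(...)` with the real exception classes is the more usual way to test the failure type than name matching.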

Hopefully this can

andreawwenyi commented 4 years ago

Note: This issue is related to #57

andreawwenyi commented 4 years ago

Opened a new related issue: #74.