disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Stop updating snapshot for articles with too many 404 #75

Closed by pm5 4 years ago

pm5 commented 4 years ago

Some URLs return HTTP 404 errors (and, in one case below, 403):

2020-02-18 02:30:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.ptt.cc/bbs/HatePolitics/M.1579175212.A.E0F.html>: HTTP status code is not handled or not allowed
2020-02-18 02:30:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.lookerpets.com/post12214491091063>: HTTP status code is not handled or not allowed
2020-02-18 02:30:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.readthis.one/post11207421092250>: HTTP status code is not handled or not allowed
2020-02-18 02:30:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.taiwan.cn/plzhx/wyrt/201912/t20191226_12228090.htm>: HTTP status code is not handled or not allowed
2020-02-18 02:30:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.dcard.tw/_api/posts/232973648/>: HTTP status code is not handled or not allowed
2020-02-18 02:30:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.ptt.cc/bbs/Gossiping/M.1579322008.A.66B.html>: HTTP status code is not handled or not allowed
2020-02-18 02:30:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.ptt.cc/bbs/HatePolitics/M.1579177593.A.EF7.html>: HTTP status code is not handled or not allowed
2020-02-18 02:30:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.readthis.one/post11088161096342>: HTTP status code is not handled or not allowed
2020-02-18 02:30:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.cna.com.tw/news/firstnews/201912100147.aspx>: HTTP status code is not handled or not allowed
2020-02-18 02:30:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.taiwan.cn/plzhx/wyrt/201912/t20191226_12228091.htm>: HTTP status code is not handled or not allowed

I think the update logic currently keeps checking these URLs for new snapshots. We should stop scheduling updates for an article once it has accumulated a certain number of 404 errors, say 3.
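A minimal sketch of the idea, not the project's actual code: the class name `NotFoundTracker`, the constant `MAX_404_COUNT`, and the in-memory dict are all hypothetical. A real version would presumably keep the counter in the article database so it survives between runs. Note also that the log lines above show Scrapy dropping these responses before any callback runs, so the spider would probably need to allow 404 through (for example via `handle_httpstatus_list`) before it could record them.

```python
MAX_404_COUNT = 3  # threshold suggested above; the name is hypothetical


class NotFoundTracker:
    """Counts HTTP 404 responses per URL and decides whether to keep updating."""

    def __init__(self, max_404=MAX_404_COUNT):
        self.max_404 = max_404
        self.counts = {}  # url -> number of 404 responses seen so far

    def record(self, url, status):
        """Record one response; only 404 counts toward the limit."""
        if status == 404:
            self.counts[url] = self.counts.get(url, 0) + 1

    def should_update(self, url):
        """Return False once the URL has reached the 404 limit."""
        return self.counts.get(url, 0) < self.max_404


if __name__ == "__main__":
    tracker = NotFoundTracker()
    url = "https://www.ptt.cc/bbs/HatePolitics/M.1579175212.A.E0F.html"
    for _ in range(3):
        tracker.record(url, 404)
    print(tracker.should_update(url))
```

This sketch counts 404s cumulatively and never resets the counter. Whether a later successful fetch should reset it, and whether 403 should count too, are open design choices.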

pm5 commented 4 years ago

Duplicates #69