euagendas / semeval_8_2022_ia_downloader

internet archive downloader for task 8 at semeval

download IA excluded links from live pages #1

Closed hide-ous closed 3 years ago

hide-ous commented 3 years ago

Some links are excluded from the IA, e.g.,

2021-08-02 09:45:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://nltimes.nl/2020/02/26/coronavirus-authorities-fear-german-tourist-brought-covid-19-netherlands>: HTTP status code is not handled or not allowed
2021-08-02 09:45:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://eu.indystar.com/story/news/health/2020/05/01/indiana-reopening-timeline-coronavirus-pandemic/3059275001/>: HTTP status code is not handled or not allowed
2021-08-02 09:45:38 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.thedailybeast.com/twitter-deleted-sheriff-clarkes-wildly-reckless-coronavirus-tweets-so-he-says-hes-going-to-parler?source=articles&via=rss&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+thedailybeast%2Farticles+%28The+Daily+Beast+-+Latest+Articles%29>: HTTP status code is not handled or not allowed
2021-08-02 09:45:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.businessinsider.com/patient-thanks-medical-workers-hospital-window-note-critical-care-2020-3?utm_source=feedburner&amp%3Butm_medium=referral&utm_medium=feed&utm_campaign=Feed%3A+businessinsider+%28Business+Insider%29>: HTTP status code is not handled or not allowed
2021-08-02 09:45:51 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://nationalpost.com/pmn/health-pmn/toyota-extends-shutdown-of-north-american-plants-through-april-17>: HTTP status code is not handled or not allowed

I think this is because someone requested the links to be taken down from IA.

We should try to download these pages from the original sites using newspaper3k, and output them in the same format as the rest.
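
A minimal sketch of what the live-site fallback could look like, assuming newspaper3k is installed. The helper names (`article_to_record`, `download_live`) and the output field names are hypothetical; the real record layout should mirror whatever the archive-based pipeline emits.

```python
def article_to_record(url, article):
    """Map a parsed article onto a plain dict.

    The field names here are placeholders (an assumption), not the
    format the downloader actually uses.
    """
    return {
        "url": url,
        "title": article.title,
        "text": article.text,
        "source": "live",  # marks pages that did not come from the IA
    }


def download_live(url):
    """Fetch and parse a page from the original site with newspaper3k."""
    # Imported lazily so the mapping helper above works without the package.
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download()  # fetch the HTML from the live site
    article.parse()     # extract title, body text, etc.
    return article_to_record(url, article)
```

Calling `download_live(url)` for each URL reported as 404 in the log above would then yield one record per recoverable page; pages that are also gone from the live site would still need separate handling.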

hide-ous commented 3 years ago

One way to go about this is to add a callback for error handling (an errback) to the requests the spider issues; another is to change the middleware or the downloader.
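
The errback option could be sketched roughly as follows. This is an illustration under assumptions, not the code from the repository: `LiveFallbackMixin`, `request_with_fallback`, `on_error`, and the yielded field names are all hypothetical. It relies on Scrapy passing an `HttpError` failure, which carries the offending response, to a request's `errback` when the status code is not handled.

```python
class LiveFallbackMixin:
    """Mix into a scrapy.Spider subclass (hypothetical helper)."""

    # Status codes that should trigger a download from the live site.
    fallback_statuses = {404}

    def request_with_fallback(self, url):
        # Imported lazily so the error-handling logic can be read and
        # exercised without Scrapy installed.
        import scrapy  # third-party: pip install scrapy

        return scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def on_error(self, failure):
        # For HTTP errors, failure.value is an HttpError holding the
        # response; other failures (DNS, timeout) have no response.
        response = getattr(failure.value, "response", None)
        if response is not None and response.status in self.fallback_statuses:
            # Flag the URL so a later step can fetch it from the live page.
            yield {"url": response.url, "needs_live_download": True}
```

A spider would inherit from both this mixin and `scrapy.Spider`, and build its requests with `request_with_fallback` instead of calling `scrapy.Request` directly. Whether the fallback download happens inside the errback or in a later pass over the flagged URLs is a design choice left open here.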

hide-ous commented 3 years ago

closed with 08f3551