covidatlas / li

Next-generation serverless crawler for COVID-19 data

Scraper for Argentina #404

Closed jzohrab closed 4 years ago

jzohrab commented 4 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/633, transferred here on Friday Apr 03, 2020 at 20:02 GMT


https://www.argentina.gob.ar/coronavirus/informe-diario

This is from the federal government. They are publishing two PDFs per day. "Vespertino" = evening, "Matutino" = morning. They're probably meeting minutes.

Pros:

Cons:

jzohrab commented 4 years ago

(Transferred comment)

I've made an initial attempt at this in my argentina branch. I need help, because what I'm doing breaks caching.

From my Slack message:

In short: there's a webpage with links to PDFs; lately there are two PDFs per day. So the strategy is to 1) parse the main page to get the links, and 2) work out which PDFs are for the desired scrape date.
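
For illustration, here is a minimal sketch of that two-step strategy in Node. The `getPdfLinksForDate` helper, the cheerio parsing, and the `dd-mm-yy` filename pattern are all assumptions made for the sketch, not the project's actual fetch/cache API.

```js
const cheerio = require('cheerio')

// Sketch only: pull every PDF link off the index page, then keep the ones
// whose filename contains the requested date. The 'dd-mm-yy' pattern in the
// filenames is an assumption about how the site names its reports.
function getPdfLinksForDate (html, isoDate) {
  const $ = cheerio.load(html)
  const [y, m, d] = isoDate.split('-')           // e.g. '2020-04-01'
  const datePattern = `${d}-${m}-${y.slice(2)}`  // -> '01-04-20' (assumed format)
  return $('a[href$=".pdf"]')
    .map((i, el) => $(el).attr('href'))
    .get()
    .filter(href => href.includes(datePattern))
}
```

Called with the HTML of the informe-diario page and the scrape date, this would return the matutino and vespertino links for that day, assuming the filename pattern holds.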

Their page keeps the old PDFs around, but our cache doesn't. So I did something that's probably inappropriate: I force-fetch the files even if they're not in the cache. That way, if I try to scrape April 1st, it will get the main page, find the two PDFs for April 1st, and cache them. But the way it works now, it will replace files in the April 1st cache with whatever is retrieved today.
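
One way to avoid that clobbering, sketched under the assumption of a plain filesystem cache (the `cacheReport` name, `cacheDir` layout, and `slot` argument are made up for illustration), is to key each cached PDF by the report date it belongs to rather than by the date it was fetched:

```js
const fs = require('fs')
const path = require('path')

// Sketch: store each report under its own report date and slot, so a fetch
// made today for the April 1st PDFs never lands in (or overwrites) today's
// cache entry. The directory layout here is illustrative only.
function cacheReport (cacheDir, isoDate, slot, pdfBuffer) {
  const dir = path.join(cacheDir, isoDate)
  fs.mkdirSync(dir, { recursive: true })
  const file = path.join(dir, `${slot}.pdf`)  // e.g. 2020-04-01/vespertino.pdf
  if (!fs.existsSync(file)) {
    fs.writeFileSync(file, pdfBuffer)
  }
  return file
}
```

With a layout like that, re-running the April 1st scrape today only fills in files that are still missing under `2020-04-01/`.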

Parsing the PDFs will eventually be another story, but for now I'd like someone to take a look and let me know a better way to at least retrieve all the PDFs and store them in our cache. I assumed this strategy was better than caching every PDF for every day, but maybe that's the better approach?
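
For comparison, the "cache every PDF the index page lists" alternative might look roughly like the sketch below. `fetchBuffer` is a hypothetical download helper, and the file layout is again illustrative only.

```js
const fs = require('fs')
const path = require('path')
const cheerio = require('cheerio')

// Sketch: one crawl pass that downloads every PDF linked from the index page
// and stores each under its own file, skipping anything already cached.
// 'fetchBuffer' is a hypothetical helper that returns a PDF as a Buffer.
async function cacheAllListedPdfs (html, cacheDir, fetchBuffer) {
  const $ = cheerio.load(html)
  const links = $('a[href$=".pdf"]').map((i, el) => $(el).attr('href')).get()
  for (const href of links) {
    const file = path.join(cacheDir, path.basename(href))
    if (fs.existsSync(file)) continue  // already cached; never re-fetch
    fs.writeFileSync(file, await fetchBuffer(href))
  }
}
```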