Closed by pSpitzner 4 years ago
If you are only talking about caching the download, we could check for the 'last-modified' header and only pull a new version if it is newer than a local one.
If you are talking about caching the get_* methods, lru_cache should work, but I do not think this will gain us much performance.
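For reference, caching a get_* method with functools.lru_cache could look roughly like the sketch below. The class DataSource and the method get_total are hypothetical stand-ins, not the project's actual API, and the counter exists only to make the cache hit visible.

```python
from functools import lru_cache


class DataSource:
    """Hypothetical stand-in for a class with get_* methods."""

    def __init__(self, rows):
        # rows: iterable of (country, value) pairs
        self.rows = tuple(rows)
        self.computations = 0  # counts how often the filter really runs

    @lru_cache(maxsize=32)
    def get_total(self, country):
        # All arguments (including self) must be hashable for lru_cache.
        self.computations += 1
        return sum(value for c, value in self.rows if c == country)


source = DataSource([("Germany", 3), ("Italy", 5), ("Germany", 4)])
first = source.get_total("Germany")
second = source.get_total("Germany")  # served from the cache
```

This only saves repeated filtering within one session; it does not avoid the download itself. Note also that lru_cache on a method keeps a reference to the instance for as long as the entry stays in the cache.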
I like @semohr's idea, it makes sense for the class to store the modified date.
This would need to be defined per source: of our current ones, only Google returns a last-modified in the headers. For JHU we could scrape the commit date, I guess, and for RKI we could use Datenstand.
urllib.request.urlopen('https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv').headers.items()
Out[30]:
[('Vary', 'Accept-Encoding'),
('Accept-Ranges', 'bytes'),
('Content-Type', 'text/csv'),
('Content-Length', '14090517'),
('Date', 'Tue, 21 Apr 2020 10:29:49 GMT'),
('Expires', 'Wed, 21 Apr 2021 10:29:49 GMT'),
('Cache-Control', 'public, max-age=31536000'),
('Last-Modified', 'Fri, 17 Apr 2020 00:18:22 GMT'),
('X-Content-Type-Options', 'nosniff'),
('X-Robots-Tag', 'noindex'),
('Server', 'sffe'),
('X-XSS-Protection', '0'),
('Alt-Svc',
'quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000')]
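A minimal sketch of the comparison for a source that does send Last-Modified, using the header value from the Google output above. The function name remote_is_newer and the choice to download whenever a date is missing are assumptions for illustration, not existing project code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified_header, local_modified):
    """Return True if the server copy is newer than the local copy.

    last_modified_header: value of the 'Last-Modified' header, or None
    local_modified: timezone-aware datetime of the local copy, or None
    """
    if local_modified is None:
        return True  # nothing stored locally yet
    if last_modified_header is None:
        return True  # source gives no date, so download to be safe
    remote_modified = parsedate_to_datetime(last_modified_header)
    return remote_modified > local_modified


header = "Fri, 17 Apr 2020 00:18:22 GMT"  # value from the output above
older_local = datetime(2020, 4, 16, tzinfo=timezone.utc)
newer_local = datetime(2020, 4, 18, tzinfo=timezone.utc)
```

With these values, remote_is_newer(header, older_local) triggers a download and remote_is_newer(header, newer_local) does not.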
JHU
urllib.request.urlopen('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv').headers.items()
Out[33]:
[('Connection', 'close'),
('Content-Length', '63631'),
('Content-Type', 'text/plain; charset=utf-8'),
('Cache-Control', 'max-age=300'),
('Content-Security-Policy',
"default-src 'none'; style-src 'unsafe-inline'; sandbox"),
('ETag',
'W/"eb2b872fe3dffa18fe5668d2145a8335faf912a1aa6d871153f1d52adda44a9f"'),
('Strict-Transport-Security', 'max-age=31536000'),
('X-Content-Type-Options', 'nosniff'),
('X-Frame-Options', 'deny'),
('X-XSS-Protection', '1; mode=block'),
('Via', '1.1 varnish (Varnish/6.0)'),
('X-GitHub-Request-Id', 'F8E4:63E6:311F2E:3DD53F:5E9ECCAA'),
('Accept-Ranges', 'bytes'),
('Date', 'Tue, 21 Apr 2020 10:42:40 GMT'),
('Via', '1.1 varnish'),
('X-Served-By', 'cache-hhn4058-HHN'),
('X-Cache', 'HIT, HIT'),
('X-Cache-Hits', '2, 1'),
('X-Timer', 'S1587465761.669682,VS0,VE1'),
('Vary', 'Authorization,Accept-Encoding'),
('Access-Control-Allow-Origin', '*'),
('X-Fastly-Request-ID', 'd96337c661ab3460bff4aa857b3507dc81c3132d'),
('Expires', 'Tue, 21 Apr 2020 10:47:40 GMT'),
('Source-Age', '81')]
We should think about where to store it, perhaps in a folder data/ that is added to the .gitignore file. In general, I think it is a good idea.
Working on it :+1:
JHU is still missing, right? Looks great though!
JHU is still missing since there is also no "last-modified" header for GitHub. But one could get the last commit date for https://github.com/CSSEGISandData/COVID-19. We would need to use pygit for that, is it fine to add that?
On the other hand, we wanted to change the RKI date check to use the ArcGIS API. I tried to filter only for the Meldedatum here and check for the newest one in the list, but that feels kind of hacky too. Is there a way to query only the date on which the dataset was last updated?
Additionally, removing the os.path.getmtime() call seems like a good idea, since it depends on the operating system and could lead to problems down the road.
I was thinking of creating a dict with the different last-updated dates and saving this dict to a file. (The file has to be added to the .gitignore.)
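One way this could look, as a sketch only: the dict is stored as JSON, keyed by source name. The function names, the file name last_updated.json and the "google" key are assumptions made for this example.

```python
import json
import os
import tempfile


def save_timestamps(path, timestamps):
    """Write {source name: ISO date string} to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(timestamps, f, indent=2)


def load_timestamps(path):
    """Read the dict back; an absent file means nothing was downloaded yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demo in a temporary directory; the real location would be a gitignored folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_updated.json")
    before = load_timestamps(path)
    save_timestamps(path, {"google": "2020-04-17T00:18:22+00:00"})
    after = load_timestamps(path)
```

Storing the dates in a file like this would not rely on the file system's modification time.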
I think RKI is fine at the moment (I reworked it a bit, I don't know if you saw it @semohr): the current version uses Datenstand as last-modified, which is the only number that is comparable between different sources.
For JHU, I suggest we just set auto_download = True and fall back to the local copy if it fails: the files are very small (~200 kB in total) and grow at a rate of ~600 new numbers per day, so they will remain small.
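The download-then-fallback idea could be sketched as follows. load_with_fallback and the two callables are hypothetical names for illustration; the failing download only simulates an unreachable server.

```python
def load_with_fallback(download, load_local):
    """Try a fresh download first; use the local copy if that fails.

    download and load_local are callables returning the data.
    Returns (data, source) where source is 'remote' or 'local'.
    """
    try:
        return download(), "remote"
    except OSError as err:  # urllib.error.URLError is a subclass of OSError
        print(f"Download failed ({err}), using local copy")
        return load_local(), "local"


def failing_download():
    raise OSError("no network")  # simulates an unreachable server


data, source = load_with_fallback(failing_download, lambda: "cached csv")
```

Catching only OSError keeps programming errors in the download code visible instead of silently hiding them behind the local copy.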
I did not see the RKI changes yet, but they look good.
Sounds like a good suggestion, I will work on that :+1:
We should cache the data, ideally between runs (module loads) or at least for the active session.
@joaopn @semohr any suggestions?