covidatlas / li

Next-generation serverless crawler for COVID-19 data

Disable or compress timeseries-tidy.csv.gz generation #323

Closed jzohrab closed 4 years ago

jzohrab commented 4 years ago

This report continually fails with out-of-memory errors on Lambda ... strange!

I've spent far too long on this, and would like to either change the report or discontinue it.

To change it, we could remove the redundant info. Below is an excerpt of the current file:

MacBook-Air:li jeff$ head ~/Downloads/timeseries-tidy.csv
locationID,slug,name,level,city,county,state,country,lat,long,population,aggregate,tz,date,type,value
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-23,cases,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-23,deaths,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-23,recovered,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-24,cases,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-24,deaths,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-24,recovered,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-25,cases,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-25,deaths,0
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177,,Europe/Andorra,2020-01-25,recovered,0

Every line repeats the slug, name, level, and so on. If we only include locationID, date, type, and value, that should be sufficient info for consumers, and it will reduce the file size significantly.

Changing the file to contain only locationID, name, date, type, and value gives a 61.2 MB file; also excluding name brings it down to 47.5 MB.

Sample:

locationID,date,type,value
iso1:ad,2020-01-23,cases,0
iso1:ad,2020-01-23,deaths,0
iso1:ad,2020-01-23,recovered,0
iso1:ad,2020-01-24,cases,0
iso1:ad,2020-01-24,deaths,0
iso1:ad,2020-01-24,recovered,0
iso1:ad,2020-01-25,cases,0
iso1:ad,2020-01-25,deaths,0
iso1:ad,2020-01-25,recovered,0
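For reference, the slim variant (and the size measurements above) can be reproduced with a simple column projection. A quick sketch in Python/pandas, assuming a local copy of the full report (the output filename is illustrative):

import pandas as pd

# Load the full report and keep only the four essential columns
full = pd.read_csv("timeseries-tidy.csv")
slim = full[["locationID", "date", "type", "value"]]
slim.to_csv("timeseries-tidy-small.csv", index=False)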

I could then provide a location.csv report that has the rest of the info, e.g.:

locationID,slug,name,level,city,county,state,country,lat,long,population
iso1:ad,ad,AD,country,,,,Andorra,42.55,1.58,76177

We already provide location.json, so I could just provide a CSV version of that.
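Consumers who still need the metadata could then rejoin it with a single merge on locationID. A minimal sketch in Python/pandas, assuming the slim timeseries and the location index land as timeseries-tidy-small.csv and location.csv:

import pandas as pd

# Slim timeseries: locationID,date,type,value
ts = pd.read_csv("timeseries-tidy-small.csv")

# Location index: locationID,slug,name,level,city,county,state,country,lat,long,population
locs = pd.read_csv("location.csv")

# Left-join on the stable locationID key to rebuild the old wide rows on demand
full = ts.merge(locs, on="locationID", how="left")

Since locationID is a stable key, nothing is lost by stripping the repeated columns from the timeseries file.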

47.5 MB is approximately the same size as timeseries-byLocation.json, so perhaps that will be sufficient. I'm checking with the sole consumer (Ryan Cooper), who said he's using it.

jzohrab commented 4 years ago

I've emailed Ryan Cooper to see if he's ok with the change.

elaborative commented 4 years ago

I have also noticed memory issues with the tidy CSV format. My Shiny app has been failing recently due to memory errors, likely as a result of file growth. I thought the error was actually coming from inside Shiny due to their in-memory app size restrictions on the free plans, but it may be a memory error at AWS during download via Shiny server. Shiny's debugging/logging is not great, so it's a bit unclear where the error lies - but it's definitely memory related.

Addressing your question - 100% agreed on removing location names and replacing them w/ ISO country codes and/or FIPS county codes where practical/available. This would greatly reduce the payload and create a more compliant tidy format. It will also improve overall accuracy for integration with census and other public data sources - I have seen issues specifically w/ reported US county names not lining up with official registers/sources due to abbreviations, e.g. ST. LUCIE COUNTY vs SAINT LUCIE COUNTY. Using FIPS would resolve these mismatches and enable better roll-up association to CBSA and MSA region units.
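To make the mismatch concrete, a toy example in Python/pandas (made-up rows; 12111 is the FIPS code for St. Lucie County, FL, and the population figure is illustrative):

import pandas as pd

# The same county spelled two different ways; the FIPS code is stable
covid = pd.DataFrame({"county": ["ST. LUCIE COUNTY"], "fips": ["12111"], "cases": [10]})
census = pd.DataFrame({"county": ["SAINT LUCIE COUNTY"], "fips": ["12111"], "population": [320000]})

by_name = covid.merge(census, on="county")  # 0 rows: the names don't match
by_fips = covid.merge(census, on="fips")    # 1 row: the codes line up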

Please go ahead w/ the changes. I will rewire to expect codes (instead of names) and let you know if it changes the behavior I see in Shiny. A location index as CSV would be great - or I can use the JSON if needed; R handles all standard formats easily. I suspect reducing the file size will resolve the memory errors you are seeing (for now).

Shiny's data plans are not very competitive in pricing - so I am concurrently working on a robust dotnet/R infrastructure I can run on my own private cloud at Rackspace. Once complete, I'll let you know if I see any differences in behavior without Shiny in the middle.

jzohrab commented 4 years ago

Ah, nice, thanks very much for the detail Ryan / @elaborative !

I'm going to create two reports, timeseries-tidy-small.csv and locations.csv, which combined will give you what you need. locations.csv uses the same source data as locations.json. I've merged https://github.com/covidatlas/li/pull/324 with those new files; they should be up in staging soon (under 2 hours?), and once they look OK I'll promote them to prod and kill the old report.

I'll let you know here when those reports are in staging, and then when they're in prod.

I have no idea what the issue is with this file ... it's not that big, and gzipping it shouldn't fail. But the smaller files make life easier.
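For what it's worth, the standard way to keep memory flat while generating a report like this is to stream rows through the gzip writer instead of materializing the whole CSV body first. A sketch of the idea in Python (the crawler itself is Node, so this is only the shape of the fix, not its implementation; rows stands for any lazy iterable of records):

import csv
import gzip

def write_tidy_csv(rows, path="timeseries-tidy-small.csv.gz"):
    # Write one row at a time through the gzip stream, so peak memory
    # is a single row rather than the entire report body.
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["locationID", "date", "type", "value"])
        for row in rows:
            writer.writerow(row)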

At some point, we'll have an API up which returns the data in smaller slices. The content will be the same, but maybe that will be a nice option for you.

jzohrab commented 4 years ago

The new reports are up in staging.

I'll promote this to production. The file sizes there should be similar.

jzohrab commented 4 years ago

The same reports are now up in prod.

Closing this - @elaborative, let me know how these work out for you. Eventually, these beta reports will be moved to "release" or some different URL. Will let you know, cheers! jz