datasets / covid-19

Novel Coronavirus 2019 time series data on cases
https://datahub.io/core/covid-19

Data has gone stale #127

Open sigmondzavion opened 2 years ago

sigmondzavion commented 2 years ago

The last update appears to have been on 4/16.

Q: Do you have an ETA for making the data current?

aminoplis commented 2 years ago

Hi, any answer to this?

alan-isaac commented 2 years ago

Still stale. :-(

seun-beta commented 1 year ago

Hello @anuveyatsu

I hope you are good.

I would love to take up this task to update the data on this repo.

I am working on it at the moment.

seun-beta commented 1 year ago

Hello @anuveyatsu

I was able to track down the issue: the GitHub Actions workflow fails because some of the CSV files are over 100 MB, which is GitHub's maximum file size.

I think the results should either be written to CSV and then compressed to reduce the size, or Parquet should be used as the file format instead.
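For illustration, both options are essentially one-liners with pandas (the file path below is just an example, and the Parquet option would need pyarrow or fastparquet installed):

```python
import pandas as pd

# Example path only; the real script generates several CSVs.
df = pd.read_csv("data/time-series-19-covid-combined.csv")

# Option 1: gzip-compressed CSV; pandas infers the codec from the extension.
df.to_csv("data/time-series-19-covid-combined.csv.gz", index=False)

# Option 2: Parquet, a columnar format that is typically far smaller than CSV.
df.to_parquet("data/time-series-19-covid-combined.parquet", index=False)
```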

Please let me know what you think about it.

anuveyatsu commented 1 year ago

Thank you @seun-beta for taking the time to investigate this issue 👍🏼

I think the best option would be to use Git LFS (https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage) so that we can keep the data in a consistent format. I'm not sure you'd be able to complete this on your own, because I think we'd need to wire up external blob storage here (e.g., S3, Google Cloud Storage, etc.).
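For reference, the client-side setup is only a few commands (the tracking pattern below is just an example); the tricky part is the storage backing it:

```sh
git lfs install
git lfs track "data/*.csv"   # records the pattern in .gitattributes
git add .gitattributes
git commit -m "Track large CSVs with Git LFS"
```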

seun-beta commented 1 year ago

Hello @anuveyatsu

Thank you for your response. I also looked into Git LFS initially, but the overall setup seemed like too much overhead.

Another idea that came to mind is using S3 and Boto3: when the workflow run is triggered by the cron schedule, the code could push the results directly to S3.
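A rough sketch of what I mean (the bucket name and key are placeholders, and the workflow would need AWS credentials configured as repository secrets):

```python
import boto3

# Placeholders; real values would come from the workflow's secrets/env.
BUCKET = "covid-19-data"
s3 = boto3.client("s3")

# After the update script regenerates a CSV, upload it straight to S3
# instead of committing the oversized file to the repo.
s3.upload_file(
    Filename="data/time-series-19-covid-combined.csv",
    Bucket=BUCKET,
    Key="time-series-19-covid-combined.csv",
)
```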

What do you think about that?

mforsetti commented 1 year ago

Hello,

I've tried deploying Git LFS, and I'm getting this error:

> [main eb55196] Auto-update of the data packages
>  8 files changed, 24 insertions(+), 8580484 deletions(-)
> batch response: @github-actions[bot] can not upload new objects to public fork mforsetti/covid-19
> error: failed to push some refs to 'https://github.com/mforsetti/covid-19'
> Error: Process completed with exit code 1.

Apparently Git LFS refuses to push to forks of a non-Git-LFS parent repo. See git-lfs/git-lfs#1906.

What about gzipping the generated CSVs? We could add the compression code to the scripts/update_datapackage.py script.
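Something along these lines, as a sketch (the helper name is made up, not taken from the actual script):

```python
import gzip
import shutil
from pathlib import Path

def gzip_csv(csv_path: str) -> Path:
    """Compress one generated CSV, e.g. data/foo.csv -> data/foo.csv.gz."""
    src = Path(csv_path)
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    src.unlink()  # drop the uncompressed copy so it never hits the repo
    return dst
```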