CoronaWhy / task-geo

"Help us understand how geography affects virality."
MIT License

Add uploading of datasets. #52

Open ManuelAlvarezC opened 4 years ago

ManuelAlvarezC commented 4 years ago

Description

Trello card: https://trello.com/c/tb08vrGi

We need to upload the datasets generated by our data sources, to make them easily accessible to other teams.

To do so, we need to (see the sketch after this list):

  1. Create a set of default parameters for each data source, which may be empty.
  2. Create a function that, given a data source, runs it with the defined parameters, stores the result as a CSV, and packs it in a folder with a copy of the audit and metapackage.json files.
  3. Make a function, using this notebook as a template, that takes the path to a data package and uploads it to Kaggle.
  4. Create a function that takes no arguments and iterates through the data sources, generating the data packages and uploading them to Kaggle.
  5. Create a GitHub Action that runs the function from step 4 every 24 hours.
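Roughly, steps 1-4 could look like the sketch below. All the names (`DEFAULT_PARAMETERS`, `build_datapackage`, the `audit.md` / `metapackage.json` locations, the `data_sources/` folder layout) are placeholders rather than the actual task-geo API, and the upload assumes an already-configured `kaggle` CLI.

```python
# Hypothetical sketch of steps 1-4; names and paths are assumptions, not the real task-geo layout.
import shutil
import subprocess
from pathlib import Path

# Step 1: default parameters per data source (may be empty).
DEFAULT_PARAMETERS = {
    'census': {},
    'meteo': {'start_date': '2020-01-01'},
}

def build_datapackage(name, data_source, output_root='datapackages'):
    """Step 2: run a data source and pack the result with its metadata files."""
    params = DEFAULT_PARAMETERS.get(name, {})
    df = data_source(**params)  # per the spec, every data source returns a pandas.DataFrame

    folder = Path(output_root) / name
    folder.mkdir(parents=True, exist_ok=True)
    df.to_csv(folder / f'{name}.csv', index=False)

    # Copy the audit and metapackage.json files alongside the CSV.
    for extra in ('audit.md', 'metapackage.json'):
        src = Path('data_sources') / name / extra
        if src.exists():
            shutil.copy(src, folder / extra)
    return folder

def upload_datapackage(folder):
    """Step 3: push a data package folder to Kaggle via the kaggle CLI."""
    subprocess.run(
        ['kaggle', 'datasets', 'version', '-p', str(folder), '-m', 'Automated update'],
        check=True,
    )

def publish_all(data_sources):
    """Step 4: generate and upload every data package."""
    for name, data_source in data_sources.items():
        upload_datapackage(build_datapackage(name, data_source))
```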
smartcaveman commented 4 years ago

Comments/Suggestions

ManuelAlvarezC commented 4 years ago

Hi @smartcaveman, thanks for your comments. Let me answer along your quotes:

Comments/Suggestions

  • While we may only need CSV inputs for this repository at this point, it's highly likely that a reusable solution would be robust enough to consume and produce additional formats (JSON, JSON-LD, RDF, XML, YAML, etc.). Understanding this, it would be wise to implement some extensibility here by parametrizing the formats in whatever function processes the data sources.

Indeed, that's why the data source specification requires the output to be a pandas.DataFrame: this way, changing the output format is a one-line change, and making it work with different parameters only a couple more.
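For illustration only, a minimal dispatcher over the DataFrame output could look like this; the `WRITERS` table and `save` helper are not part of the spec, just an example of how cheap adding a format would be:

```python
import pandas as pd

# Hypothetical helper: since every data source returns a pandas.DataFrame,
# supporting another output format is mostly a one-line addition here.
WRITERS = {
    'csv': lambda df, path: df.to_csv(path, index=False),
    'json': lambda df, path: df.to_json(path, orient='records'),
    'parquet': lambda df, path: df.to_parquet(path, index=False),  # needs pyarrow or fastparquet
}

def save(df: pd.DataFrame, path: str, output_format: str = 'csv') -> None:
    """Write a data source result in the requested format."""
    WRITERS[output_format](df, path)
```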

  • The source for the library referenced by the sample notebook is at kaggle-storage-client. Please raise any issues on that project if this implementation identifies obstacles to its use. It was built with the intention of making this kind of thing easier.

Will do. Thanks.

  • Step 4 doesn't need to run if neither the code nor the data sources have changed since the last execution. To simplify this comparison, store the hash for source files.

Some of our datasets, like the census, won't be changing, so it was already considered, although not written in the issue, that some data sources shouldn't run at every execution. Also, other data sources, like meteo or COVID cases, may have new data every day, so it makes sense to run them even if the code hasn't changed.

However, I will make sure that these, let's call them "static", data sources are also executed whenever their code is updated.

  • Step 4 could become reusable for other teams' datasets if we parameterize either (1) the set of data sources, or (2) a configuration file containing the set of data sources.

This is an interesting remark. Yesterday I had a call with Anton regarding this issue, and I have had that in mind while thinking about the design of the solution.
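As a rough illustration of option (2), the set of data sources could live in a small configuration file. The file name, YAML layout, and module paths below are purely hypothetical, and the sketch assumes PyYAML is available:

```python
# Hypothetical config-driven variant of step 4: the sources to publish
# are read from a YAML file instead of being hard-coded.
#
# Example sources.yaml:
#
#   sources:
#     - name: census
#       function: task_geo.data_sources.census
#       parameters: {}
#     - name: meteo
#       function: task_geo.data_sources.meteo
#       parameters: {start_date: "2020-01-01"}
import importlib
import yaml

def load_sources(config_path='sources.yaml'):
    """Return a mapping of source name -> (callable, parameters) from the config file."""
    with open(config_path) as f:
        config = yaml.safe_load(f)

    sources = {}
    for entry in config['sources']:
        module_path, _, func_name = entry['function'].rpartition('.')
        module = importlib.import_module(module_path)
        sources[entry['name']] = (getattr(module, func_name), entry.get('parameters', {}))
    return sources
```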

  • W3C DCAT describes recommended semantics for describing aggregations of datasets. This is not necessary to consider yet, but may be helpful down the line as these processes become more complex.
  • dataflows and datapackage-pipelines may provide useful examples for what a mature, generic implementation of this kind of process might look like.

I will check them in more detail over the weekend if I have time. They definitely look really interesting.

smartcaveman commented 4 years ago

@ManuelAlvarezC sorry, my comment was ambiguous. Re: "To simplify this comparison, store the hash for source files.", I was referring to source code and data source files. So, if you download a large dataset that's expensive to process, and the hash of the dataset is the same as it was last time it was processed, then it doesn't need to be reprocessed.
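A minimal version of that check could hash both the data source's code and its raw data file, and skip reprocessing when neither has changed. The cache-file name and layout here are just assumptions for the sketch:

```python
# Hypothetical change detection: hash both the data source's code and its
# raw data file, and only reprocess when at least one of them changed.
import hashlib
import json
from pathlib import Path

HASH_CACHE = Path('.datapackage_hashes.json')

def file_hash(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def needs_processing(name, code_path, data_path):
    """Return True if the code or the raw data changed since the last recorded run."""
    cache = json.loads(HASH_CACHE.read_text()) if HASH_CACHE.exists() else {}
    current = {'code': file_hash(code_path), 'data': file_hash(data_path)}
    if cache.get(name) == current:
        return False
    cache[name] = current
    HASH_CACHE.write_text(json.dumps(cache, indent=2))
    return True
```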

hyberson commented 4 years ago

Locate and create code to extract granular COVID data for South Korea.