ManuelAlvarezC opened this issue 4 years ago
Comments/Suggestions
Hi @smartcaveman, and thanks for your comments. Let me answer along your quotes:
- While we may only need CSV inputs for this repository at this point, it's highly likely that a reusable solution would be robust enough to consume and produce additional formats (JSON, JSON-LD, RDF, XML, YAML, etc.). Understanding this, it would be wise to implement some extensibility here by parameterizing the formats passed to whatever function processes the data sources.
Indeed, that's why the data source specification demands that the output be a pandas.DataFrame. This way, changing the input format is one line of code; making it work with different parameters, a couple more.
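To illustrate the point about format being a one-line change: a minimal sketch, assuming a hypothetical `load_source` helper and `READERS` map (neither is part of the repository's actual API). Since every data source must return a `pandas.DataFrame`, supporting a new format only means adding one entry to the map.

```python
from pathlib import Path

import pandas as pd

# Illustrative format dispatch table; extend with one line per new format.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xml": pd.read_xml,  # requires pandas >= 1.3
}

def load_source(path: str, **reader_kwargs) -> pd.DataFrame:
    """Load any supported file format into a DataFrame."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported format: {suffix}")
    return reader(path, **reader_kwargs)
```

Extra reader parameters (separators, encodings, etc.) can then be forwarded per source via `reader_kwargs`, which is the "couple more" lines mentioned above.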
- The source for the library referenced by the sample notebook is at kaggle-storage-client. Please raise any issues on that project if this implementation identifies obstacles to its use. It was built with the intention of making this kind of thing easier.
Will do. Thanks.
- Step 4 doesn't need to run if neither the code nor the data sources have changed since the last execution. To simplify this comparison, store the hash for source files.
Some of our datasets, like census, won't be changing, so it was already considered (although not written in the issue) that some data sources shouldn't run at every execution. Other data sources, like meteo or covid cases, may have new data every day, so it makes sense to run them even if the code hasn't changed.
However, I will make sure that these, let's call them "static", data sources are also executed when their code is updated.
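The scheduling rule described above could be sketched like this. The `DataSource` fields and `should_run` helper are assumptions for illustration, not the project's real interface:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSource:
    name: str
    static: bool    # True for sources like census that never change
    code_hash: str  # hash of the source's processing code

def should_run(source: DataSource, last_code_hash: Optional[str]) -> bool:
    """Static sources run only when their code changed;
    dynamic sources (meteo, covid cases) run on every execution."""
    if not source.static:
        return True
    return source.code_hash != last_code_hash
```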
- Step 4 could become reusable to other teams' datasets if we parameterize either (1) the set of data sources; or, (2) a configuration file containing the set of data sources
This is an interesting remark. Yesterday I had a call with Anton regarding this issue, and I've had that in mind while thinking about the design of the solution.
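As a sketch of option (2), a configuration file could enumerate the data sources so other teams only need to supply their own config. The JSON shape and `load_config` helper below are hypothetical, not a format the project has settled on:

```python
import json

# Hypothetical configuration enumerating the data sources for one team.
CONFIG = """
{
  "data_sources": [
    {"name": "census", "path": "data/census.csv", "static": true},
    {"name": "meteo",  "path": "data/meteo.csv",  "static": false}
  ]
}
"""

def load_config(text: str) -> list:
    """Parse the set of data sources from a JSON configuration."""
    return json.loads(text)["data_sources"]
```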
- W3C DCAT describes recommended semantics for describing aggregations of datasets. This is not necessary to consider yet, but may be helpful down the line as these processes become more complex.
- dataflows and datapackage-pipelines may provide useful examples for what a mature, generic implementation of this kind of process might look like.
I will check them in more detail over the weekend if I have time. They definitely look really interesting.
@ManuelAlvarezC sorry, my comment was ambiguous. Re: "To simplify this comparison, store the hash for source files.", I was referring to source code and data source files. So, if you download a large dataset that's expensive to process, and the hash of the dataset is the same as it was last time it was processed, then it doesn't need to be reprocessed.
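The clarified suggestion, hashing both the code and the downloaded data files, could look roughly like this. The function names and the shape of the stored-hash record are illustrative assumptions:

```python
import hashlib

def file_hash(path: str) -> str:
    """Return the SHA-256 of a file, read in chunks so large
    downloaded datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reprocessing(code_path: str, data_path: str, stored: dict) -> bool:
    """Reprocess only if the code or the data changed since last run."""
    current = {"code": file_hash(code_path), "data": file_hash(data_path)}
    return current != stored
```

On the first run `stored` would be empty, so the check returns True and the hashes are persisted for comparison on the next execution.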
Description
Trello card: https://trello.com/c/tb08vrGi
We need to upload the datasets generated by our data sources to make them easily accessible to other teams.
To do so, we need to: