**ciyer** opened this issue 5 years ago
This is a great question/request - I think we need more tooling around datasets in general. It seems to me that there are two general issues to solve:

1) how to tell the process running under `renku run` about the files/data in the dataset
2) how to use the dataset information when processing lineage

The first could have a very simple first solution: write a temporary file somewhere with the list of files in the dataset. We could put the location of this file into an environment variable that is available to the process.
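A minimal sketch of how that handshake could work, assuming a hypothetical environment variable name (`RENKU_DATASET_FILES` is invented here for illustration, not an agreed-upon interface):

```python
import json
import os
import subprocess
import tempfile


def run_with_dataset(dataset_files, command):
    """Expose a dataset's file list to the child process via a temp file."""
    # Write the manifest: a JSON list of the paths in the dataset.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as manifest:
        json.dump(dataset_files, manifest)

    # Point the child process at the manifest via the environment.
    env = dict(os.environ, RENKU_DATASET_FILES=manifest.name)
    subprocess.run(command, env=env, check=True)


# Inside the process running under `renku run`, the list is then simply:
#   files = json.load(open(os.environ["RENKU_DATASET_FILES"]))
```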
The second is a bit trickier. For example, I might have a dataset of "living" data files that get updated via a pipeline. I can reference the dataset with a flag like `--input-dataset`, but unless I also link the run dependencies to the individual files, I won't see that downstream results need to be updated -- unless the dataset is invalidated when the files change and always refers to files at a specific commit/checksum.
Yes, a dataset needs to have a version/hash. This could be computed from the hash of each file it contains.
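One illustrative way to compute such a hash (a sketch, not necessarily the scheme renku would adopt): hash each file, sort the per-file digests so the result does not depend on listing order, and hash the concatenation.

```python
import hashlib
from pathlib import Path


def dataset_hash(file_paths):
    """Derive a dataset-level hash from the hashes of its files."""
    # Sorting makes the result independent of the order files are listed in.
    digests = sorted(
        hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in file_paths
    )
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()
```

Any downstream result recorded against this value would then show up as stale as soon as a file in the dataset is added, removed, or modified.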
As discussed in the "2020-02-20 Renku Design meeting - Datasets", it'd be nice to have two axes of datasets, *explicit* vs *implicit*, and *permanent* vs *ephemeral*. Writing this down so it doesn't get lost.

- *explicit* = user created, with user-added metadata
- *implicit* = generated by renku, not necessarily visible to the user; an internal abstraction.
And all `renku run` commands would act on datasets (maybe call them "data collections" and only use "Dataset" for explicit, user-created ones), simplifying implementation on our end (no separate code paths for files or datasets).
The ephemeral/permanent dimension ties into `renku run` outputs being treated as a dataset, with *ephemeral* ones containing just the metadata (so the workflow has to be re-run if someone else wants the data), and *permanent* ones consisting of metadata + data/files. This way, we don't need to bundle code with datasets (as some other data repositories do); instead we can use *input dataset* + *renku run workflow* --> *output dataset*, and treat the *output dataset* as something that can be imported into other renku projects, which will re-execute the code on their end when needed. And *ephemeral* would help deal with cases where the data of a dataset can't be published but the metadata should still be in the KG.
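To make the two axes concrete, here is a minimal sketch of how such a record could be modeled; the class and field names are assumptions made for illustration, not the actual renku-python schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Origin(Enum):
    EXPLICIT = "explicit"    # user created, with user-added metadata
    IMPLICIT = "implicit"    # generated by renku; an internal abstraction


class Persistence(Enum):
    PERMANENT = "permanent"  # metadata + data/files are both kept
    EPHEMERAL = "ephemeral"  # metadata only; data is re-derived by re-running


@dataclass
class DataCollection:
    name: str
    origin: Origin
    persistence: Persistence
    files: list = field(default_factory=list)  # empty for ephemeral outputs
    version_hash: str = ""                     # e.g. dataset_hash(files)
```

Under this framing, a `renku run` output would simply be an implicit collection, and publishing it would mean flipping it to explicit/permanent rather than following a separate code path.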
From https://github.com/SwissDataScienceCenter/renku-python/issues/787 :
Renku users would like to use the `Dataset` abstraction in other renku commands. The most obvious example is to provide the ability for something like:

```
renku run --input-dataset <dataset> <command>
```

In this case, renku should make the contents of `<dataset>` available to `<command>`.
Similarly, it would be desirable to enable this behavior:

```
renku run --output-dataset <dataset> <command>
```

Here, outputs would automatically be added to `<dataset>`.
**Is your feature request related to a problem? Please describe.**
There is no way to express a dependency on a dataset in `renku run`. Inputs and outputs can only be files, not datasets. This leads to two main problems:

- there is no way for `renku status` to report that results built from a dataset are out of date
- there is no way for `renku update` to detect that a workflow needs to be rerun because a dataset has been appended to

**Example**
There is a collection of related projects: https://dev.renku.ch/projects/search?q=advanced-tutorial&searchIn=groups

The `flights-data-[2017,2018,2019]` projects are data projects -- they each contain data from one year. The project `flights-preprocess` performs some preprocessing on the data and produces a new dataset containing data for all years. The project https://dev.renku.ch/projects/advanced-tutorial/global-delay-analysis analyzes the data.

I would like to be able to do the following in `global-delay-analysis`:

- `renku run --input-dataset flights python analyze.py data/output/out.csv` to process all current data and store the results in `out.csv`
- `renku status` to see that the dataset used in `out.csv` is out of date
- `renku update` to fetch the latest data and rerun the analyze script (a sketch of such a script follows below)

**Describe the solution you'd like**

I'm not exactly sure what the best solution is, but I would like to open up a discussion.
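For concreteness, here is a sketch of what the `analyze.py` from the example could look like under such a solution; it assumes the hypothetical `RENKU_DATASET_FILES` manifest from the earlier comment, which is an illustration rather than an agreed interface:

```python
import csv
import json
import os
import sys


def main(out_path):
    # Hypothetical: `renku run --input-dataset` would expose the dataset's
    # file list via an environment variable (name assumed for illustration).
    with open(os.environ["RENKU_DATASET_FILES"]) as manifest:
        files = json.load(manifest)

    # Toy analysis: count the rows contributed by each file in the dataset.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "rows"])
        for path in files:
            with open(path) as f:
                writer.writerow([path, sum(1 for _ in f)])


if __name__ == "__main__":
    main(sys.argv[1])
```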