SwissDataScienceCenter / renku-python

A Python library for the Renku collaborative data science platform.
https://renku-python.readthedocs.io/
Apache License 2.0

Make datasets usable in/creatable by workflows #706

Open ciyer opened 5 years ago

ciyer commented 5 years ago

Is your feature request related to a problem? Please describe.

There is no way to express a dependency on a dataset in renku run. Inputs and outputs can only be files, not datasets.

This leads to two main problems:

Example

There is a collection of related projects:

https://dev.renku.ch/projects/search?q=advanced-tutorial&searchIn=groups

The flights-data-[2017,2018,2019] projects are data projects; each contains data from one year. The project flights-preprocess performs some preprocessing on the data and produces a new dataset containing data for all years. The project https://dev.renku.ch/projects/advanced-tutorial/global-delay-analysis analyzes the data. I would like to be able to do the following

Describe the solution you'd like

I'm not exactly sure what the best solution is, but I would like to open up a discussion.

rokroskar commented 5 years ago

This is a great question/request - I think we need to have some more tooling around datasets in general. It seems to me that there are two general issues to solve:

1) how to tell the process running under renku run about the files/data in the dataset
2) how to use the dataset information when processing lineage

The first could have a simple initial solution: write a temporary file somewhere with the list of files in the dataset. We could put the location of this file into an environment variable that is available to the process; a sketch of this follows.
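A minimal sketch of that mechanism, assuming a hypothetical environment variable name (RENKU_DATASET_MANIFEST) and a plain newline-separated manifest format, neither of which exists in renku today:

```python
import os
import subprocess
import tempfile

def run_with_dataset_manifest(command, dataset_files):
    """Write the dataset's file list to a temp file and expose its
    path to the child process via an environment variable."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", delete=False
    ) as manifest:
        manifest.write("\n".join(dataset_files))
        manifest_path = manifest.name

    # The child process can read the manifest path from its environment.
    env = dict(os.environ, RENKU_DATASET_MANIFEST=manifest_path)
    try:
        subprocess.run(command, env=env, check=True)
    finally:
        os.unlink(manifest_path)

# Inside the wrapped command, the file list would then be available as:
#   with open(os.environ["RENKU_DATASET_MANIFEST"]) as f:
#       files = f.read().splitlines()
```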

The second is a bit trickier. For example, I might have a dataset of "living" data files that get updated via a pipeline. I can reference the dataset with a flag like --input-dataset, but unless I also link the run dependencies to the individual files, I won't see that downstream results need to be updated. That is, unless the dataset is invalidated when its files change and always refers to files at a specific commit/checksum.

ciyer commented 5 years ago

Yes, a dataset needs to have a version/hash. This could be computed from the hashes of the files it contains; see the sketch below.
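One way to do this is to hash each file and then hash the sorted (path, digest) pairs, so that any change to any file changes the dataset version. This sketch illustrates the idea; it is not renku's actual implementation:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of a single file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def dataset_digest(paths):
    """Combine per-file digests into one dataset hash. Sorting the
    paths makes the result independent of listing order."""
    h = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        h.update(f"{path}\0{file_digest(path)}\n".encode())
    return h.hexdigest()

# e.g. dataset_digest(Path("data/flights-2017").rglob("*.csv"))
```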

Panaetius commented 4 years ago

As discussed in the "2020-02-20 Renku Design meeting - Datasets", it'd be nice to have two axes of datasets: explicit vs. implicit, and permanent vs. ephemeral. Writing this down so it doesn't get lost.

explicit = user-created, with user-added metadata
implicit = generated by renku, not necessarily visible to the user; an internal abstraction

All renku run commands would then act on datasets (maybe call them "data collections" and reserve "Dataset" for explicit, user-created ones), simplifying implementation on our end (no separate code paths for files vs. datasets).

The ephemeral/permanent dimension ties into renku run outputs being treated as a dataset: ephemeral ones contain just the metadata (and the workflow has to be re-run if someone else wants the data), while permanent ones consist of metadata plus data/files. This way, we don't need to bundle code with datasets (as some other data repositories do); instead we can use input dataset + renku run workflow --> output dataset, and treat the output dataset as something that can be imported into other renku projects, which re-execute the code on that end when needed. Ephemeral datasets would also help in cases where the data of a dataset can't be published but the metadata should still be in the KG.
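A sketch of these two axes as a small data model; the names and fields here are assumptions made for discussion, not renku's schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Origin(Enum):
    EXPLICIT = "explicit"   # user-created, with user-added metadata
    IMPLICIT = "implicit"   # generated by renku as an internal abstraction

class Persistence(Enum):
    PERMANENT = "permanent"  # metadata + data/files are kept
    EPHEMERAL = "ephemeral"  # metadata only; data is re-derived by
                             # re-running the producing workflow

@dataclass
class DataCollection:
    name: str
    origin: Origin
    persistence: Persistence
    files: list = field(default_factory=list)  # paths or file metadata
```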

Panaetius commented 3 years ago

From https://github.com/SwissDataScienceCenter/renku-python/issues/787 :

Renku users would like to use the Dataset abstraction in other renku commands. The most obvious example would be to support something like:

renku run --input-dataset <dataset> <command>

In this case, renku should make the contents of <dataset> available to <command>.

Similarly, it would be desirable to enable this behavior:

renku run --output-dataset <dataset> <command>

Here, outputs would automatically be added to <dataset>.
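As a rough illustration of that behavior, --output-dataset could be emulated today by running the command, diffing the working tree to find newly created files, and adding them to the dataset with the existing renku dataset add command. The snapshot diff below is naive and purely for illustration:

```python
import subprocess
from pathlib import Path

def run_with_output_dataset(command, dataset, workdir="."):
    """Run `command`, then add any files it created to `dataset`."""
    before = {p for p in Path(workdir).rglob("*") if p.is_file()}
    subprocess.run(command, check=True)
    after = [p for p in Path(workdir).rglob("*") if p.is_file()]
    new_files = [str(p) for p in after if p not in before]
    if new_files:
        # `renku dataset add <dataset> <files>` adds files to a dataset.
        subprocess.run(
            ["renku", "dataset", "add", dataset, *new_files],
            check=True,
        )

# e.g. run_with_output_dataset(["python", "analysis.py"], "results")
```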