galaxyproject / galaxy

Human-readable/user-defined filename (& path) for datasets #7525

Open VJalili opened 5 years ago

VJalili commented 5 years ago

Currently, Galaxy (ObjectStore) organizes files (representing datasets) under admin-defined paths following the [0-9]{3}(\/dataset_)[0-9] pattern. For instance:

├── database
│   ├── files1
│   │   ├── 000
│   │   │   ├── dataset_1.dat
│   │   │   ├── dataset_2.dat
│   │   │   ├── dataset_3.dat
...
AWS S3
├── my_bucket
│   ├── files2
│   │   ├── 000
│   │   │   ├── dataset_4.dat
│   │   │   ├── dataset_5.dat
│   │   │   ├── dataset_6.dat

Here, some datasets are stored locally and others in an AWS S3 bucket, all under a 000 folder and named dataset_X.dat, where X is the dataset ID.
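The layout can be checked mechanically. The following is a minimal sketch that uses the pattern exactly as quoted in this issue; the helper name `follows_galaxy_layout` and the third sample path are illustrative assumptions, not Galaxy code.

```python
import re

# Pattern as quoted in the issue: a three-digit directory, then "/dataset_",
# then at least one digit of the dataset id.
GALAXY_LAYOUT = re.compile(r"[0-9]{3}(\/dataset_)[0-9]")


def follows_galaxy_layout(path: str) -> bool:
    """Return True if the path contains the admin-defined layout (hypothetical helper)."""
    return GALAXY_LAYOUT.search(path) is not None


paths = [
    "database/files1/000/dataset_1.dat",
    "my_bucket/files2/000/dataset_4.dat",
    "my_bucket/reads/sample_A.fastq",  # a user-named file, added for contrast
]
layout_matches = {p: follows_galaxy_layout(p) for p in paths}
```

Only the first two paths match; a file named by a user falls outside the pattern, which is the case this issue is about.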

There are a number of advantages to this pattern of naming/structuring files; for instance:

However, in spite of its advantages, this pattern comes with some disadvantages. For instance:

Galaxy (ObjectStore) abstracts datasets from persisted files: users are in control of datasets (they can create/use/share/delete/purge them), while Galaxy is in control of files, ensuring their accessibility, consistency, and immutability. One way to enforce Galaxy's exclusive control over files is the naming pattern described above. However, for various reasons (such as those mentioned above), it is advantageous to let users control files, so that Galaxy can populate a history by reading files from a user-specified folder, regardless of how they are named, and without duplicating them solely to adhere to the naming pattern that Galaxy (ObjectStore) requires.
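As a rough illustration of the proposal, a history could be populated by referencing files in place. This is only a sketch under stated assumptions: `DatasetRecord` and `populate_history` are hypothetical names, not part of Galaxy or its ObjectStore, and a real implementation would also have to address consistency and immutability.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical record: a display name plus a pointer to an existing file."""
    name: str
    path: Path  # the file stays where the user put it; nothing is copied


def populate_history(folder: Path) -> list[DatasetRecord]:
    """Create one record per regular file in the folder, sorted by path."""
    return [
        DatasetRecord(name=p.name, path=p.resolve())
        for p in sorted(folder.iterdir())
        if p.is_file()
    ]


# Demo on a throwaway folder with user-chosen file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "sample_A.fastq").write_text("@read1\n")
    (folder / "results.tsv").write_text("gene\tcount\n")
    history = populate_history(folder)
    names = [record.name for record in history]
```

The records keep the user's own file names and paths, so no file has to be renamed or duplicated to fit the dataset_X.dat layout.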

There are some challenges to enable this feature via ObjectStore; for instance:

ping @jgoecks @jmchilton @martenson @dannon @afgane @luke-c-sargent @natefoo

jmchilton commented 5 years ago

Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files ensuring their accessibility, consistency, and immutability.

It is a seemingly small point, but it has huge architectural implications: users do not control the datasets, they control the dataset associations and metadata. If users controlled the datasets and not the dataset associations, we would have to re-architect dataset copying, sharing, etc. I think we would need new abstractions, APIs, and UIs for how these things would work.
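A simplified sketch of that distinction, not Galaxy's actual models: the class and field names below are illustrative assumptions, chosen only to show why copying and sharing operate on associations and not on the file itself.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """The stored file; managed by Galaxy, not directly by users."""
    id: int
    file_path: str


@dataclass
class DatasetAssociation:
    """What a user manipulates: a named reference to a Dataset in a history."""
    name: str
    history_id: int
    dataset: Dataset

    def copy_to(self, history_id: int) -> "DatasetAssociation":
        # Copying or sharing creates a new association; the file is not duplicated.
        return DatasetAssociation(self.name, history_id, self.dataset)


dataset = Dataset(id=1, file_path="database/files1/000/dataset_1.dat")
original = DatasetAssociation("reads", history_id=10, dataset=dataset)
shared = original.copy_to(history_id=20)
```

Both associations point at the same `Dataset` object. If users owned the file directly, a copy in another history could no longer safely share it, which is the re-architecture concern raised here.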

The two questions we want to answer I think are:

If there is an architecture for this working, I guess I'd prefer to see it done, and doable, with existing object store constructs and models first.

If there were a connection between the Dataset and a User in the database, and we implemented this throughout the app, then it seems like the pluggable media stuff would fit right in and seem natural.

I think solving all of that is a precursor to thinking about files instead of datasets? If we could say "this job's outputs should belong to user X", then we could create the datasets in that fashion, put them where they need to be in that user's object store, etc.
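To make the idea concrete, here is a minimal sketch of routing a job's outputs by owner. Everything in it is hypothetical: the `user_id` field on `Dataset`, the `ObjectStore` stand-in, and `create_output` are assumptions for illustration and do not reflect Galaxy's existing models or API.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Hypothetical stand-in for a storage backend."""
    name: str
    dataset_ids: list = field(default_factory=list)


@dataclass
class Dataset:
    id: int
    user_id: int  # the Dataset-to-User link discussed above (hypothetical)


def create_output(dataset_id, user_id, stores_by_user, default_store):
    """Create a dataset owned by user_id and place it in that user's store."""
    dataset = Dataset(id=dataset_id, user_id=user_id)
    # Fall back to the instance-wide store when the user has none configured.
    store = stores_by_user.get(user_id, default_store)
    store.dataset_ids.append(dataset.id)
    return dataset, store


default_store = ObjectStore("galaxy_default")
stores_by_user = {42: ObjectStore("user_42_bucket")}

_, store_a = create_output(7, user_id=42, stores_by_user=stores_by_user, default_store=default_store)
_, store_b = create_output(8, user_id=99, stores_by_user=stores_by_user, default_store=default_store)
```

With an owner known at creation time, the output for user 42 lands in that user's store, while a user without a configured store falls back to the default.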