Human-readable/user-defined filename (& path) for datasets

Currently, Galaxy (ObjectStore) organizes files (representing datasets) under admin-defined paths following the pattern [0-9]{3}/dataset_[0-9]+\.dat. For instance:

├── database
│   ├── files1
│   │   ├── 000
│   │   │   ├── dataset_1.dat
│   │   │   ├── dataset_2.dat
│   │   │   ├── dataset_3.dat
...
AWS S3
├── my_bucket
│   ├── files2
│   │   ├── 000
│   │   │   ├── dataset_4.dat
│   │   │   ├── dataset_5.dat
│   │   │   ├── dataset_6.dat

Here, some datasets are stored locally and some are stored in an AWS S3 bucket, all under a 000 folder and named dataset_X.dat, where X is the dataset id.

There are a number of advantages to this pattern of naming/structuring files; for instance:

The absolute file path can be generated dynamically, on the fly; e.g., see: https://github.com/galaxyproject/galaxy/blob/d566ed9b8b67aec270b92ad5ef443d972f598fa1/lib/galaxy/objectstore/__init__.py#L376-L377
It follows best-practice guidelines that recommend against encoding metadata in filenames.
By obfuscating filenames and organizing them under a specific structure, it mildly enforces data immutability and consistency (i.e., data stored in a file has remained unchanged since it was last accessed), which helps ensure reproducibility of analyses.
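
To illustrate the first point, here is a minimal sketch of deriving a file path from nothing but the dataset id. The helper names (`dataset_relative_path`, `dataset_path`) are hypothetical, and the rule of one three-digit folder per 1000 ids is a simplifying assumption; Galaxy's actual implementation (linked above) may nest directories differently for large ids.

```python
import posixpath


def dataset_relative_path(dataset_id: int) -> str:
    """Hypothetical helper: map a dataset id to '<bucket>/dataset_<id>.dat'.

    Assumption: ids are grouped 1000 per folder, so ids 0..999 land in '000'.
    """
    if dataset_id < 0:
        raise ValueError("dataset id must be non-negative")
    bucket = f"{dataset_id // 1000:03d}"
    return posixpath.join(bucket, f"dataset_{dataset_id}.dat")


def dataset_path(store_root: str, dataset_id: int) -> str:
    """Join an admin-defined root (local folder or bucket prefix) with the relative path."""
    return posixpath.join(store_root, dataset_relative_path(dataset_id))


print(dataset_path("database/files1", 3))  # database/files1/000/dataset_3.dat
```

Because the path is a pure function of the id, nothing about the file's location needs to be stored or looked up, which is exactly what breaks once filenames become user-defined.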

However, in spite of its advantages, this pattern comes with some disadvantages. For instance:

It makes it hard to mount data on Galaxy, analyze them, and push results back to a folder (e.g., the same folder that was mounted). For instance, a user who has their data in an S3 bucket may want to mount that bucket into Galaxy and analyze all the files (objects) inside it. Currently, one of the challenges is that the filenames do not adhere to Galaxy's expected naming pattern (at least from ObjectStore's perspective).
It prevents users from naming files in a way that makes sense to them. For instance, users often encode some metadata (e.g., antibody name, experiment duration, tissue name, or lab name) in filenames and write scripts that parse the filenames and run appropriate post-processing according to that metadata. By obfuscating filenames, we make it harder for users to correlate input and output files.
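
As an illustration of the second point, the kind of user script in question might look like the sketch below. The filename convention (`<antibody>_<tissue>_<hours>h_<lab>.<ext>`) and the function name `parse_metadata` are made up for this example; real conventions vary from lab to lab.

```python
import re
from typing import Optional

# Hypothetical convention: <antibody>_<tissue>_<hours>h_<lab>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<antibody>[^_]+)_(?P<tissue>[^_]+)_(?P<duration_h>\d+)h_(?P<lab>[^_.]+)\.(?P<ext>\w+)$"
)


def parse_metadata(filename: str) -> Optional[dict]:
    """Return the metadata encoded in a filename, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None


# A user-named file carries metadata a script can act on ...
print(parse_metadata("H3K27ac_liver_48h_smithlab.bam"))
# ... whereas an obfuscated name carries none.
print(parse_metadata("dataset_3.dat"))
```

Once the file is renamed to dataset_X.dat, such a script has nothing left to parse, so the link between inputs and outputs has to be rebuilt some other way.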

Galaxy (ObjectStore) abstracts datasets from persisted files: users are in control of datasets (they can create/use/share/delete/purge them), and Galaxy is in control of files, ensuring their accessibility, consistency, and immutability. One way to enforce Galaxy's exclusive control over files is the aforementioned naming pattern. However, for various reasons (such as those listed above), it would be advantageous to allow users to be in control of files, so that Galaxy can populate a history by reading files from a user-specified folder, regardless of how they are named, and without having to duplicate them only to comply with the naming pattern required by Galaxy (ObjectStore).

There are some challenges to enable this feature via ObjectStore; for instance:

ObjectStore needs to be able to operate on a per-user basis (see PR: https://github.com/galaxyproject/galaxy/pull/4840);
ObjectStore should be able to read and write datasets from/to user-specified filenames, maybe leveraging external_filename and _extra_files_path: https://github.com/galaxyproject/galaxy/blob/d566ed9b8b67aec270b92ad5ef443d972f598fa1/lib/galaxy/model/mapping.py#L210-L224
data consistency/immutability should be ensured via file checksums; see PRs https://github.com/galaxyproject/galaxy/pull/4659 and https://github.com/galaxyproject/galaxy/pull/7487
it should be possible to mount a folder or a cloud-based bucket in ObjectStore, e.g., via FUSE or symlink.
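
To make the checksum point concrete, here is a rough sketch of how a user-named file could be registered and later checked for changes. `ExternalFileRecord`, `register`, and `is_unchanged` are hypothetical names for illustration only and are not part of Galaxy or of the linked PRs; the choice of SHA-256 is also an assumption.

```python
import hashlib
from dataclasses import dataclass


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ExternalFileRecord:
    """Hypothetical record tying a dataset id to a user-named file."""
    dataset_id: int
    external_filename: str
    checksum: str


def register(dataset_id: int, path: str) -> ExternalFileRecord:
    """Record the file's current checksum at the time it is attached to a dataset."""
    return ExternalFileRecord(dataset_id, path, sha256_of(path))


def is_unchanged(record: ExternalFileRecord) -> bool:
    """True if the file content still matches the checksum recorded at registration."""
    return sha256_of(record.external_filename) == record.checksum
```

With something like this, immutability is detected rather than enforced: the file can live under any name the user picks, and a mismatch signals that the data behind a dataset has changed since it was registered.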

ping @jgoecks @jmchilton @martenson @dannon @afgane @luke-c-sargent @natefoo

Galaxy (ObjectStore) abstracts datasets from persisted files, where users are in control of datasets (can create/use/share/delete/purge them) and Galaxy is in control of files ensuring their accessibility, consistency, and immutability.

It is a seemingly small point, but it has huge architectural implications: users do not control the datasets, but rather the dataset associations and metadata. If users controlled the datasets and not the dataset associations, we would have to re-architect dataset copying, sharing, etc.; I think we would need new abstractions, APIs, and UIs for how these things would work.

I think the two questions we want to answer are:

Do we want to have a variant (subclass, flag, etc.) of galaxy.model.Dataset that does indeed belong to a user (or group)? I heard a lot of yeses at the meeting, but there didn't seem to be a consensus.
If yes, what does that look like at the database layer, the model layer, the API, and the UI? Some things are obvious: for example, if a user copies a dataset that is truly owned by another user, we need to make a physical copy of the data.
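
As a very rough sketch of the copy semantics in the second question (plain dataclasses, not galaxy.model.Dataset; `OwnedDataset` and `copy_dataset` are hypothetical names), copying could branch on ownership like this:

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OwnedDataset:
    """Hypothetical stand-in for a dataset that belongs to exactly one user."""
    owner: str
    path: Path


def copy_dataset(source: OwnedDataset, requesting_user: str, target_dir: Path) -> OwnedDataset:
    """Copy a dataset for a user.

    Assumption: if the requester already owns the file, it can be reused as-is;
    if it is owned by someone else, the bytes are physically duplicated.
    """
    if source.owner == requesting_user:
        return source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.path.name
    shutil.copyfile(source.path, target)
    return OwnedDataset(owner=requesting_user, path=target)
```

This only covers the obvious case mentioned above; sharing, permissions, and purging would each need their own answer at the database, API, and UI layers.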

If there is an architecture for this that works, I guess I'd prefer to see it done, and doable, with existing object store constructs and models first.

If there was a connection between the Dataset and a User in the database and we implemented this throughout the app - then it seems like the pluggable media stuff would fit right in and seem natural.

I think solving all of that is a precursor to thinking about files instead of datasets? If we could say "this job's outputs should belong to user X", then we could create the datasets in that fashion, put them where they need to be in that user's object store, etc.

Human-readable/user-defined filename (& path) for datasets #7525