Update metadata file modeling to support data types stored as directories

diekhans commented 5 years ago

Problem statement:

The addition of the zarr-based imaging data format has challenged the current design of the metadata file entities. The purpose of this ticket is to design changes to the metadata model to reflect the ongoing challenges based on the storage of multi-file data types.

Before anyone panics: this is not a radical change to the metadata structure, this is mostly a rigorous definition of what the "file" metadata entities represent and a clear partition of responsibility between metadata and the DSS.

The current metadata model provides a single metadata entity for each physical file that a user can download. It contains the file's name and checksum, but not the file's UUID and version. This information must be obtained based on the bundle that the particular version of the file entity was loaded.

A zarr is a container of array data, similar to HDF5. The storage is in the form of a file system directory tree. The user addresses the container through software using the top level directory, not the individual files. This is also analogous to a fastq file being a container for sequences. There could be tens of thousands of individual files within a single zarr container.

Representing each individual file in a zarr container as a metadata file entity will create a large amount of overhead. The main value storing checksums for the individual files. This validation support duplicates information that can be obtained from the DSS. The overhead could be reduced by combining all of the file entities for a given zarr into a single JSON file, it is doubtful that recording this level of granularity is useful in the data model.

Proposal:

This proposal changes the meaning of the file entities from having a one-to-one representation of each physical data file to representing files or directories that are containers of data. This works well with the UNIX model of directories are files. A metadata entity represents an atomic unit. If any component of the container changes, this is considered a new version of the entire container.

The responsibility of the metadata files entities is to describe the container:

data type
encoding
file or directory name that is used to access the container
UUID and version (FQID) of the file or directory
natural language descriptions
other attributes that are not encoded in the container that maybe required to access the container
provenance of the file or directory

The DSS's responsibility is to track the physical attributes of the file or directory:

track content, including content of directories
storing total size
storing checksums
tracking bundles contain file
deduplication when new versions are created

The following changes to file_core will support this:

file_format - The data format of the file [existing field]. As proposed in issue Update file_format field from "fastq.gz" to "fastq", this field will be clarified to represent the actually data format, less encoding. For the zarr directories, the value would be "zarr".
file_encoding - New field describing how the file is encoded [new field]. For directory based containers, the value would be "directory".
file_fqid - The file UUID and version [new field]. This is used to query the DSS.
checksum - The checksum is now the responsibility of the DSS [deprecated and phased out].

Related issues

tburdett commented 5 years ago

The current metadata model provides a single metadata entity for each physical file that a user can download.

Or upload. There are critical data lifecycle events that run during ingestion of data that hang off events triggered when submitters upload a file, and depend on metadata for each uploaded data file.

Representing each individual file in a zarr container as a metadata file entity will create a large amount of overhead. The main value storing checksums for the individual files.

Assuming files in a zarr container are uploaded individually into the upload area, the main value to the ingestion service is actually critical. Ingest must know the address (the cloudUrl) for each individual file in order to attach it to the appropriate processes that generated them and thereby link it to the biomaterial metadata. Without this, I don't think it would be possible for ingest to know which files needed to be exported to the DSS.

Treating files that are part of zarr containers differently to other sorts of file requires that ingest and upload services bake in awareness of the semantics of zarr, or any other supported "container" formats. I can't think how one might do this without storing some individual file level metadata ("uploaded file foo.zattr is part of my zarray").

The overhead could be reduced by combining all of the file entities for a given zarr into a single JSON file, it is doubtful that recording this level of granularity is useful in the data model.

Given that it is the ingestion service that assigns UUIDs/FQIDs for each uploaded file, ingest (at least) does need to care about this level or granularity. I agree most consumers of the data will not.

The problems this proposal causes to ingestion and upload services are significant (for these services, this is a radical proposal!). I do think those problems can be offset by allowing containers to be uploaded in a archive format (e.g. just upload and describe in metadata a tarball). This has the advantages of making upload massively more straightforward for submitters, retaining wholesale most aspects of this proposal and keeping lifecycle events simple for upload/ingest. The disadvantage is that to do this and meet consumer usecases, we would have to determine which bits of the DCP are responsible for unpacking archive formats for presentation to consumers.

But I think this could be quite simple. And could be solved by the proposed file_encoding=directory flag being used by the DSS/data browser/CLI to detect and dynamically unpack archive contents during storage/presentation

diekhans commented 5 years ago

The current metadata model provides a single metadata entity for each physical file that a user can download.

Or upload. There are critical data lifecycle events that run during ingestion of data that hang off events triggered when submitters upload a file, and depend on metadata for each uploaded data file.

If the directory container is considered atomic (which is really key to not tracking it in the metadata), when events would need to be generated on the container when all files are uploaded, not the individual files. Just as if it was a tar file.

I believe it would be up to the DSS to provide a mechanism to do this.

Maybe it is uploaded as a tar file and then the DSS extracts and de-dups in the case of an update.

diekhans commented 5 years ago

Treating files that are part of zarr containers differently to other sorts of file requires that ingest and upload services bake in awareness of the semantics of zarr, or any other supported "container" formats. I can't think how one might do this without storing some individual file level metadata ("uploaded file foo.zattr is part of my zarray").

For any of this to work, ingest would have to treat the directory and file hierarchy it contains as a single, logical file. No different than a treating a fastq as a container of reads.

I would hope that ingest should not have to bake in any knowledge of the content of the zarr, just that there is a directory hierarchy. Otherwise, this is a stupid ticket ;)

diekhans commented 5 years ago

Given that it is the ingestion service that assigns UUIDs/FQIDs for each uploaded file, ingest (at least) does need to care about this level or granularity. I agree most consumers of the data will not.

For this to work, the upload would assign an FQID only to the directory that it is sending. It would be entirely up to the DSS to assign FQIDs to each file. The download would be the same. The container is just the directory instead of a tar file. Directory tree would be a DSS concept.

Essentially s3 sync of the hierarchy instead of a single file.

diekhans commented 5 years ago

Assuming files in a zarr container are uploaded individually into the upload area, the main value to the ingestion service is actually critical. Ingest must know the address (the cloudUrl) for each individual file in order to attach it to the appropriate processes that generated them and thereby link it to the biomaterial metadata. Without this, I don't think it would be possible for ingest to know which files needed to be exported to the DSS.

Ingest would only link a process to the entire zarr, not any individual files within it. Ingest should/could not know or care about the internal structure. it's just bytes that happened to be spread out over multiple files instead of in one.

The DSS needs to make a directory behaving just like a single file or this gets very hard.

joshmoore commented 5 years ago

Ignoring any of the issues that @tburdett brings up, :+1: for file_encoding=directory, though it may suggest slightly different values for https://github.com/HumanCellAtlas/metadata-schema/issues/812 ; I'll comment there separately.

I do think those problems can be offset by allowing containers to be uploaded in a archive format (e.g. just upload and describe in metadata a tarball). (@tburdett)

Thinking about the user impact, there will be cases where users already have directory-style filesets that have been archived into a single. Supporting commands for archived and unarchived input would be appreciated to reduce the need for creating local duplicates which can be prohibitive. e.g:

one of (a) tar zcvf - /data | dcp upload - or (b) dcp upload /data as well as
one of (c) dcp upload data.tar.gz or (d) tar -xOzf data.tar.gz | dcp upload

Could file metadata contain an archive_format flag? (@tburdett)

I could also see using tar, tar.gz, zip, etc. as the file_encoding since if the DCP receives any of these, there's the potential for there being multiple files, making it ultimately a "directory" use case.

malloryfreeberg commented 5 years ago

Responding to the schema change suggestions:

file_format - The data format of the file [existing field]. As proposed in issue Update file_format field from "fastq.gz" to "fastq", this field will be clarified to represent the actually data format, less encoding. For the zarr directories, the value would be "zarr".

Agreed. Would need an update to the documentation of this field and added validation in ingest.

file_encoding - New field describing how the file is encoded [new field]. For directory based containers, the value would be "directory".

👍 I like this suggestion, as it appears to solve a at least 2 confusions/issues (fastq vs. fastq.gz; single file vs. directory)

file_fqid - The file UUID and version [new field]. This is used to query the DSS.

A FQID field should live in the provenance schema alongside the document_id (a UUID) field and the eventual new version field (currently the update_date field). All metadata entities - not just files - have FQIDs and are assigned by ingest, which means these ID/version fields should live in provenance.json (they then get imported by and stored in all type entity JSONs).

checksum - The checksum is now the responsibility of the DSS [deprecated and phased out].

Happy to deprecate this field, as long as the DCP can still support validating checksums between client and upload service during upload via the hca cli. I don't know how this currently happens, but this metadata field was a first attempt for supporting this requirement.

rolando-ebi commented 5 years ago

w.r.t checksums in metadata

If we want validation of checksums between user file-system -> upload-service, we'll need some user provided checksums. Might be best to build this into the hca cli tool, yeah (i.e have the hca cli tool generate checksums, or accept user-provided checksums, prior to upload and confirm them after upload)

diekhans commented 5 years ago

Note that is related to, but also may be different than #623. This describes an archive stored as a directory tree where the contents are opaque to the user. #623 could cover cases where the files are grouped, but the individual files and their meaning is transparent to the users, such as a pair of FASTQ files.

diekhans commented 5 years ago

In DSS call today:

It is suspect that the DSS may be able to support this already once `/' are allowed in file names, however, need to understand more the interaction with ingest
@kislyuk suggested that storing a zarr as a single bundle might be a good approach.
@diekhans needs to talk to @sampierson and Parth to understand original concepts behind ingest
Need to talk with people dealing with imaging data.

tburdett commented 5 years ago

Catching up - @joshmoore I personally love this suggestion:

Supporting commands for archived and unarchived input would be appreciated to reduce the need for creating local duplicates which can be prohibitive. e.g:

one of (a) tar zcvf - /data | dcp upload - or (b) dcp upload /data as well as

one of (c) dcp upload data.tar.gz or (d) tar -xOzf data.tar.gz | dcp upload

It makes total sense to me that you might legitimately want to support both cases, so the difference is explicit in metadata/media type (file_encoding=directory vs file_encoding=tar.gz)

This also speaks to @diekhans' point:

The DSS needs to make a directory behaving just like a single file or this gets very hard.

tburdett commented 5 years ago

I'll try to keep implementation details out of this thread, as this ticket should be to define what makes the most sense to define in the metadata schema to drive our usecases. But seeing as I mentioned some of the implementation issues already, I will summarise anyway... (!)

My major concern is to treat many files in a directory as a single S3 bucket object. It's obviously possible to "interpret" the directory structure in an S3 bucket based on file name prefixes, but upload and ingest won't be able to handle this nicely - I'll explain why below. Simpler is to package into one object (and do so in a manner that is transparent to the user). As @diekhans says:

Maybe it is uploaded as a tar file and then the DSS extracts and de-dups in the case of an update.

Upload notifies ingest of every file a user uploads, and metadata for each of these objects is tracked by ingest and compared to the user supplied manifest (provided in the spreadsheet). Assuming some sort of recursive upload of all files in a directory (dcp upload /data), this will result in n files in the upload bucket, with n checksums and cloud urls. Ingest/Upload will have to map these n files to the single user-provided metadata (for /data) and maintain this mapping during export to the DSS. This is not difficult in principle, but comes with several concerns e.g. How do you ensure the upload connection wasn't lost partway through? How do you checksum the directory?

Instead, I'd be in favour of the dcp upload /data command first packaging the directory into a tarball (transparently to the user), then uploading with a dedicated media type (application/hca+directory). Ingest can use this to track the tarball, checksum and export it, and the DSS can use either the media type or associated metadata (file_encoding=directory) to determine that is should unpack the tarball for presentation to users

tburdett commented 5 years ago

I should also say - I am happy to make myself available for conversations about the "original concepts behind ingest", although I have no doubt @sampierson and @parthshahva will be reliable sources of information for upload/ingest interactions. Also tagging @justincc in case he wants to follow.

parthshahva commented 5 years ago

w.r.t checksums in metadata

If we want validation of checksums between user file-system -> upload-service, we'll need some user provided checksums. Might be best to build this into the hca cli tool, yeah (i.e have the hca cli tool generate checksums, or accept user-provided checksums, prior to upload and confirm them after upload)

This is already in progress. We'll be using a 5th checksum calculating during upload and on server side to ensure corruption-free uploads. We'll be adding the same 4 tags to the files on the server side to allow for export to dss

parthshahva commented 5 years ago

I should also say - I am happy to make myself available for conversations about the "original concepts behind ingest", although I have no doubt @sampierson and @parthshahva will be reliable sources of information for upload/ingest interactions. Also tagging @justincc in case he wants to follow.

I'm struggling to follow this thread, but I'm happy to chat about the repercussions of changes on upload <=> ingest

rolando-ebi commented 5 years ago

@diekhans From an ingest point of view, I don't see any problem with introducing metadata that marks certain files as being directory-like.

The following changes to file_core will support this:

I think those are sensible changes but I am confused about the exact purpose of fqid here.

Representing each individual file in a zarr container as a metadata file entity will create a large amount of overhead

This reads as if you're suggesting we stop representing each file in a zarr container as a file entity; suggesting that we instead just track the zarr, and then the contained files are tracked some other way. Can you clarify whether this is what you're suggesting, in addition to the file_core changes?

diekhans commented 5 years ago

The following changes to file_core will support this:

I think those are sensible changes but I am confused about the exact purpose of fqid here.

Currently, the version of the file being represented by the file entity can only be found indirectly by looking it up in the bundle manifest from which the file entity was obtained. This extra step makes the metadata model less clear and more cumbersome to work with. Adding FQID creates a clear boundary of responsibility between metadata and DSS. It also means the provenance of a file can be traced, and even the file obtained, without knowing or caring about the bundle structure

diekhans commented 5 years ago

Representing each individual file in a zarr container as a metadata file entity will create a large amount of overhead

This reads as if you're suggesting we stop representing each file in a zarr container as a file entity; suggesting that we instead just track the zarr, and then the contained files are tracked some other way. Can you clarify whether this is what you're suggesting, in addition to the file_core changes?

Yes, this is exactly what I am suggesting :-) The ability to treat a directory tree as an archive rather than individual files. Analogous to HDF5 or a tar file.

ESapenaVentura commented 3 years ago

Not sure if this is relevant anymore @clairerye ?

HumanCellAtlas / metadata-schema