Metadata to describe the content of a File.

hewgreen commented 5 years ago

For which schema is a change/update being suggested?

~type/file/supplementary_file.json and~ core/file/file_core.json

What should the change/update be?

For file_core.json:

[x] add a field that describes the content of the file (content_description)
[x] add either ontology to describe valid values for this field (EDAM, EDAM-BIOIMAGING)

For supplementary_file.json:

[x] ~remove file_description which is now redundant with content_description~ - Moving to a new ticket as this is a major version change and we are in a metadata freeze at the moment.

What new field(s) need to be changed/added?

Field name: content_description
Field description: General description of the contents of the file.
Field type: string
Required: yes(?)
Example: protocol; read alignment; reference gene list; matrix; ...
CV or enum: Should be an ontology (EDAM, EDAM-BIOIMAGING)

For full discussion (requirements and options) please see the RFC.

Useful bit:

Add a ‘file content type’ ontology field onto the supplementary file entity (this is not the same as file format, this would be about information contained in the file). Previously we discussed typing the supplementary files themselves rather than the process/protocol but the field would be the same. Either a granular enum requiring updates as we are presented with new files or a more general typing. The discussion on the more general typing suggested this was of limited use. A protocol would not be required to determine what the supplementary file contained.

Why is the change requested?

Our current mechanism for describing a file (sequencing, analysis, supplementary, imaging) does not capture what it contains which is useful for downstream interpretation. The Browser could use this information to display to users what is actually in the files they are about to download. Currently, only the file extension is shown, but this is not sufficient information to help users decide if they are downloading the files they need/want. Showing the file name is also not sufficient.

We should get a full list of file contents uploaded by data contributors and produced by the analysis pipelines. These values should populate the enum and can be extended when the DCP encounters files with new content (e.g. imaging).

List of needed terms:

Raw data/metadata files from contributor:

New terms

Protocol - A procedure for carrying out a scientific experiment (citation). - EDAM term under Data - REQUESTED
Starfish File - for tiffs, zarrays and all data json related to a starfish formatted project. Not including the key file experiment.json that requires explicit tagging for the validator to work. missing from EDAM ontology - ON HOLD UNTIL MORE CLARITY
Starfish Experiment File - marker for a key file experiment.json. This file points to all other starfish files and is required for validation. missing from EDAM ontology - ON HOLD UNTIL MORE CLARITY

Existing terms

DNA sequence (raw) or Sequence - for describing raw fastq files
Sample annotation or Reference sample report - for describing extra sample or donor information
Electronic health record - for describing clinical information about donors
Experimental report - for describing any non-protocol, non-sample information about the experiment that generated data
~[Codebook]() -Extracted in metadata so explicit tag not needed. This is pointed to from experiment.json so explicit tag is not needed for validators to run.~

Processed data/metadata files from pipelines:

New terms

Alignment reference - One or a set of reference molecular sequences, structures, or profiles used for alignment of genomic or transcriptomic experimental data. - EDAM term under Data/Alignment - REQUESTED
Count matrix - A table of values representing summarised read counts per genomic region (e.g. gene, transcript). (modified citation) - EDAM term under Matrix - REQUESTED
Quality control metrics - Results of testing an output or a sample of an output against the desired specification or standard (modified from citation). - EDAM term under Report - REQUESTED

Existing terms

Sequence alignment - for describing bam/bai output files
~Genome index - for describing what?~

Miscellaneous (from matrix service? downloadable metadata?):

New terms

Existing terms

Gene expression matrix - for describing gene expression matrices (do we need this?) This may be in zarr format at the moment. This is different to zarr formatted image files generated by starfish.
Sequence metadata - for describing a downloaded, merged metadata file about sequencing data
Image metadata - for describing a downloaded, merged metadata file about imaging data

malloryfreeberg commented 5 years ago

To-do: Write up specific use cases of what types of extra files might request to be included in a project.

hewgreen commented 5 years ago

This ticket needs to be expanded to include type enum for all files. This property would live on file core. Use cases to follow from the browser team.

malloryfreeberg commented 5 years ago

@NoopDog @hannes-ucsc We are going to prioritize this ticket. Do let us know if there are any other specific requirements you have or use cases you'd like to be supported.

We are open to suggestions for the name of the field, the enum list, etc.

Also pinging @kbergin and @mckinsel for help with establishing a useful list of enums.

malloryfreeberg commented 5 years ago

I'm not sure whether to make this field required. Making it required means the Browser can depend on it for displaying information for file/manifest download. But, making it required means that data currently in the DS will be missing it and pipelines team will need to update their code to use it.

kbergin commented 5 years ago

We do plan to re-run all 10x data soon as we put in our new Optimus pipeline (within the quarter), so it's not unreasonable to add this metadata then. The SS2 ones we could also reanalyze, although for those I'd rather wait until we have the analysis bundle versioning in place as it would be the same pipeline and would thus be a version of the previous analysis.

On that note - do you think with the bundle versioning and updating we're planning in that RFC we'd be able to just update the analysis bundles with the new metadata information without needing to re-run the analysis? Saves the whole system a lot of hassle / money.

malloryfreeberg commented 5 years ago

do you think with the bundle versioning and updating we're planning in that RFC we'd be able to just update the analysis bundles with the new metadata information without needing to re-run the analysis?

Absolutely! This would be an ideal use case of the update procedure :)

malloryfreeberg commented 5 years ago

@claymfischer I'd like to push this change forward soon since it's a field used by both the pipelines and Azul. It will also greatly enhance the Browser to be able to display this field to users. I'd like to give everyone a heads up that we're doing this, and also ask for help from the pipelines team to come up with valid values for their analysis files. Can you please put it on the agenda for the Jan 28 metadata call? Thanks!

claymfischer commented 5 years ago

Adding now, thanks!

hannes-ucsc commented 5 years ago

@kbergin

do you think with the bundle versioning and updating we're planning in that RFC we'd be able to just update the analysis bundles with the new metadata information without needing to re-run the analysis?

@malloryfreeberg

Absolutely! This would be an ideal use case of the update procedure :)

I agree that this should be the goal. I don't see any reason it shouldn't work as far as Data Browser/Azul are concerned.

I have to admit that I didn't read the RFC yet. Until I do, I wanted to mention that I agree that this field should be an ontology with the current label alongside the term ID as usual. One use case that came up recently is being able to distinguish types of matrices. For example, something like filtered_feature_bc_matrix should be an ontology term so we accurately describe this output of a 10x pipeline.

kbergin commented 5 years ago

@malloryfreeberg I'm working with @jodihir later this week to produce more/better documentation on the outputs of our pipelines. This is a first step I'll need to take to be able to help you get a list of values for our analysis files. I agree ontologies are a great idea - @hannes-ucsc

malloryfreeberg commented 5 years ago

@hannes-ucsc @kbergin thanks for the feedback! We'll have a short discussion on the metadata call next Monday on this, but the next step is definitely to get a list of useful values.

malloryfreeberg commented 5 years ago

@kbergin @jodihir Any update on this ticket?

@simonjupp @daniwelter is there an existing ontology that might cover the values for this field?

simonjupp commented 5 years ago

There are ontologies that have terms for common file types and formats e.g. OBI, IAO and EDAM. Have a poke around in OLS to see what you find and then we can setup a process for importing/creating the terms you need in EFO.

jodihir commented 5 years ago

@kbergin and I have started working on the outputs documentation. Kylee is on vacation this week, so we'll get back to it when she returns.

malloryfreeberg commented 5 years ago

It looks like there might be some useful terms in EDAM:

But I don't see terms for things like "log output" or "filtered_feature_bc_matrix".

Looks like EDAM might suffice for a subset of terms, but we'll need that full list prior to figuring out what terms we might need to add to the ontology.

joshmoore commented 5 years ago

:+1: for Image as well, though the lower-level doesn't isn't necessarily relevant to the HCA. See also EDAM Bioimaging's Formats but they are also not complete.

dosumis commented 5 years ago

In VFB we use FOAF:image and record image content via FOAF:depicts.

e.g.

dosumis commented 5 years ago

BTW - the term 'confocal microscopy' here comes from FBbi - which we are using for imaging metadata. We already have one ontology module that references it. I agree that OBI could potentially provide terms like this, but it is currently quite poorly developed for imaging, whereas FBbi is well developed and of reasonably high quality.

e.g. compare https://www.ebi.ac.uk/ols/ontologies/fbbi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FFBbi_00000251

with

https://www.ebi.ac.uk/ols/ontologies/obi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FOBI_0002119

malloryfreeberg commented 5 years ago

@jodihir @kbergin any update on this? If you look at the bottom of the original ticket (List of needed terms), I started a list based on what I know of the terms I think we'd need. I'm especially interested in your thoughts on whether the "Processed data/metadata files from pipelines" list is complete, what else is needed, and if what I have there makes sense. But if you have any insight into the other two categories, would love to hear that as well :)

kbergin commented 5 years ago

Hi @malloryfreeberg I'm so sorry for the delay. It's been a really crazy few weeks. We are working on a list of outputs from our pipelines that you can check out while we work on it, here. We have been working on it but haven't been able to prioritize it with the rest of the team's work for review by the CBs yet. So I wouldn't make any changes in response to this list yet. We are very heads down on Optimus for the next few weeks, I don't know that we'll be able to finalize this quickly.

malloryfreeberg commented 5 years ago

@kbergin ok, thanks for the update :) I understand about Optimus, and I will probably poke again in a few weeks if I don't hear anything just so that I can keep on this for our own prioritization. Thanks!

kbergin commented 5 years ago

Thank you!!

On Fri, Mar 1, 2019 at 11:36 AM Mallory Freeberg notifications@github.com wrote:

@kbergin https://github.com/kbergin ok, thanks for the update :) I understand about Optimus, and I will probably poke again in a few weeks if I don't hear anything just so that I can keep on this for our own prioritization. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/HumanCellAtlas/metadata-schema/issues/542#issuecomment-468726722, or mute the thread https://github.com/notifications/unsubscribe-auth/AJAxJN5tRKIARk8DX5cKlrVEtHSHb0Ldks5vSVbwgaJpZM4WlcPO .

-- Kylee Degatano Senior Software Product Manager Human Cell Atlas Data Coordination Platform Data Sciences Platform, Broad Institute

joshmoore commented 5 years ago

Thinking about (quoting from the description):

Other imaging inputs?? - for describing imaging inputs
zarr files?? - for describing what?

after a recent conversation on slack, I think there's a pattern here for filesets (or trees of files) in which you want to talk about a collection and can often do it by using what I've referred to elsewhere as a "key file" as a proxy for the collection but then the rest of the files might not be worth further description. Certainly as a user I wouldn't want to be asked to fill out:

$ find . | grep -vE "[1-9]" | xargs printf "%-30s|\n"
...
./.zattrs                     | zarr attribute file (json)
./.zgroup                     | zarr group file (json)
./0                           | zarr chunk (raw)
./0/.zattrs                   | zarr attribute file (json)
./0/.zgroup                   | zarr group file (json)
./0/0                         | zarr chunk (raw)
./0/0/.zarray                 | zarr array file (json)
./0/0/.zattrs                 | zarr attribute file (json)
./0/0/0                       | zarr chunk (raw)

Similarly for the TIFFs and most of the JSON files in SpaceTx. experiment.json can certainly be treated as a key file like ./.zgroup here; codebook.json might be as well, though experiment.json points to it. Alternatively, from a collection of files, a piece of code could certainly detect and return all of the experiment.json files.

cc: @hewgreen

zperova commented 5 years ago

Here is what we get from contributors with the current SpaceTx format specification:

codebook.json - maps patterns of intensity in the channels and rounds of a field of view to target molecules experiment.json - links the data manifests and codebook together nuclei.json - links data manifest and nuclei images together nuclei-fov.json - contains information about field of view and each tile contained in it for nuclei images primary_images.json - links data manifest and primary images together primary_images-fov.json - contains information about field of view and each tile contained in it for primary images nuclei-fov.tiff - images of the nuclei primary_images-fov.tiff - primary images

hannes-ucsc commented 5 years ago

@joshmoore

I think there's a pattern here for filesets (or trees of files) in which you want to talk about a collection and can often do it by using what I've referred to elsewhere as a "key file" as a proxy for the collection but then the rest of the files might not be worth further description

What if there isn't such a key file? I think it would be preferable to put all related files into a directory (which the DSS is working on supporting) and then in describing the directory with metadata as if it were a single file.

hannes-ucsc commented 5 years ago

… unless the plan is to not even include the related files in the bundle. If only the key file is mentioned in the bundle, we don't need a directory inside the bundle.

diekhans commented 5 years ago

The current overal design has files as the atomic unit, both on the server and what the user sees. I believe to try to hack around that to support file groups is going to cause issues in the long run. There may be times in the future where we need user-friendly names for specific files. A user might wish to verify checksums.

I do believe there is a problem with the proliferation of file.josn with end the bundle that could be addressed by having the group entity containing the file entities.

joshmoore commented 5 years ago

Here is what we get from contributors (@zperova):

What do you foresee data submitters needing to do when uploading these json files?

What if there isn't such a key file? (@hannes-ucsc)

I only know of one quite old format for which one couldn't define a key file. But I agree, that dealing with the directory in its entirety would suffice, if not even be preferable. (Thinking about it, I realize depending on how the zarr is constructed it may also have "multiple key files" if there are no groups and only multiple top level arrays.)

A user might wish to verify checksums. (@diekhans)

The number of json files is fairly small for a single experiment but tens to hundreds of thousands of binary will occur.

diekhans commented 5 years ago

A user might wish to verify checksums. (@diekhans)

The number of json files is fairly small for a single experiment but tens to hundreds of thousands of binary will occur.

Sorry, I was referring to the metadata json files, not data json files. The current metadata schema requires a file-.json metadata file. Having hundreds of thousands of ile-.json files to keep the data files company would be a bit insane. My point is that is is also not good practice to not have metadata files to track these files. We need to stored the metadata differently.

:-(

hannes-ucsc commented 5 years ago

File checksums are tracked in the DSS. There is extensive machinery in place in Ingest, Upload and DSS to ensure integrity. There is no need to burden the metadata with that.

~Any objections to moving forward with using one metadata entity to describe a directory of files inside a bundle as currently being implemented by the DSS team?~

[Edit: wrong ticket]

diekhans commented 5 years ago

File checksums are tracked in the DSS. There is extensive machinery in place in Ingest, Upload and DSS to ensure integrity. There is no need to burden the metadata with that

The integrity example is on the user side. Do I still have all the files, what versions are they, etc. How are versioned tracked? Does the whole directory have a single version? Did my undergrad overwrite a file?

My concern with not with the DSS, but with the HCA "metadata" and how groups of files are modeled. The DSS is supposed to be agnostic to the metadata. The metadata really needs to be agnostic to the DSS as well. If a group of files is a blob, then the metadata and versioning model needs to reflect that. The update model needs to reflect it. You get a whole new blob if one file changes, etc.

I am not arguing for one metadata representation over another, but that it be clearly defined.

If it is, I am being clueless, please point me at it.

What does a bundle manifest look like?

> [Edit: wrong ticket] 🥇

hannes-ucsc commented 5 years ago

The DSS exposes the checksums in GET /bundles and GET /files. There is no need to track checksums in the metadata. End-to-end integrity is guaranteed by the DCP as a whole, independently of the metadata.

hannes-ucsc commented 5 years ago

What does a bundle manifest look like?

https://dss.data.humancellatlas.org/v1/bundles/ffffa79b-99fe-461c-afa1-240cbc54d071?version=2019-02-27T223320.296197Z&replica=aws

diekhans commented 5 years ago

I am sorry I used checksums as an example; it is clearly missing my point, which is about data modeling and representation by the metadata data.

But to beat the dead horse, the metadata does track the checsum:

https://github.com/HumanCellAtlas/metadata-schema/blob/master/json_schema/core/file/file_core.json

diekhans commented 5 years ago

The metadata currently does not contain enough information to obtain the checksum from the DSS, is that is the desired approach.

diekhans commented 5 years ago

What does a bundle manifest look like?

https://dss.data.humancellatlas.org/v1/bundles/ffffa79b-99fe-461c-afa1-240cbc54d071?version=2019-02-27T223320.296197Z&replica=aws

I meant to ask, what does the bundle manifest look like with a directory with 10000 files in in?

hannes-ucsc commented 5 years ago

metadata currently does not contain enough information to obtain the checksum from the DSS

I don't think that's true — if you know the bundle UUID.

I meant to ask, what does the bundle manifest look like with a directory with 10000 files in in?

I don't think it is practical to have giant JSON structures with 10000 or more files in it. Nevertheless, it is still useful to allow groups of, say less than 1000 files within a bundle. I've already proposed a solution for the imaging datasets but I am tired/afraid of repeating it.

Consider a AWS: They put limits on everything, sometimes these limits are painfully low. But it is those very limits that allow AWS to focus their scaling efforts on a limited number of, or even a single dimension. If a client's architecture pushes those limits, the client should consider modifying their architecture so it scales along the same dimension as AWS. The DSS team is pursuing scaling along two dimensions: number of files in a bundle AND number of bundles.

But to beat the dead horse, the metadata does track the checsum

It doesn't hurt to repeat the checksum in the metadata but the authority on checksum are the DSS GET /bundles and GET /files responses.

diekhans commented 5 years ago

metadata currently does not contain enough information to obtain the checksum from the DSS

I don't think that's true — if you know the bundle UUID.

The bundle UUID is not contained in the metadata; especially hard since the metadata is duplicated. If you actually try to construction the metadata graph, it requests some interesting merge of informatization in the metadata, links.json, and the metadata manifest. If we had versioned bundles, it would be much, much harder. Essentially, time traveling bundle history.

Yet, this is a requirement about being able to implement the FAIR goal.

This isn't a complaint about the DSS, it is a plea to get the components aligned and implementing the same archiecture.

hannes-ucsc commented 5 years ago

The bundle UUID is not contained in the metadata

The only way to get to the metadata is via a bundle so it is safe to assume that the bundle UUID is available. So again, given a bundle, you have everything. How do you get a bundle? You query the DSS or Azul.

Essentially, time traveling bundle history.

I think you are solving a problem 1% of the user's have. Most people want the current metadata. Including all versions in one graph is overkill. We need the latest graph and a way to look at older versions of the graph, not a single graph containing all versions of everything. Dare I say that we don't even need to give users a giant JSON graph? We should give them TSVs since they are asking for them and query APIs because we know better ;-)

plea to get the components aligned and implementing the same architecture.

More than pleas, we need concrete proposals and the authority to try and fail. The constant deliberation is agonizing.

diekhans commented 5 years ago

I don't think it is practical to have giant JSON structures with 10000 or more files in it

Looks like the current per-file bundle manifest entry is around ~500-600 bytes, including white space. So a 10000 file JSON structure, uncompress is like ~5mb. I routinely text 50 mb files into my text editor. Doesn't seem alarming to me.

Somewhere you have to track the files. As inefficient as JSON is, it doesn't seem to be near the point of inventing new mechanisms.

I am fine if this isn't explicitly represented in the metadata, as long as there is a clear path to follow to get the information for an exact version. That doesn't exist today .****

hannes-ucsc commented 5 years ago

I routinely text 50 mb files into my text editor. Doesn't seem alarming to me.

I think that analogy is lacking. Do you load hundreds of thousands of those files into your text editor? Do you do this concurrently for 100 users?

Somewhere you have to track the files.

You don't. You should talk to Josh about how the imaging sets and zarray stores work.

diekhans commented 5 years ago

I routinely text 50 mb files into my text editor. Doesn't seem alarming to me.

I think that analogy is lacking. Do you load hundreds of thousands of those files into your text editor? Do you do this concurrently for 100 users?

Good point. Not necessarily a touchdown, but shows the need for some thinking.

Somewhere you have to track the files.

You don't. You should talk to Josh about how the imaging sets and zarray stores work.

As long as the atomic boundary is the blob of files, then we don't have to track it beyond what an AWS bucket tracks, that is fine, If we provide a reproducible method to verify the blob is a particular version (say defined ordering of md5s of md5s), then we can describe this in the metadata as a single version.

If we have to update parts of the blob, then it becomes interesting. If it can be handled by the new blob a whole becoming a new version, then the blob is still an atomic unit to the metadata. Save space with de-dup is fine.

If the blob is not an atomic unit, then you need to track or you can't be FAIR.

[we should work in the same office so we can whiteboard this]

joshmoore commented 5 years ago

How are versioned tracked? Does the whole directory have a single version? Did my undergrad overwrite a file? (@diekhans)

The fact that an undergrad did overwrite a single file should be detectable, but the invalidation would apply to the entire blob of files. A single TIFF (or a typo in a single data json) impacts the entire dataset. (Similarly for a zarr)

I don't think it is practical to have giant JSON structures with 10000 or more files in it. (@hannes-ucsc)

Just so I can follow along, you're talking here about the metadata json as opposed to the data json files (which will do exactly the same thing)?

As long as the atomic boundary is the blob of files, (@diekhans)

It does periodically happen that one would want to correct a single file in the blob of files, but having that registered as a (de-duplicated) new blob seems sensible to me since it will change the results of analysis, etc.

zperova commented 5 years ago

Here is what we get from contributors (@zperova):

What do you foresee data submitters needing to do when uploading these json files?

Upload of json data files is no different than of the tiff data files. I am not sure I understand the question.

To bring this back to where I left off, I think that the user might want to download the codebook.json alone, while the experiment.json alone is not as useful without the data files themselves. Taken this I rather give direct access to the codebook.json file as well.

hannes-ucsc commented 5 years ago

I don't think it is practical to have giant JSON structures with 10000 or more files in it. (@hannes-ucsc)

Just so I can follow along, you're talking here about the metadata json as opposed to the data json files (which will do exactly the same thing)? (@joshmoore )

I think I got mixed up. I was talking about a bundle (the manifest JSON) that explicitly lists all the files in an imaging set or a metadata JSON that does that. I think both are unnecessary. I don't think large JSON data files are NOT a problem. [edit: added the NOT]

How are versioned tracked? Does the whole directory have a single version? Did my undergrad overwrite a file? (@diekhans)

The fact that an undergrad did overwrite a single file should be detectable, but the invalidation would apply to the entire blob of files. A single TIFF (or a typo in a single data json) impacts the entire dataset. (Similarly for a zarr)

One can't overwrite files in the DSS. One can only create a new version. One can supply the version when one PUTs a file. If one uses the same version for all files in a set, one can avoid files being silently updated.

diekhans commented 5 years ago

I chatted with Hannes about this today and am convinced that the File metadata for can just represent a complex directory of files like the zarray format, with just one analysis_file entity. It could have a file_type of "zarray" and file_encoding of "directory" (assuming this attribute is added. The DSS would be responsible for the content management.

However, we had a very interesting discussion/disagreement about the meaning of file entities. Does a version of the file entity have a one-to-one relationship to a version of a file, or is the version of the entity just describing the version of the data in the entity? With the file version information in the bundle. However, that should be subject of another thicket.

joshmoore commented 5 years ago

The fact that an undergrad did overwrite a single file should be detectable,

One can't overwrite files in the DSS.

I think this refers more to what happens pre-upload (as an AUDR scenario).

malloryfreeberg commented 5 years ago

https://docs.google.com/spreadsheets/d/1gKVN7s_66zwQSiZB-r3IbfN18raEArahWAD4wCZnE4c/edit#gid=0

joshmoore commented 5 years ago

Interesting. To comment on

"zarr store? sparse matrix?"

from the spreadsheet, the .zgroup and .zattrs files are plain json files. .zgroup is very restrictive, but is required to turn a directory into a zarr fileset. .zattrs is much more flexible and is intended for arbitrary user metadata.

malloryfreeberg commented 5 years ago

@kbergin @jodihir and progress on supplying a list of required file content descriptions/ontology terms?

HumanCellAtlas / metadata-schema