HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Ontologize file_core.format field. #812

Open malloryfreeberg opened 5 years ago

malloryfreeberg commented 5 years ago

For which schema is a change/update being suggested?

I would like to request a major update to the file_core.json schema.

What should the change/update be?

I would like to ontologize the file_core.format field. Currently it is a free-text string field. Using OLS I can see terms in EDAM for fastq, JSON, and tiff, which are the three main primary data file types I can think of right now for sequencing and imaging experiments.

I can also find terms for csv, tsv, and txt. There are likely more analysis file types that need coverage in the ontology (e.g. zarr, mtx).

What new field(s) need to be changed/added?

No new fields.

Why is the change requested?

Ontologizing this field will mean that datasets will have standardized values for display in the Browser.
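To make the request concrete, here is a minimal before/after sketch, assuming the new field follows the `text`/`ontology`/`ontology_label` pattern used by other HCA ontology modules (the field names and the EDAM ID shown are illustrative, not the final schema):

```python
# Hypothetical shapes of file_core.format: free-text string today vs.
# an HCA-style ontology object. EDAM format_1930 is the FASTQ term.
free_text_format = {"file_core": {"format": "fastq"}}

ontologized_format = {
    "file_core": {
        "format": {
            "text": "fastq",                 # submitter-supplied label
            "ontology": "EDAM:format_1930",  # curated EDAM term
            "ontology_label": "FASTQ",       # canonical label from OLS
        }
    }
}

def format_label(file_core: dict) -> str:
    """Return a display label regardless of which shape the field uses."""
    fmt = file_core["format"]
    if isinstance(fmt, dict):
        return fmt.get("ontology_label") or fmt["text"]
    return fmt
```

A shim like `format_label` is the kind of tolerance consumers (e.g. the Browser) would need during a migration from the string field to the ontology object.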

hewgreen commented 5 years ago

Great idea @malloryfreeberg! +1

.npy is also missing from the ontology but required.

zperova commented 5 years ago

@hewgreen why is .npy required? We won't be getting any files in this format; SpaceTx works with it, but everything will be converted to TIFF. In any case, we should not be ingesting any data in .npy format.

daniwelter commented 5 years ago

I've spoken to Jon Ison, one of the developers of EDAM, and we should be able to get new terms into the ontology if we need them. Batched terms with definitions would be preferable to a trickle of one-offs.
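Before batching a request, each candidate format can be checked against EDAM via the OLS search API. A small sketch of building such a query (URL construction only; the endpoint is the public OLS search endpoint, and any hits would still need manual review):

```python
from urllib.parse import urlencode

# Public OLS search endpoint, restricted here to the EDAM ontology.
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"

def edam_search_url(label: str) -> str:
    """Build an OLS query that checks whether `label` already has an EDAM term."""
    return OLS_SEARCH + "?" + urlencode({"q": label, "ontology": "edam"})

# e.g. edam_search_url("fastq") can be fetched with any HTTP client;
# matching terms typically appear under response["response"]["docs"],
# each with an "obo_id" and "label".
```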

malloryfreeberg commented 5 years ago

We can at least batch what we know for the pipelines output now (we'll just take them from an analysis bundle). And maybe take an educated guess for imaging inputs/outputs now? @hewgreen @zperova

hewgreen commented 5 years ago

We need to ingest tiff and json right now. This may change when the rest of the DCP has seen some imaging datasets and starfish has further defined its formats.

zperova commented 5 years ago

Yes, tiff and json for now. Starfish will guide the format choice later on.

malloryfreeberg commented 5 years ago

@hewgreen or @zperova can one/both of you make the term + definition list Dani mentioned?

daniwelter commented 5 years ago

@malloryfreeberg @hewgreen @zperova just to clarify - I only need labels and definitions for terms that aren't already in EDAM. For any existing terms, a list of what you need would be helpful.

zperova commented 5 years ago

Here's the list (to start with): TIFF, OME-TIFF, JSON, FASTQ, TSV, CSV, TXT, PDF, PNG, JPG, BAM, BAI, SAM, FASTA, XML, CRAM, YAML, XLS, XLSX, HDF5

Not in EDAM: ZARR, MTX, LOOM, NPY
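For bookkeeping, the two lists above can be captured directly in code. This is purely an illustrative helper derived from the lists as given, not an ingest component:

```python
# Formats zperova listed as already covered by EDAM terms.
IN_EDAM = {
    "tiff", "ome-tiff", "json", "fastq", "tsv", "csv", "txt", "pdf", "png",
    "jpg", "bam", "bai", "sam", "fasta", "xml", "cram", "yaml", "xls",
    "xlsx", "hdf5",
}
# Formats that would need new EDAM terms requested.
NEEDS_NEW_TERM = {"zarr", "mtx", "loom", "npy"}

def term_status(fmt: str) -> str:
    """Classify a format string against the two lists above."""
    fmt = fmt.lower().lstrip(".")
    if fmt in IN_EDAM:
        return "existing EDAM term"
    if fmt in NEEDS_NEW_TERM:
        return "request new term"
    return "unknown"
```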

joshmoore commented 5 years ago

The suggestion in https://github.com/HumanCellAtlas/metadata-schema/issues/880 for adding a file_encoding of "directory" might avoid needing all the granularity of EDAM.

Though JSON may still be needed for the codebook (https://github.com/HumanCellAtlas/metadata-schema/issues/542#issuecomment-470066724), it might be that the uploading of TIFF files as a separate entity is discouraged due to lack of context. SPACETX would then possibly be the best entry to go along with the multi-file ZARR entries.

Alternatively, I could see having it be the more generic IMAGING format, but that could eventually need differentiating between validators:

```
 imaging
  |__ spacetx
  \__ bio-formats
      |__ ...
      \__ ome-tiff
```

malloryfreeberg commented 5 years ago

New terms to add to EDAM under Format:

  1. Zarr - The Zarr format is an implementation of chunked, compressed, N-dimensional arrays for storing data. (citation)
  2. MTX - The Matrix Market matrix format stores numerical or pattern matrices in a dense (array format) or sparse (coordinate format) representation. (citation)
  3. LOOM - The Loom file format is based on HDF5, a standard for storing large numerical datasets. The Loom format is designed to efficiently hold large omics datasets. Typically, such data takes the form of a large matrix of numbers, along with metadata for the rows and columns. (citation)
malloryfreeberg commented 5 years ago

Zarr, Mtx, and Loom file formats have been requested: https://github.com/HumanCellAtlas/ontology/issues/35

malloryfreeberg commented 5 years ago

@ESapenaVentura The first action in this ticket is done. We can also do the second action here (make the new file format ontology schema), although we can't change file.file_core.format from a string to an ontology object yet. Can you take the second task here and make the new ontology schema?

ESapenaVentura commented 5 years ago

Yes, of course!

malloryfreeberg commented 5 years ago

Re-opening as one of the items is not done. This ticket was automatically closed by another PR being merged into master.

ESapenaVentura commented 5 years ago

~I'll take care of the last item in the list so we can move on!~ I forgot we were on a metadata freeze!

samanehsan commented 5 years ago

We use the file_core schema when submitting analysis file metadata to ingest 🙂 Here are file types we submit that have not been mentioned yet:

For reference, here are the analysis files that we submit for our two production workflows:

Optimus: https://api.ingest.data.humancellatlas.org/submissionEnvelopes/5d3b0c869be88c0008a9d714/files
SS2: https://api.ingest.data.humancellatlas.org/submissionEnvelopes/5d3af8b19be88c0008a9d6ef/files

There are a few that say "format: unknown", which is something that we'll need to update on our end as long as we're submitting free-text format types.

ESapenaVentura commented 5 years ago

@samanehsan, can you please give a brief summary of what those file formats contain (e.g., are they text matrices or binary objects) for .results, .npy and .npz? Are they standard file extensions, or custom extensions to identify the results in our system?

About the csv.gz file, there was an ongoing discussion about the encoding of the file and how to represent gzipped files. Right now we explicitly acknowledge in the metadata that a file is compressed via this "format" field, but I don't think this is the right direction to move forward. The system should be able to identify compressed files even if it's not stated there.

Any @HumanCellAtlas/ingest dev have any idea about this topic?
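On the compression point: gzip streams are self-describing, so the system could detect compression from the content itself rather than from the `format` string. A minimal stdlib sketch:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def is_gzipped(payload: bytes) -> bool:
    """Detect gzip compression from file content instead of metadata."""
    return payload[:2] == GZIP_MAGIC

# A compressed csv is recognised regardless of what "format" says:
# is_gzipped(gzip.compress(b"cell,count\nA,1\n"))  -> True
```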

samanehsan commented 5 years ago

> can you please give a brief summary of what those file formats are made of (e.g., are they text matrices, binary objects) for the .results, .npy and .npz? Are they standard extension for files, or is it a custom extension to identify the results in our system

I believe .npy and .npz are standard NumPy file formats but I'm not sure about .results. Any @HumanCellAtlas/pipelines-computational-biologists familiar with this?
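For what it's worth, both are standard NumPy formats and are recognisable from their leading bytes: .npy files start with the documented `\x93NUMPY` magic, and .npz files are ordinary zip archives of .npy members. A stdlib sniffing sketch (the classification helper itself is illustrative):

```python
NPY_MAGIC = b"\x93NUMPY"   # documented prefix of NumPy's .npy binary format
ZIP_MAGIC = b"PK\x03\x04"  # .npz files are plain zip archives of .npy members

def sniff_numpy_format(payload: bytes) -> str:
    """Classify a byte payload as npy, npz, or other by its magic bytes."""
    if payload.startswith(NPY_MAGIC):
        return "npy"
    if payload.startswith(ZIP_MAGIC):
        return "npz"
    return "other"
```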

mckinsel commented 5 years ago

The relevant .results files are described here. But they're just TSVs created by RSEM.

kbergin commented 5 years ago

I believe we are no longer outputting the npy and npz files from Optimus, only Zarr

zperova commented 5 years ago

@kbergin could someone confirm that npy and npz have been deprecated as output formats? thanks!

kbergin commented 5 years ago

@HumanCellAtlas/pipelines-computational-biologists Can someone confirm the above for me?

mshadbolt commented 5 years ago

We probably still need the info about them, since all the existing projects do have them, right?

samanehsan commented 4 years ago

Looking at a submission envelope created by today's integration test, optimus_v1.3.5 still outputs .npy and .npz files, specifically:

zperova commented 4 years ago

thanks @samanehsan

daniwelter commented 4 years ago

Is there an update on if/when file format will be ontologised? This came up on a FAIRplus call today.

ESapenaVentura commented 4 years ago

If I had to guess, we will ontologise it when we are able to evolve the schema again and there is capacity for it, but for now we don't have an estimate of when or how that would happen.

Did they talk about a specific ontology for file formats? Are we going in the right direction with EDAM?

daniwelter commented 4 years ago

@ESapenaVentura apologies, I didn't see your response earlier. Yes, this was specifically in relation to EDAM. FAIRplus intend to use the same ontology for file format validation.