Open malloryfreeberg opened 5 years ago
Great idea @malloryfreeberg! +1
.npy is also missing from the ontology but required.
@hewgreen why is .npy required? We won't be getting any files in this format. SpaceTx works with it, but everything will be converted to TIFF. In any case, we should not be ingesting any data in .npy format.
I've spoken to Jon Ison, one of the developers of EDAM, and we should be able to get new terms into the ontology if we need them. Batched terms with definitions would be preferable to a trickle of one-offs.
We can at least batch what we know for the pipelines' output now (we'll just take them from an analysis bundle). And maybe take an educated guess for imaging inputs/outputs now? @hewgreen @zperova
We need to ingest tiff and json right now. This may change when the rest of the DCP has seen some imaging datasets and starfish have further defined their formats.
Yes, tiff and json for now. Starfish will guide the format choice later on.
@hewgreen or @zperova can one/both of you make the term + definition list Dani mentioned?
@malloryfreeberg @hewgreen @zperova just to clarify - I only need labels and definitions for terms that aren't currently in EDAM already. For any existing terms, a list of what you need would be helpful.
Here's the list (to start with): TIFF OME-TIFF JSON FASTQ TSV CSV TXT PDF PNG JPG BAM BAI SAM FASTA XML CRAM YAML XLS XLSX HDF5
Not in EDAM: ZARR MTX LOOM NPY
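To batch the request to EDAM, the split above could be written down as a small lookup. This is only a sketch: the names `NEEDED_FORMATS`, `MISSING_FROM_EDAM`, and `split_formats` are made up for illustration, and the two lists are copied from the comments above rather than checked against EDAM.

```python
# Formats listed in this thread as needed (copied from the comments above).
NEEDED_FORMATS = [
    "TIFF", "OME-TIFF", "JSON", "FASTQ", "TSV", "CSV", "TXT", "PDF", "PNG",
    "JPG", "BAM", "BAI", "SAM", "FASTA", "XML", "CRAM", "YAML", "XLS",
    "XLSX", "HDF5", "ZARR", "MTX", "LOOM", "NPY",
]

# Formats reported in this thread as not being in EDAM at the time of writing.
MISSING_FROM_EDAM = {"ZARR", "MTX", "LOOM", "NPY"}


def split_formats(needed, missing):
    """Split `needed` into (already_in_edam, to_request), preserving order."""
    already_in_edam = [f for f in needed if f.upper() not in missing]
    to_request = [f for f in needed if f.upper() in missing]
    return already_in_edam, to_request


already_in_edam, to_request = split_formats(NEEDED_FORMATS, MISSING_FROM_EDAM)
# `to_request` is the batch that needs labels and definitions;
# `already_in_edam` is the plain list of existing terms we rely on.
```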
The suggestion in https://github.com/HumanCellAtlas/metadata-schema/issues/880 for adding a file_encoding of "directory" might avoid needing all the granularity of EDAM.
Though JSON may still be needed for the codebook (https://github.com/HumanCellAtlas/metadata-schema/issues/542#issuecomment-470066724), it might be that the uploading of TIFF files as a separate entity is discouraged due to lack of context. SPACETX would then possibly be the best entry to go along with the multi-file ZARR entries.
Alternatively, I could see having it be the more generic IMAGING format, but that could eventually need differentiating between validators:
imaging
 |__ spacetx
 \__ bio-formats
      |__ ...
      \__ ome-tiff
New terms to add to EDAM under Format:
Zarr, Mtx, and Loom file formats have been requested: https://github.com/HumanCellAtlas/ontology/issues/35
@ESapenaVentura The first action in this ticket is done. We can also do the second action here (make the new file format ontology schema), although we can't yet change file.file_core.format from a string to an ontology object. Can you take the second task here, to make the new ontology schema?
Yes, of course!
Re-opening as one of the items is not done. This ticket was automatically closed by another PR being merged into master.
~I'll take care of the last item in the list so we can move on!~ I forgot we were on a metadata freeze!
We use the file_core schema when submitting analysis file metadata to ingest 🙂 Here are file types we submit that have not been mentioned yet:
For reference, here are the analysis files that we submit for our two production workflows: Optimus: https://api.ingest.data.humancellatlas.org/submissionEnvelopes/5d3b0c869be88c0008a9d714/files SS2: https://api.ingest.data.humancellatlas.org/submissionEnvelopes/5d3af8b19be88c0008a9d6ef/files
There are a few that say "format: unknown", which is something that we'll need to update on our end as long as we're submitting free-text format types.
@samanehsan, can you please give a brief summary of what those file formats are made of (e.g., are they text matrices, binary objects) for the .results, .npy and .npz? Are they standard extensions for files, or is it a custom extension to identify the results in our system?
About the csv.gz file, there was an ongoing discussion about the encoding of the file and how to represent gzipped files. Right now we explicitly state in the metadata, via this "format" field, whether the file is compressed, but I don't think this is the right direction to move forward. The system should be able to identify compressed files even if it's not stated there.
Any @HumanCellAtlas/ingest dev have any idea about this topic?
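On identifying compressed files without relying on the "format" field: gzip streams always start with the two bytes 0x1f 0x8b, so compression can be detected from content. Below is a minimal sketch of that idea, not how ingest actually works; `is_gzipped` and `describe` are hypothetical helpers, and the returned dictionary shape is an assumption made for illustration.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzipped(stream):
    """Return True if the binary stream starts with the gzip magic bytes.

    The stream position is restored before returning.
    """
    pos = stream.tell()
    head = stream.read(len(GZIP_MAGIC))
    stream.seek(pos)
    return head == GZIP_MAGIC


def describe(filename, stream):
    """Hypothetical helper: report content format and compression separately."""
    compressed = is_gzipped(stream)
    # Drop a trailing ".gz" so the format reflects the content, not the wrapper.
    name = filename[:-3] if filename.endswith(".gz") else filename
    fmt = name.rsplit(".", 1)[-1] if "." in name else "unknown"
    return {"format": fmt, "compressed": compressed}


example = io.BytesIO(gzip.compress(b"gene,count\nA,1\n"))
info = describe("matrix.csv.gz", example)
# info["format"] is "csv" and info["compressed"] is True
```

With something along these lines, "format" could hold only the content type (csv) while compression is derived from the file itself.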
can you please give a brief summary of what those file formats are made of (e.g., are they text matrices, binary objects) for the .results, .npy and .npz? Are they standard extension for files, or is it a custom extension to identify the results in our system
I believe .npy and .npz are standard NumPy file formats but I'm not sure about .results. Any @HumanCellAtlas/pipelines-computational-biologists familiar with this?
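For reference, both NumPy formats can be recognised from their content: a .npy file begins with the magic string `\x93NUMPY`, and a .npz file is a zip archive whose members are .npy files. The sketch below uses only the standard library; `sniff_numpy_format` is a hypothetical name, and treating "every member ends in .npy" as the test for .npz is an assumption for illustration.

```python
import io
import zipfile

NPY_MAGIC = b"\x93NUMPY"  # magic prefix of the NumPy .npy format


def sniff_numpy_format(stream):
    """Return "npy", "npz", or "unknown" for a seekable binary stream.

    The stream position is restored before returning.
    """
    pos = stream.tell()
    head = stream.read(len(NPY_MAGIC))
    stream.seek(pos)
    if head == NPY_MAGIC:
        return "npy"

    is_zip = zipfile.is_zipfile(stream)
    stream.seek(pos)
    if is_zip:
        # ZipFile does not close a file object that was passed in by the caller.
        with zipfile.ZipFile(stream) as archive:
            names = archive.namelist()
        stream.seek(pos)
        # Assumption: an .npz archive contains only .npy members.
        if names and all(name.endswith(".npy") for name in names):
            return "npz"
    return "unknown"
```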
The relevant .results files are described here. But, they're just TSVs created by RSEM.
I believe we are no longer outputting the npy and npz files from Optimus, only Zarr
@kbergin could someone confirm that npy and npz have been deprecated as output formats? thanks!
@HumanCellAtlas/pipelines-computational-biologists Can someone confirm the above for me?
We probably still need the info about them since all the existing projects do have them though right?
Looking at a submission envelope created by today's integration test, optimus_v1.3.5 still outputs .npy and .npz files, specifically:
thanks @samanehsan
Is there an update on if/when file format will be ontologised? This came up on a FAIRplus call today.
If I have to guess, we will ontologise it when we are able to evolve the schema again and there is capacity for it, but for now we don't have any estimate of when or how that would be.
Did they talk about a specific ontology for file formats? Are we heading in the right direction with EDAM?
@ESapenaVentura apologies, I didn't see your response earlier. Yes, this was specifically in relation to EDAM. FAIRplus intend to use the same ontology for file format validation stuff.
For which schema is a change/update being suggested?
I would like to request a major update to the file_core.json schema.
What should the change/update be?
I would like to ontologize the file_core.format field. Currently it is a free-text string field. Using OLS I can see terms in EDAM for fastq, JSON, and tiff, which are the three main primary data file types I can think of right now for sequencing and imaging experiments.
format field to use ontology schema (major change)
I can also find terms for csv, tsv, and txt. There are likely more analysis file types that need coverage in the ontology (e.g. zarr, mtx).
What new field(s) need to be changed/added?
No new fields.
Why is the change requested?
Ontologizing this field will mean that datasets will have standardized values for displaying in the Browser.
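To make the proposal concrete, here is a rough sketch of what an ontologized format value and a check on it might look like. Everything here is an assumption: the object shape (`text`, `ontology`, `ontology_label`) is guessed by analogy with other ontology modules, the ID pattern follows EDAM's `format_NNNN` identifiers, and the example ID for FASTQ should be confirmed in OLS before use.

```python
import re

# EDAM format term IDs look like "format_1930"
# (full IRI: http://edamontology.org/format_1930).
EDAM_FORMAT_ID = re.compile(r"^format_\d+$")


def validate_format(value):
    """Return a list of problems; an empty list means the value is acceptable.

    Hypothetical check for an ontologized file_core.format value.
    """
    if not isinstance(value, dict):
        return ["format must be an ontology object, not free text"]
    problems = []
    if not isinstance(value.get("text"), str) or not value["text"]:
        problems.append("'text' is required and must be a non-empty string")
    ontology = value.get("ontology")
    if ontology is not None:
        if not isinstance(ontology, str) or not EDAM_FORMAT_ID.match(ontology):
            problems.append("'ontology' must look like 'format_1930'")
    return problems


# Assumed example; verify the EDAM ID in OLS before relying on it.
fastq = {"text": "fastq", "ontology": "format_1930", "ontology_label": "FASTQ"}
assert validate_format(fastq) == []
```

A free-text value such as "fastq.gz" would be rejected by this check, which is the standardization the Browser would benefit from.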