FAIRplus / FAIRPlus_squad2

an internal issue tracker (=todo list) for Squad team 2
Identify transcriptomics Minimal Information guidelines #9

Closed mcourtot closed 4 years ago

mcourtot commented 5 years ago

Is there an existing MI for transcriptomics data that we can assess ReSOLUTE against? The goal is to provide a metadata wish list for upcoming datasets. @agiani99 says it may be available in Oncotrack; it may be worth checking with David Henderson. @weiguUL discussed this during the WP1 F2F and is trying to get access to the metadata that are not sensitive; will bring this to squad 1 @nsjuty. @weiguUL will also ask David for a metadata checklist.

proccaserra commented 5 years ago

@mcourtot for the sake of clarity, can you specify if MI stands for:

If the former, then I would recommend using FAIRsharing identifiers to link the relevant standards in the field:

  1. Minimum Information about a Sequencing Experiment: a checklist for anything using NGS, which can be applied to RNA-Seq data.

  2. Sequence Read Archive XML: an XML format that captures experimental metadata and is accepted by the US, EU, Japan, and China (INSDC) repositories. The SRA XML schema uses controlled terms for all enum attributes and makes the following requirements:

All objects: name and/or unique identifier

- study_type
- Sample/TaxonID (sample information: must be an NCBITaxon ID)
- Experiment/library_layout
- Experiment/library_source
- Experiment/library_strategy
- Experiment/library_selection
- Run/Instrument/model
- Run/file_format
- Run/file_checksum_method
- Run/file_checksum

If endgame = public release, such criteria ought to be met.

Furthermore, depending on the organism or material being studied (plants, cell lines, environmental samples), additional requirements may apply (see ISA configurations and MIxS checklists for more information).

  3. MAGE-TAB: a tab-delimited format for experimental metadata, accepted by the EU ArrayExpress data repository.

  4. ISA-TAB/ISA-JSON: a tab-delimited format for experimental metadata that allows multi-omics datasets to be described. SRA-compatible configurations for sequencing applications ensure the capability to deposit to public archives. ISA is a native data type in Galaxy (a workflow engine) and is used by several publishers.

  5. Experimental Factor Ontology: a terminology artefact / application ontology used by EMBL-EBI ArrayExpress and the Data Curation group for term harmonisation.

  6. FASTQ: the de facto standard for sequence read data.

  7. BAM: the binary version of the SAM alignment file.

These standards cover the generic aspects of transcriptomics data, but not the specific metadata required to describe a particular biological system (say a cell, a patient, a tumour, a plant, a microorganism community).
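As a rough illustration of how the SRA XML requirements listed earlier in this comment could be turned into a "missing metadata" check, here is a minimal Python sketch. It assumes a dataset's metadata has already been flattened into a dict; the flat `Object/field` keys, the `name`/`identifier` key names, the `missing_fields` helper, and the example values are all hypothetical conventions chosen for illustration, not the actual SRA XML schema layout or any existing tool.

```python
# Required fields as listed in this thread. The flat "Object/field" keys are
# an illustrative convention (assumption), not the real SRA XML structure.
REQUIRED_FIELDS = [
    "study_type",
    "Sample/TaxonID",
    "Experiment/library_layout",
    "Experiment/library_source",
    "Experiment/library_strategy",
    "Experiment/library_selection",
    "Run/Instrument/model",
    "Run/file_format",
    "Run/file_checksum_method",
    "Run/file_checksum",
]

# "All objects: name and/or unique identifier". The object names and the
# "name"/"identifier" suffixes are hypothetical.
OBJECTS = ["Study", "Sample", "Experiment", "Run"]


def _present(record, key):
    """True if `key` exists in `record` with a non-empty value."""
    value = record.get(key)
    return value is not None and str(value).strip() != ""


def missing_fields(record):
    """Return the required entries that are absent or empty in `record`."""
    missing = [key for key in REQUIRED_FIELDS if not _present(record, key)]
    for obj in OBJECTS:
        # Each object needs a name, an identifier, or both.
        if not (_present(record, f"{obj}/name") or _present(record, f"{obj}/identifier")):
            missing.append(f"{obj}/name|identifier")
    return missing


# Placeholder record for illustration only (not real dataset metadata).
example = {
    "Study/name": "demo study",
    "study_type": "Transcriptome Analysis",
    "Sample/identifier": "S1",
    "Sample/TaxonID": "9606",
    "Experiment/name": "exp1",
    "Experiment/library_layout": "PAIRED",
    "Experiment/library_source": "TRANSCRIPTOMIC",
    "Experiment/library_strategy": "RNA-Seq",
    "Experiment/library_selection": "cDNA",
    "Run/name": "run1",
    "Run/file_format": "fastq",
}

print(missing_fields(example))
```

For the placeholder record above, the three absent `Run` entries (instrument model, checksum method, checksum) are reported as missing. The same pattern could be extended with organism- or material-specific fields, which would need to come from the relevant checklist rather than from this sketch.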

mcourtot commented 5 years ago

@proccaserra will bring this up at the ReSOLUTE call later today and propose next steps for identifying 'missing' metadata for FAIRness completeness. @proccaserra has done work on expressing MI checklists and will present it at the next call with Dominique.

mcourtot commented 5 years ago

Waiting for @danidi and @ulo from ReSOLUTE to join a call to discuss this; expected June 12th.