FAIRplus / FAIRPlus_squad2

an internal issue tracker (=todo list) for Squad team 2

UC8 - File grouping into datasets, Identifier consistency or common standard #58

Closed by mcourtot 4 years ago

mcourtot commented 5 years ago

“Find and retrieve all data files of one (http://arrays.jnj.com/arrays/inhouse/cell_lines/human/expressionSetRma.Rda) or several transcriptomics assays based on in-house identifiers.”

Note: all previous notes and requirements apply. There is also no distinction between platforms or technologies: the results can be a mix of microarray, sequencing, and other data file types.

Functional question: retrieve all relevant microarray data files, or sequencing files, or other file types, or microarrays from GEO.

It can be hard because:

1. When files are located in a single repository directory without assay information, I cannot find out which data files belong to a given assay. Information about the platforms used can also be missing, assay groups can have meaningless labels (“1”, “2”, “3”, etc.), and probes/variables can have equally uninformative labels (“1”, “2”, “3”, etc.).
2. When files contain identifiers for variables, those identifiers may differ between platforms or change over time, so it is hard to compare old and new identifiers. Sample naming conventions can be imposed, but these would be imposed at the experiment level.

Please make FAIR recommendations.

Extra requirements: a single transcriptomics assay can consist of several data files. The new case requires that a system identify all relevant data files related to each assay. It is helpful if all variables (the ‘things’ being measured, such as microarray probes, sequences, or protein targets of fluorescent probes) are consistent across all data files, or at least refer to standard terms. Concretely, that would mean, for example, that all Affy probe IDs refer to gene symbols or Entrez IDs, and that all sequencing strand IDs refer to gene symbols, Entrez IDs, or Ensembl exon IDs. In practice, only identifiers that link to public-domain identifiers are allowed. The practical advantage is that it becomes possible to use public information for analysis across datasets.
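To make the two requirements more concrete, here is a minimal Python sketch, not a proposed implementation. It assumes that each data file is registered in a manifest together with the in-house assay identifier it belongs to, and that a lookup table from local variable IDs to public identifiers exists. The `DataFile` record, both function names, and every path and ID in the example are hypothetical placeholders and do not come from this issue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """One manifest entry (hypothetical structure, for illustration only)."""
    path: str      # location of the file in the repository
    assay_id: str  # in-house assay identifier the file belongs to
    platform: str  # e.g. "microarray" or "sequencing"


def group_files_by_assay(manifest):
    """Return {assay_id: [DataFile, ...]} so all files of an assay can be found."""
    groups = defaultdict(list)
    for entry in manifest:
        groups[entry.assay_id].append(entry)
    return dict(groups)


def map_to_public_ids(local_ids, mapping):
    """Split local variable IDs into (mapped, unmapped) using a lookup table."""
    mapped = {i: mapping[i] for i in local_ids if i in mapping}
    unmapped = [i for i in local_ids if i not in mapping]
    return mapped, unmapped


# Made-up example records: none of these paths or IDs are real.
manifest = [
    DataFile("repo/a.cel", "ASSAY-001", "microarray"),
    DataFile("repo/b.cel", "ASSAY-001", "microarray"),
    DataFile("repo/c.fastq", "ASSAY-002", "sequencing"),
]
groups = group_files_by_assay(manifest)
assay_001_paths = [f.path for f in groups["ASSAY-001"]]

# Placeholder lookup table from local probe IDs to public gene identifiers.
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_B"}
mapped, unmapped = map_to_public_ids(["probe_1", "probe_2", "3"], probe_to_gene)
```

In this sketch a label such as “3” ends up in `unmapped`, which is one way to flag variables that do not link to a public-domain identifier. Grouping on an explicit assay ID rather than on directory layout is a deliberate choice here, since problem 1 above arises precisely when the directory carries no assay information.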

Owner: Jean Marc Neefs