EBI-Metabolights / guides

Representation of more than one species in a MetaboLights study #20

Open korseby opened 4 years ago

korseby commented 4 years ago

Dear all,

We are currently running an sDiv workshop in the field of Eco-Metabolomics as part of iDiv (https://www.idiv.de/en/smile). In Eco-Metabolomics we usually acquire metabolite profiles of many species and try to find and identify patterns across them.

How does MetaboLights handle studies with more than one species?

In the "samples" section, when I add different species to "Organism" (by choosing a corresponding NCBI taxonomy term for each), they automatically appear in the study. So far so good.

Let's assume we have a study with 3 species "A", "B" and "C", and we have measured 100 metabolites in these 3 samples. How do we indicate in the MAF file that metabolite 20 belongs to species A and B, but not C? There is already a "species" column in the MAF, but how do we indicate that a metabolite belongs to one or more particular species?

We came up with the following two ideas:

1) For identification, a compound can already have more than one identifier, and these are concatenated with a "|". Would something similar be possible for species?
2) Alternatively, the MAF could include a 0/1 matrix with one column per species name, indicating presence/absence of each metabolite in that species.
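To make the two ideas concrete, here is a minimal sketch in plain Python. It assumes that a missing intensity means "not detected" and that each sample maps to exactly one species. The sample names, intensities and helper functions are hypothetical and are not part of any MetaboLights tooling or the MAF specification.

```python
# Hypothetical toy data; none of these names or values come from a real MAF.
sample_species = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}

# metabolite -> {sample: intensity, or None when the feature was not detected}
intensities = {
    "metabolite_20": {"s1": 1200.0, "s2": None, "s3": 830.5, "s4": None},
    "metabolite_21": {"s1": None, "s2": None, "s3": None, "s4": 410.0},
}


def species_presence(intensities, sample_species):
    """Idea 2: build a 0/1 presence/absence row per metabolite, one key per species."""
    species = sorted(set(sample_species.values()))
    presence = {}
    for metabolite, row in intensities.items():
        # A species counts as present if any of its samples has a measured value.
        detected = {sample_species[s] for s, value in row.items() if value is not None}
        presence[metabolite] = {sp: int(sp in detected) for sp in species}
    return presence


def species_column(presence_row, sep="|"):
    """Idea 1: concatenate the species in which the metabolite was detected."""
    return sep.join(sp for sp, flag in presence_row.items() if flag)


presence = species_presence(intensities, sample_species)
for metabolite, row in presence.items():
    print(metabolite, species_column(row), row)
```

Both representations are derived from the same information, so either could be generated from the other. The sketch only gives meaningful results if missing values have not been filled in or imputed beforehand.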

At the workshop, we are preparing a study with profiles of more than 1000 species, so any hints on how we could handle multiple species would be very welcome.

Examples with similar study design:

https://www.ebi.ac.uk/metabolights/MTBLS520 or https://www.ebi.ac.uk/metabolights/MTBLS687

Best wishes, Steffen @sneumann , Pierre-Marie @oolonek, and Kristian

sneumann commented 4 years ago

Here is some more (historic) context, which might still exist in some people's inboxes.

From: Steffen Neumann sneumann@ipb-halle.de
To: Kenneth Haug kenneth@ebi.ac.uk, "Peters, Kristian" Kristian.Peters@ipb-halle.de
Cc: "isatools@googlegroups.com" isatools@googlegroups.com, David Johnson david.johnson@oerc.ox.ac.uk, metabolights-curation metabolights-curation@ebi.ac.uk
Subject: [was: MAF export to Metabolights ] Multi-species MAF files
Date: Thu, 31 May 2018 21:46:40 +0200
We are currently preparing the following multi-species studies:

MTBLS544:  Plant-Soil Feedbacks introduce Changes in the Metabolome of common Grassland Species
MTBLS655:  Metabolite Profiling: Blood or Ketchup
MTBLS671:  Semi-polar exudates and their relation in natural grassland communities
MTBLS679:  MacBeSSt
MTBLS687:  GC/MS untargeted Metabolomics of Root Exudates in Grassland Ecosystems

so the multi-species use case is becoming real and we still need 
a feasible recommendation about the "species" column in the MAFs.

We had the following suggestions, and I summarised a few 
pros (+) and cons (-):

A: Disentangle the MAF matrix so that for each species 
  there is one row with values in the samples of that species
  + clean correspondence between metabolite and species
  - creates huge / tall matrix 
  - unsuitable for statistics on the matrix

B: Concatenate the species with a separator like "|" as in 
  "Centaurea jacea|Festuca rubra|Holcus lanatus|Knautia arvensis|Geranium pratense|Poa pratensis|Leucanthemum vulgare|blank|Dactylis glomerata|Phleum pratense|Mix|Plantago lanceolata|Anthoxanthum odoratum|Avenula pubescens|standard|Ranunculus acris"
  + MAF still usable for statistics
  + a feature is present in all species where the MAF intensity is not NA
  - if people use something like fillPeaks(), it is still unclear 
    which feature is present in which species.
  - if people use any kind of missing value imputation, 
    it is still unclear which feature is present in which species.

C: Leave species empty (or say "multiple")
  + no false positive metabolite-species assignment
  + MAF still usable for statistics
  - no metabolite-species assignment at all

I lean towards option C, because A is infeasible and B is error-prone 
and might lead to false positive assignments. 

The proper solution would be to have multiple MAFs: 
one for the statistics as in C, and further MAFs, 
one per species as in A (each containing only the 
samples of that particular species).

Looking forward to comments.
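
A minimal sketch of the multiple-MAF idea, in plain Python, under the assumption that the combined matrix is kept unchanged for statistics and that per-species tables are derived from it. The data, the function name and the rule "drop metabolites with no measured value in that species" are hypothetical illustrations, not an existing MetaboLights or ISA tools feature.

```python
from collections import defaultdict

# Hypothetical toy data; sample and metabolite names are illustrative only.
sample_species = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
intensities = {
    "metabolite_20": {"s1": 1200.0, "s2": None, "s3": 830.5, "s4": None},
    "metabolite_21": {"s1": None, "s2": None, "s3": None, "s4": 410.0},
}


def split_by_species(intensities, sample_species):
    """Return species -> {metabolite -> {sample: value}}.

    Each per-species table contains only the samples of that species and
    only the metabolites with at least one measured value in those samples.
    """
    samples_of = defaultdict(list)
    for sample, species in sample_species.items():
        samples_of[species].append(sample)

    tables = {}
    for species, samples in samples_of.items():
        table = {}
        for metabolite, row in intensities.items():
            subset = {s: row.get(s) for s in samples}
            if any(value is not None for value in subset.values()):
                table[metabolite] = subset
        tables[species] = table
    return tables


tables = split_by_species(intensities, sample_species)
for species, table in tables.items():
    print(species, table)
```

The split has to run on the raw matrix, before fillPeaks() or any other missing value imputation, because afterwards every metabolite would appear in every per-species table.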