bigbio / proteomics-sample-metadata

The Proteomics Experimental Design file format: Standard for experimental design annotation
GNU General Public License v2.0

Extending SDRF for Metabolomics #678

Open ypriverol opened 10 months ago

ypriverol commented 10 months ago

In issue #671 we discussed the major changes to the SDRF specification for the next release, 1.1. One of them is the extension of SDRF to metabolomics.

GOAL: The aim of adapting SDRF to metabolomics is to use the same format in multiomics experiments for reanalysis and data integration. The format does not aim to replace other standards in the field, such as ISA-TAB, but to complement them, allowing easy future integration across different omics repositories and pipelines.

The steps for this development would be:

These efforts will be led by @mmattano @nilshoffmann @ypriverol

nilshoffmann commented 10 months ago

@ypriverol Instead of pure ISA-Tab, would the MetaboLights flavor of it also work for a start? This study could serve as a starting point: https://www.ebi.ac.uk/metabolights/editor/study/MTBLS1375

mmattano commented 9 months ago

Hi @ypriverol and @nilshoffmann, I thought about how the SDRF proteomics format needs to be modified to fit metabolomics. In general, mass spectrometry-based analysis is very similar regardless of the investigated molecule so I think it’s more of a question of what should be commented on/recommended. For example, fractionating is quite common in proteomics and mentioned as required information in the proteomics SDRF paper but it’s an edge case in metabolomics, so I would treat it as optional.

In the recommended section I would suggest adding comments on critical information for reanalysis such as derivatization, if positive or negative mode was used or if the samples stem from an isotopic labeling experiment. There are additional comments that are debatable, since they are either rare or not related to the measurement. For example, multiplexing (using isobaric labeling tags) is quite rare, but an annotation procedure analogous to multiplexing for proteomics could be described. Also, downstream analysis information, first and foremost if this measurement is intended for a targeted- or an untargeted analysis, could be added but is not a part of the data itself.

What do you think about this? Can you think of specific information that could/should be added? Should we discuss NMR-based metabolomics as well or limit ourselves to MS?

Going forward (loosely following Yasset’s outline above) I would suggest that I set up a list of required and recommended information + explanation/glossary. Then we can use this to request community feedback from EuBIC and potentially the broader metabolomics community. In the meantime, I would set up a list of databases and collect example studies. For databases with metadata, and specific metadata formats such as ISAtab, I will write parsers to translate to the SDRF.

Then we can call for a EuBIC meeting with whoever wants to be involved, discuss details (I would do this after writing some parsers since they can be adjusted and provide example files to present) and ask for contributions in annotations/checking annotated files. Please let me know what you think about this and what you think I should get started with.

ypriverol commented 9 months ago

Hi @mmattano and @nilshoffmann, here are my comments:

First of all thanks for leading this.

> Hi @ypriverol and @nilshoffmann, I thought about how the SDRF proteomics format needs to be modified to fit metabolomics. In general, mass spectrometry-based analysis is very similar regardless of the investigated molecule so I think it’s more of a question of what should be commented on/recommended. For example, fractionating is quite common in proteomics and mentioned as required information in the proteomics SDRF paper but it’s an edge case in metabolomics, so I would treat it as optional.

Along these lines, we need to check what the case will be for multiplexing studies. We use the label column to represent multiplexing: multiple samples can be related to the same file while being differentiated by the label column. If multiplexing and labeling are not common in metabolomics, we can make that column optional.
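A minimal sketch of how the label column separates samples that share a file. The column names `comment[label]` and `comment[data file]` follow the proteomics SDRF naming; the file names and the helper `samples_per_file` are made up for illustration and are not part of any tool.

```python
from collections import defaultdict

# Three SDRF rows: two multiplexed samples share run1.raw, one label-free sample has its own file.
rows = [
    {"source name": "sample 1", "comment[label]": "TMT126", "comment[data file]": "run1.raw"},
    {"source name": "sample 2", "comment[label]": "TMT127", "comment[data file]": "run1.raw"},
    {"source name": "sample 3", "comment[label]": "label free sample", "comment[data file]": "run2.raw"},
]


def samples_per_file(rows):
    """Group rows by data file; within a file, the label identifies the sample."""
    by_file = defaultdict(dict)
    for row in rows:
        by_file[row["comment[data file]"]][row["comment[label]"]] = row["source name"]
    return {data_file: dict(labels) for data_file, labels in by_file.items()}


for data_file, labels in samples_per_file(rows).items():
    print(data_file, labels)
```

If the label column becomes optional for metabolomics, a reader like this would need a default label for rows that omit it.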

> In the recommended section I would suggest adding comments on critical information for reanalysis such as derivatization, if positive or negative mode was used or if the samples stem from an isotopic labeling experiment. There are additional comments that are debatable, since they are either rare or not related to the measurement. For example, multiplexing (using isobaric labeling tags) is quite rare, but an annotation procedure analogous to multiplexing for proteomics could be described.

Related to my previous comment ☝️.

> Also, downstream analysis information, first and foremost if this measurement is intended for a targeted- or an untargeted analysis, could be added but is not a part of the data itself. What do you think about this?

I think columns describing the type of experiment MUST be part of the data information; targeted and untargeted are related in part to the way the data is captured and analyzed. We do have such cases in proteomics, where we specify the type of acquisition method. We have two options here:

1- We can define, in the same way, something like comment[metabolomics profiling] with possible values: untargeted metabolite profiling or targeted metabolite profiling.

2- We can also use the column technology type with two possible values: untargeted metabolite profiling or targeted metabolite profiling.
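The two options can be compared with a small sketch. The allowed values come from the options above. The column names are proposals from this thread, not part of any released spec, and `profiling_value` is a hypothetical helper.

```python
# Allowed values as proposed in this thread (not yet part of any released specification).
ALLOWED_PROFILING = {"untargeted metabolite profiling", "targeted metabolite profiling"}


def profiling_value(row, column="comment[metabolomics profiling]"):
    """Return the normalized profiling value of an SDRF row (a dict), or raise ValueError."""
    value = row.get(column, "").strip().lower()
    if value not in ALLOWED_PROFILING:
        raise ValueError(f"{column!r} must be one of {sorted(ALLOWED_PROFILING)}, got {value!r}")
    return value


# Option 1: a dedicated comment column.
option_1 = {"source name": "sample 1", "comment[metabolomics profiling]": "untargeted metabolite profiling"}
# Option 2: reuse the existing technology type column.
option_2 = {"source name": "sample 1", "technology type": "targeted metabolite profiling"}

print(profiling_value(option_1))
print(profiling_value(option_2, column="technology type"))
```

Either way the check is the same. The options differ only in which column a reanalysis pipeline would have to read.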

> Can you think of specific information that could/should be added? Should we discuss NMR-based metabolomics as well or limit ourselves to MS?

The main priority and first proposal must be about MS metabolomics and how SDRF can facilitate reanalysis of public metabolomics data. Then we can focus on the other use cases. What do you think?

> Going forward (loosely following Yasset’s outline above) I would suggest that I set up a list of required and recommended information + explanation/glossary. Then we can use this to request community feedback from EuBIC and potentially the broader metabolomics community.

Fully agreed. I think we should have a PR in the same repo with three documents:

1- A proposal for SDRF-metabolomics, in which we reference the SDRF proteomics specification for the sections that are common and refine the ones that are different.

2- A set of templates and one example that represent the proposal.

3- A list of ontology terms that need to be added to PSI-MS to fulfil the specification.

> In the meantime, I would set up a list of databases and collect example studies. For databases with metadata, and specific metadata formats such as ISAtab, I will write parsers to translate to the SDRF.

Agreed.

> Then we can call for a EuBIC meeting with whoever wants to be involved, discuss details (I would do this after writing some parsers since they can be adjusted and provide example files to present) and ask for contributions in annotations/checking annotated files. Please let me know what you think about this and what you think I should get started with.

As soon as we have a solid proposal and topics to be discussed, we can present this to the EuBIC and HUPO-PSI groups.

ypriverol commented 9 months ago

@mmattano @nilshoffmann I have contacted the MetaboLights team, and they provided three different examples of datasets that would be great to have represented in SDRF-metabolomics:

These three examples would be good gold standard datasets for annotations.

mwang87 commented 8 months ago

I think this is a good initiative. One thing that we might want to consider on the analysis portal side, to make it easier for people to get their data into these formats, is to automatically convert from GNPS2 metadata to SDRF so it'll just be super easy.

It'll help meet people where they are right now.

Just an example of controlled vocabulary forms of what we have in public GNPS is here:

https://redu.gnps2.org/dump

We've also put some effort into getting as much as we can into the same CV from Metabolomics Workbench by mining the metadata they already have available.

Best,

Ming

ypriverol commented 8 months ago

@mwang87 Thanks for your comments:

I do agree that we should make the conversion from GNPS to SDRF easy. The major challenge could be transforming free text into CV terms.

1- We can collect all of them.

2- Add them to the corresponding ontology or define the corresponding mapping.

3- Finally, integrate the conversion from GNPS to SDRF into sdrf-pipelines.

The most important thing now is to define the columns in the SDRF metabolomics that enable semi-automatic reanalysis. I have a couple of questions:

1- Does GNPS focus mainly on MS targeted and untargeted metabolomics experiments?

2- Should SDRF metabolomics also tackle other technologies and analytical methods, like NMR?

3- What properties do you think are crucial as data-level comments in SDRF to enable automatic reanalysis at the resource level?
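The free-text to CV step could look roughly like the sketch below. `CV_MAPPING` and `to_cv_terms` are hypothetical names and not part of sdrf-pipelines. The mapping table is a tiny illustrative sample, and unmapped values are returned so they can be collected and added to an ontology or to the mapping.

```python
# Hypothetical mapping from free-text values to (CV term name, accession).
CV_MAPPING = {
    "human": ("Homo sapiens", "NCBITaxon:9606"),
    "homo sapiens": ("Homo sapiens", "NCBITaxon:9606"),
    "mouse": ("Mus musculus", "NCBITaxon:10090"),
}


def to_cv_terms(free_text_values):
    """Split free-text values into those with a known CV term and those still unmapped."""
    mapped, unmapped = {}, []
    for text in free_text_values:
        key = text.strip().lower()
        if key in CV_MAPPING:
            mapped[text] = CV_MAPPING[key]
        else:
            unmapped.append(text)  # candidates for a new ontology term or mapping entry
    return mapped, unmapped


mapped, unmapped = to_cv_terms(["Human", " mouse ", "zebra fish"])
print(mapped)
print(unmapped)
```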

nilshoffmann commented 5 months ago

@ypriverol @mmattano @TineClaeys I have started test driving / adapting lesSDRF with the MTBLS1129 study here: https://github.com/nilshoffmann/lesSDRF/tree/sdrf_metabolomics to get a better understanding of currently supported fields / columns vs unsupported ones. One immediate finding is the difference between MetaboLights multi-column encoding of e.g.:

| Characteristics[Organism] | Term Source REF | Term Accession Number |
|---|---|---|
| Homo sapiens | NCBITAXON | http://purl.obolibrary.org/obo/NCBITaxon_9606 |

In this case, the mapping from Characteristics[Organism] <-> characteristics[organism] is trivial, but there are other more difficult cases.

Not sure if lesSDRF should be able to import / edit MetaboLights ISA files, but if we plan for adaptation / conversion at some point, having a programmatic route would be very helpful, imho.
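As a rough sketch of such a programmatic route, the three-column encoding above can be folded into one lower-cased SDRF column. `fold_isa_columns` is a hypothetical helper, and keeping the source and accession in a side dictionary is only one possible choice, since how SDRF-metabolomics should carry them is still undecided.

```python
def fold_isa_columns(header, row):
    """Fold ISA-style 'Term Source REF' / 'Term Accession Number' columns into the column before them.

    Returns (sdrf_row, accessions): sdrf_row maps lower-cased column names to values,
    accessions keeps the ontology source and accession for each folded column.
    """
    sdrf_row, accessions = {}, {}
    current = None  # name of the most recent non-ontology column
    for name, value in zip(header, row):
        if name == "Term Source REF":
            if current is not None:
                accessions.setdefault(current, {})["source"] = value
        elif name == "Term Accession Number":
            if current is not None:
                accessions.setdefault(current, {})["accession"] = value
        else:
            current = name.lower()  # Characteristics[Organism] -> characteristics[organism]
            sdrf_row[current] = value
    return sdrf_row, accessions


header = ["Characteristics[Organism]", "Term Source REF", "Term Accession Number"]
row = ["Homo sapiens", "NCBITAXON", "http://purl.obolibrary.org/obo/NCBITaxon_9606"]
print(fold_isa_columns(header, row))
```

The fold is positional because the two ontology columns repeat after every annotated column in an ISA file, so they cannot be looked up by name alone.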

ypriverol commented 5 months ago

@nilshoffmann @TineClaeys:

We should not support SDRF for metabolomics in any tool while the standard doesn't exist. For example, in the proteomics standard the fraction identifier is required/mandatory; I don't think this is needed for metabolomics datasets. What @mmattano is trying to do in this PR is to standardize SDRF for metabolomics: decide which fields should be required, which optional, etc., a similar exercise to what we did in proteomics. We are doing the same for other use cases, like affinity proteomics datasets.

My point is that for MS-based proteomics we can agree on templates, etc. However, for metabolomics we first have to create a format, rules and guidelines. So if we create a metabolomics SDRF with lesSDRF at the moment, it will be wrong.

Can you give your input on the PR created by @mmattano, who is leading the development of SDRF for metabolomics? Can we have a chat in early February about metabolomics SDRF, with @mmattano, @nilshoffmann and others interested in the topic?

Here is the PR from @mmattano: https://github.com/bigbio/proteomics-sample-metadata/pull/680
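To illustrate why the required fields have to be settled per template first, here is a minimal sketch. Both column sets are assumptions: the proteomics set is a deliberately incomplete subset, the metabolomics set is purely hypothetical until the spec exists, and `missing_columns` is not a function of any existing tool.

```python
# Assumed required columns per template; the metabolomics entry is hypothetical.
REQUIRED_COLUMNS = {
    "proteomics": {"source name", "comment[data file]", "comment[fraction identifier]"},
    "metabolomics": {"source name", "comment[data file]"},
}


def missing_columns(header, template):
    """Return the required columns of the given template that are absent from the header."""
    present = {name.strip().lower() for name in header}
    return sorted(REQUIRED_COLUMNS[template] - present)


header = ["source name", "comment[data file]"]
print(missing_columns(header, "proteomics"))
print(missing_columns(header, "metabolomics"))
```

The same file fails one template and passes the other, which is why a tool cannot validate metabolomics SDRF before the rules are agreed.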

nilshoffmann commented 5 months ago

@ypriverol @TineClaeys Please do not misunderstand my intention here. I do not plan to have SDRF for metabolomics supported in lesSDRF until there is a spec, which is also not on me to decide. With my findings, I will of course contribute to @mmattano's PR #680. Happy to chat any time in February.

ypriverol commented 5 months ago

Thanks @nilshoffmann for your quick reply. I will schedule a meeting for early February and send an email around. @mwang87 would you be able to participate?

mwang87 commented 5 months ago

Happy to chat. We've also done some work on our end on harmonizing MetaboLights with other metabolomics repositories, so hopefully our work/insight is helpful.

ypriverol commented 5 months ago

@nilshoffmann @mwang87 @mmattano and @bigbio/collaborators here is a Doodle poll for the meeting about SDRF for metabolomics: https://doodle.com/meeting/participate/id/avmJom0e

deeptijk commented 2 months ago

[#703] Metabolomics specification DRAFT [PLEASE DO NO MERGE]