bigbio / proteomics-sample-metadata

The Proteomics sample metadata: Standard for experimental design annotation in proteomics datasets
GNU General Public License v2.0

How to encode search parameters in the SDRF #293

Closed daichengxin closed 4 years ago

daichengxin commented 4 years ago

In the experiment described in PMID: 25043054, three software tools were used to identify the peptides, and each had different parameter settings, as follows:

Pepitome and MyriMatch used precursor tolerances of 10 ppm, while MS-GF+ used a 20-ppm window; all three algorithms allowed fragments to vary by up to 0.5 m/z, and both database search engines considered semi-tryptic peptides equally with fully tryptic peptides, allowed for isotopic error in precursor ion selection, conducted on-the-fly peptide sequence reversal, and applied static +57 modifications to cysteines and dynamic +16 oxidations to methionines. MS-GF+ considered acetylation for protein amino termini, whereas MyriMatch added pyroglutamine modifications to the N termini of peptides starting with Gln residues. Pepitome considered any modification variants and trypsin specificities that were included in the spectral library.

How should I reasonably express the above parameters?

ypriverol commented 4 years ago

The main difference here is in the precursor mass tolerance and the fragment mass tolerance. This is one of the challenges of adding analysis information to the file: there can be so many combinations that it is difficult to capture all of them.
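To make the difference between the two precursor tolerances concrete, here is a minimal sketch that converts a ppm tolerance into an absolute m/z window. The function name `ppm_window` and the example m/z of 1000 are assumptions chosen for illustration; they do not come from the paper or from the SDRF specification.

```python
def ppm_window(mz, tolerance_ppm):
    """Return the (low, high) m/z bounds for a tolerance given in ppm."""
    # 1 ppm is one millionth of the measured value.
    delta = mz * tolerance_ppm / 1_000_000
    return mz - delta, mz + delta


# Hypothetical precursor at m/z 1000, searched with both settings.
low_10, high_10 = ppm_window(1000.0, 10)
low_20, high_20 = ppm_window(1000.0, 20)

print(f"10 ppm: {low_10:.3f} to {high_10:.3f}")
print(f"20 ppm: {low_20:.3f} to {high_20:.3f}")
```

At this m/z the 20 ppm window is twice as wide as the 10 ppm one, so a single annotated value cannot describe both searches exactly.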

This is also the main reason we now capture the MS2 analyzer, as suggested by the @levitsky team.

I recommend capturing only one value here, in this case 20 ppm. We probably need to find a more robust way to capture search engine parameters in the protocol. I would also say that this situation does not apply to the majority of datasets.
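As a rough sketch of the single-value approach, the snippet below writes one tolerance pair per raw file into a tab-separated table. The column spellings follow the SDRF `comment[...]` convention but should be treated as assumptions here, and the file name `sample_01.raw` is hypothetical.

```python
import csv
import io

# Assumed column spellings; check them against the current specification.
COLUMNS = [
    "comment[data file]",
    "comment[precursor mass tolerance]",
    "comment[fragment mass tolerance]",
]


def build_sdrf_rows(raw_files, precursor_tol, fragment_tol):
    """Return one row per raw file, each carrying a single tolerance pair."""
    return [
        {
            "comment[data file]": name,
            "comment[precursor mass tolerance]": precursor_tol,
            "comment[fragment mass tolerance]": fragment_tol,
        }
        for name in raw_files
    ]


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical raw file annotated with the single recommended value.
rows = build_sdrf_rows(["sample_01.raw"], "20 ppm", "0.5 m/z")
sdrf_text = to_tsv(rows)
print(sdrf_text)
```

The trade-off is that the 10 ppm setting used by Pepitome and MyriMatch is not recorded anywhere in the table.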

levitsky commented 4 years ago

Yes, I should say this is why we pushed against using search parameters in annotation and in favor of material details like MS2 analyzer. SDRF is supposed to describe "the relationship between the sample and the data file". Instrument is a part of that relationship, and search parameters simply are not.

If we were annotating the relationship from sample to mzid files or pepXML files, then yes, there would be a perfect place for all search parameters, and we wouldn't have this problem, because there'd be one-to-one correspondence: different file for each setting. This could even make some sense actually, as search result files are deposited on PRIDE along with raw files, perhaps we could include them in SDRF.

But in relation to each raw file these values are nothing more than a comment. They say nothing about the file. I think it makes sense to either omit all values (they are mandatory right now but I think this should be changed) or add them all as repeated columns.
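The "repeated columns" option could look roughly like the sketch below, where one logical field is expanded into several columns with the same header, one per distinct value. The helper `repeated_columns`, the column spellings, and the file name are assumptions for illustration only.

```python
import csv
import io


def repeated_columns(name, values):
    """Expand one logical field into repeated columns, one per distinct value."""
    distinct = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    return [name] * len(distinct), distinct


header, cells = ["comment[data file]"], ["sample_01.raw"]  # hypothetical file
for name, values in [
    # One value per search tool: Pepitome, MyriMatch, MS-GF+.
    ("comment[precursor mass tolerance]", ["10 ppm", "10 ppm", "20 ppm"]),
    ("comment[fragment mass tolerance]", ["0.5 m/z", "0.5 m/z", "0.5 m/z"]),
]:
    cols, vals = repeated_columns(name, values)
    header += cols
    cells += vals

# csv.writer is used instead of DictWriter because header names repeat.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerow(cells)
print(buffer.getvalue())
```

Note that this layout records which values were used but not which tool used which value.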

ypriverol commented 4 years ago

> Yes, I should say this is why we pushed against using search parameters in annotation and in favor of material details like MS2 analyzer. SDRF is supposed to describe "the relationship between the sample and the data file". Instrument is a part of that relationship, and search parameters simply are not.

Agreed, and we should evaluate all the columns in the data file and make mandatory only the ones that are needed. So far (https://github.com/bigbio/proteomics-metadata-standard/blob/master/experimental-design/README.adoc#3-from-samples-to-assay-msrun):

  • label
  • fraction
  • data file

Candidates to be formally mandatory:

  • Instrument
  • Ms2 analyzer

> If we were annotating the relationship from sample to mzid files or pepXML files, then yes, there would be a perfect place for all search parameters, and we wouldn't have this problem, because there'd be one-to-one correspondence: different file for each setting. This could even make some sense actually, as search result files are deposited on PRIDE along with raw files, perhaps we could include them in SDRF.

I would prefer not to go in this direction for the first release. The files will be huge and the amount of annotations will be massive. It is now possible to have multiple SDRF files associated with the same file. In the future, we can have a second version of the specification to annotate other file relations, like search results.

> But in relation to each raw file these values are nothing more than a comment. They say nothing about the file. I think it makes sense to either omit all values (they are mandatory right now but I think this should be changed) or add them all as repeated columns.

This is an important point. @Nuno from MassIVE and others have said before that search parameters should be encoded in the SDRF. I understand that, although I don't like doing it. I did find it important to encode, as additional columns, the information that 99% of papers report in their Methods sections: PTMs and tolerances.

I think it is optional for the annotator or submitter to select one or more of these values when annotating. It is also related to the discussion in https://github.com/bigbio/proteomics-metadata-standard/pull/285: it would be great to have a system to encode dependencies between columns, because you could then define, for example, an MS-GF+ precursor tolerance of 20 ppm and a MyriMatch precursor tolerance of 10 ppm, and the problem is solved.
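One possible shape for such a column dependency is sketched below. It assumes a hypothetical `comment[search engine]` column that opens a group, with the parameter columns that follow it belonging to that engine. Neither this column name nor the grouping rule is part of the specification; they only illustrate the idea.

```python
def group_by_engine(header, cells, key="comment[search engine]"):
    """Attach each parameter column to the nearest preceding engine column."""
    groups = {}
    current = None
    for column, value in zip(header, cells):
        if column == key:
            # A new engine column starts a new group of dependent columns.
            current = groups.setdefault(value, {})
        elif current is not None:
            current[column] = value
    return groups


# Hypothetical header and row; the engine column is an assumed extension.
header = [
    "comment[data file]",
    "comment[search engine]", "comment[precursor mass tolerance]",
    "comment[search engine]", "comment[precursor mass tolerance]",
]
cells = ["sample_01.raw", "MS-GF+", "20 ppm", "MyriMatch", "10 ppm"]

params = group_by_engine(header, cells)
print(params)
```

With a rule like this, each tolerance stays tied to the tool that used it, at the cost of making column order significant.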

levitsky commented 4 years ago

> Agreed, and we should evaluate all the columns in the data file and make mandatory only the ones that are needed. So far (https://github.com/bigbio/proteomics-metadata-standard/blob/master/experimental-design/README.adoc#3-from-samples-to-assay-msrun):
>
>   • label
>   • fraction
>   • data file
>
> Candidates to be formally mandatory:
>
>   • Instrument
>   • Ms2 analyzer
The SDRF validator still treats tolerances as mandatory; I will make a PR to change that.
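The proposed split between mandatory and candidate columns can be sketched as a small header check. This is not the actual SDRF validator code; the function `check_columns` and the exact column spellings are assumptions made for illustration.

```python
# Assumed column spellings for the fields named in this thread.
MANDATORY = {
    "comment[label]",
    "comment[fraction identifier]",
    "comment[data file]",
}
CANDIDATES = {
    "comment[instrument]",
    "comment[ms2 analyzer type]",
}
# Tolerance columns are deliberately in neither set: they are optional here.


def check_columns(header):
    """Return (errors, warnings) as sorted lists of missing column names."""
    present = {column.strip().lower() for column in header}
    errors = sorted(MANDATORY - present)     # missing mandatory columns
    warnings = sorted(CANDIDATES - present)  # missing candidate columns
    return errors, warnings


# Hypothetical header with no tolerance columns and no MS2 analyzer.
errors, warnings = check_columns([
    "comment[label]",
    "comment[fraction identifier]",
    "comment[data file]",
    "comment[instrument]",
])
print(errors, warnings)
```

Under this scheme a file without tolerance columns passes, and a missing candidate column produces a warning rather than an error.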

> I would prefer not to go in this direction for the first release. The files will be huge and the amount of annotations will be massive. It is now possible to have multiple SDRF files associated with the same file. In the future, we can have a second version of the specification to annotate other file relations, like search results.

I agree this is not a trivial change and needs to be thoroughly discussed, but I'm not sure what you mean about huge files and massive amount of annotations. I think most datasets don't include search results at all or include one search result per raw file?

> I think it is optional for the annotator or submitter to select one or more of these values when annotating. It is also related to the discussion in #285: it would be great to have a system to encode dependencies between columns, because you could then define, for example, an MS-GF+ precursor tolerance of 20 ppm and a MyriMatch precursor tolerance of 10 ppm, and the problem is solved.

I agree, it makes sense to go that way if we really want to be annotating search parameters.

StSchulze commented 4 years ago

In the context of making search parameters optional and treating the SDRF as a sample <-> data file relationship format, I would like to revisit the PTMs column, i.e. comment[modification parameters]. Currently, this column describes the modifications included as search parameters and should therefore be optional. In that case, however, essential information such as cysteine alkylation would be missing. In my opinion, there should therefore be an additional mandatory column describing the type of sample treatment. As with the label column, you would then have one column describing the sample property and one column describing the search parameter.

ypriverol commented 4 years ago

We have solved this issue. Please reopen it if you need further clarification.