Closed daichengxin closed 4 years ago
The main difference here is the precursor mass tolerance and the fragment mass tolerance. This is one of the challenges of adding analysis information to the file: there can be so many combinations that it is difficult to capture all of them.
This is also the main reason why we are now capturing the MS2 analyzer, as suggested by the @levitsky team.
I recommend capturing only one parameter here, in this case 20 ppm. We probably need to find a more robust way to capture search engine parameters in the protocol. I would also say that this situation does not apply to the majority of datasets.
Yes, I should say this is why we pushed against using search parameters in annotation and in favor of material details like MS2 analyzer. SDRF is supposed to describe "the relationship between the sample and the data file". Instrument is a part of that relationship, and search parameters simply are not.
If we were annotating the relationship from sample to mzid files or pepXML files, then yes, there would be a perfect place for all search parameters, and we wouldn't have this problem, because there'd be one-to-one correspondence: different file for each setting. This could even make some sense actually, as search result files are deposited on PRIDE along with raw files, perhaps we could include them in SDRF.
But in relation to each raw file, these values are nothing more than a comment. They say nothing about the file. I think it makes sense to either omit all values (they are mandatory right now, but I think this should be changed) or add them all as repeated columns.
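To illustrate the "repeated columns" option, here is a minimal sketch of reading an SDRF-like TSV in which the same tolerance column appears twice. The file content, the file name `run01.raw`, and the helper names are hypothetical, and the column names are assumed to follow the `comment[...]` convention. A plain `csv.reader` is used because a dict-based reader would keep only the last of several columns that share a header.

```python
import csv
import io

# Hypothetical SDRF fragment with the same tolerance column repeated twice.
SDRF_TEXT = (
    "source name\tcomment[data file]\t"
    "comment[precursor mass tolerance]\tcomment[precursor mass tolerance]\n"
    "sample 1\trun01.raw\t20 ppm\t10 ppm\n"
)


def repeated_values(header, row, column):
    """Return every value in `row` whose header cell equals `column`, in order."""
    return [value for name, value in zip(header, row) if name == column]


def read_tolerances(text, column="comment[precursor mass tolerance]"):
    """Map each data file to the list of values found in the repeated column."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    file_index = header.index("comment[data file]")
    return {row[file_index]: repeated_values(header, row, column) for row in reader}


tolerances = read_tolerances(SDRF_TEXT)
```

Note that repeated columns alone do not say which search engine each value belongs to; they only record that several settings were used.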
Yes, I should say this is why we pushed against using search parameters in annotation and in favor of material details like MS2 analyzer. SDRF is supposed to describe "the relationship between the sample and the data file". Instrument is a part of that relationship, and search parameters simply are not.
Agreed, and we should evaluate all the columns in the data file and make mandatory only the ones that are needed. So far (https://github.com/bigbio/proteomics-metadata-standard/blob/master/experimental-design/README.adoc#3-from-samples-to-assay-msrun):
- label
- fraction
- data file
Candidates to be formally mandatory:
- Instrument
- Ms2 analyzer
If we were annotating the relationship from sample to mzid files or pepXML files, then yes, there would be a perfect place for all search parameters, and we wouldn't have this problem, because there'd be one-to-one correspondence: different file for each setting. This could even make some sense actually, as search result files are deposited on PRIDE along with raw files, perhaps we could include them in SDRF.
I would prefer not to go in this direction for the first release. The files will be huge and the amount of annotations will be massive. It is now possible to have multiple SDRF files associated with the same file. In the future, we can have a second version of the specification to annotate other file relations, like Search Results.
But in relation to each raw file, these values are nothing more than a comment. They say nothing about the file. I think it makes sense to either omit all values (they are mandatory right now, but I think this should be changed) or add them all as repeated columns.
This is an important point: @Nuno from MassIVE and others have mentioned before that search parameters should be encoded in the SDRF. While I understand that, I don't like doing it. I did find it important to encode, as additional columns, the information that 99% of papers report in their Methods sections: PTMs and Tolerances.
I think it is optional for the annotator and submitter to select one or more of these values when annotating. It is also related to the discussion in https://github.com/bigbio/proteomics-metadata-standard/pull/285: it would be great to have a system to encode dependencies between columns, because you could then define, for example, MSGF precursor tolerance 20 ppm and MyriMatch precursor tolerance 10 ppm, and the problem is solved.
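As a rough sketch of what such engine-dependent columns could look like, the snippet below builds one column per search engine. The naming scheme `comment[<engine> precursor tolerance]` and the helper `engine_tolerance_columns` are hypothetical and only illustrate the idea; they are not part of the specification.

```python
# Hypothetical naming scheme: the search engine name is embedded in the column header.
def engine_tolerance_columns(tolerances, parameter="precursor tolerance"):
    """Turn {engine: value} into an ordered list of (header, value) pairs."""
    return [
        (f"comment[{engine} {parameter}]", value)
        for engine, value in tolerances.items()
    ]


columns = engine_tolerance_columns({"MSGF": "20 ppm", "MyriMatch": "10 ppm"})

# Tab-separated header and value cells, ready to append to an SDRF row.
header_line = "\t".join(header for header, _ in columns)
value_line = "\t".join(value for _, value in columns)
```

Keeping the engine name in the header makes each value unambiguous, at the cost of column names that vary from dataset to dataset.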
Agreed, and we should evaluate all the columns in the data file and make mandatory only the ones that are needed. So far (https://github.com/bigbio/proteomics-metadata-standard/blob/master/experimental-design/README.adoc#3-from-samples-to-assay-msrun):
- label
- fraction
- data file
Candidates to be formally mandatory:
- Instrument
- Ms2 analyzer
The SDRF validator still treats tolerances as mandatory; I will make a PR to change that.
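For illustration only, a check along these lines could look like the sketch below. This is not the actual validator code; the exact column names and the function `missing_columns` are assumptions. Tolerance columns are simply absent from the required set, so a file without them passes.

```python
# Assumed column names (lower-cased); not taken from the real validator.
MANDATORY = (
    "comment[label]",
    "comment[fraction identifier]",
    "comment[data file]",
)
# Candidates discussed above; checked only when passed in explicitly.
CANDIDATES = (
    "comment[instrument]",
    "comment[ms2 analyzer type]",
)


def missing_columns(header, required=MANDATORY):
    """Return the required column names that are absent from `header`."""
    present = {name.strip().lower() for name in header}
    return [name for name in required if name not in present]


header = ["source name", "comment[label]", "comment[data file]", "comment[instrument]"]
missing = missing_columns(header)
missing_with_candidates = missing_columns(header, MANDATORY + CANDIDATES)
```

Passing `MANDATORY + CANDIDATES` shows how the stricter rule would behave if the candidate columns were promoted to mandatory.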
I would prefer not to go in this direction for the first release. The files will be huge and the amount of annotations will be massive. It is now possible to have multiple SDRF files associated with the same file. In the future, we can have a second version of the specification to annotate other file relations, like Search Results.
I agree this is not a trivial change and needs to be thoroughly discussed, but I'm not sure what you mean about huge files and massive amount of annotations. I think most datasets don't include search results at all or include one search result per raw file?
I think it is optional for the annotator and submitter to select one or more of these values when annotating. It is also related to the discussion in #285: it would be great to have a system to encode dependencies between columns, because you could then define, for example, MSGF precursor tolerance 20 ppm and MyriMatch precursor tolerance 10 ppm, and the problem is solved.
I agree, it makes sense to go that way if we really want to be annotating search parameters.
In the context of making search parameters optional and having the SDRF as a sample <-> data file relationship format, I would like to revisit the PTMs column, i.e. comment[modification parameters]. Currently, this column describes the modifications included as search parameters and should therefore be optional. In that case, however, essential information about, e.g., cysteine alkylation would be missing. In my opinion, there should therefore be an additional mandatory column that describes the type of sample treatment. Similar to the label column, you would then have one column describing the sample property and one column describing the search parameter.
We have solved this issue. Please reopen it if you need more clarification.
In the experiment from PMID: 25043054, three software tools were used to identify the peptides, and the parameter settings for each were different, as follows:
How should I reasonably express the above parameters?