Closed andrewrobertjones closed 8 years ago
Hi, correct me if I am wrong, but I am thinking of a situation in which a set of spectra is searched with different parameters and the user then wants to merge the results of all searches into a single mzIdentML file. In that case, the same spectrumID + spectraData_ref pairs occur in both searches, but in different SpectrumIdentificationLists. So I guess that spectrumID + spectraData_ref should be unique within each SIL, shouldn't it? Is the schema-level check specific to each SIL? The validator has an object rule for that, but it does not check for uniqueness within each SIL.
Hi @smdb21, the point of the change is to prevent the encoding you are describing, although I realise some people may not like it, hence it needs discussion. I am pushing the concept that a "valid" mzIdentML 1.2 file contains only "final" results. This means that if there are multiple SILists, the parser MUST process them with no exceptions.
As further rationale: in our pipelines (and presumably many others), we create 6-10 intermediate mzIdentML files from processing via various steps. We discard all the intermediate files before sending to PRIDE or another data consumer. The results from individual search engines prior to combining are an example of an intermediate file. A piece of reading software should (in my opinion) not have to do the work of figuring out the ranked PSMs for a given spectrum. If different SILists contain the same spectrum, then it places work on the reading software to make a judgement. Given that mzid 1.2 has already become pretty complicated to implement, I am arguing that this is a simplification worth having. Happy to revert the decision if there is a general feeling that this is a crucial feature to support.
I would vote for removing the intermediate lists as well.
We could introduce an optional field (probably in Inputs?), which links to the intermediate mzId files and facilitates keeping track of the way the final list was calculated. Thus, uploading the final mzId and the intermediates to PRIDE etc. would still show the whole workflow. At least, given you know the software that did it.
I also support the change. Life for readers would be much easier, and if needed, people can keep track of all those intermediate results in different mzIdentML files. Something that has never been discussed in detail is whether there is a need to create a "wrapper" file that would connect all files belonging to the same experiment (this is not only applicable to mzIdentML; it could also be used to explicitly link identification and quantification results). The closest thing we have now is the ProteomeXchange XML format and/or the corresponding tab-delimited file that we use for handling submissions. They would need some extra "tags" to do this, but it would be possible if we think this is needed.
I am not sure about this change. I think it is quite common to run more than one search over the same dataset (same spectra) using different parameters (such as HEAVY fixed masses, or additional PTMs) and then to apply a statistical validation over all of them, resulting in a single protein and peptide list, which means a single resulting file. In this case there is no single final PSM list; all of them are final. The final protein list will contain PSMs from all of these searches. So I don't see the reason in this case to have to split the results into several mzIdentML files. What do you think?
@smdb21 In this case, you can still represent the final results in a single file; you just have to combine the results into a single SIList, where each SIResult is unique. This makes it easier to work out what is being claimed to have been identified. If the lists are kept separate (in one file with 2 lists, or in 2 files with one list each), a reader might double count the number of spectra queried (and possibly overcount PSMs), or would need its own logic for combining the same spectrum from different lists, which would probably not exist.
I agree with Andy here: not having separate lists reduces the number of ways a reader could interpret the consumed mzid(s).
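To make the combining step discussed above concrete, here is a minimal Python sketch of merging PSMs from two searches so that each spectrum appears once. It is not taken from any existing tool: the dictionary layout, the `merge_searches` function and the `score` field are hypothetical, and it assumes scores are comparable across searches with higher being better, which real pipelines would need to establish first.

```python
from collections import defaultdict


def merge_searches(*searches):
    """Merge PSM lists so each (spectraData_ref, spectrumID) pair appears once.

    Each PSM is a plain dict (hypothetical layout, not an mzIdentML API).
    """
    grouped = defaultdict(list)
    for psms in searches:
        for psm in psms:
            key = (psm["spectraData_ref"], psm["spectrumID"])
            grouped[key].append(psm)

    results = {}
    for key, psms in grouped.items():
        # Assumption: higher "score" is better and comparable across searches.
        ordered = sorted(psms, key=lambda p: p["score"], reverse=True)
        # Re-rank within the merged result so the reader gets one ranked list.
        results[key] = [dict(p, rank=i) for i, p in enumerate(ordered, start=1)]
    return results


search_a = [
    {"spectraData_ref": "SD_1", "spectrumID": "index=0", "peptide": "PEPTIDEK", "score": 42.0},
    {"spectraData_ref": "SD_1", "spectrumID": "index=1", "peptide": "SAMPLER", "score": 30.0},
]
search_b = [
    {"spectraData_ref": "SD_1", "spectrumID": "index=0", "peptide": "PEPTLDEK", "score": 55.0},
]

merged = merge_searches(search_a, search_b)
```

With this shape, the number of keys in `merged` is the number of distinct spectra queried, so a reader cannot double count a spectrum that was searched twice.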
But then, what about the search parameters? If we have a single mzIdentML file with a single SIList that merges PSMs from different searches with different parameters, can we have, in AnalysisCollection, two different SpectrumIdentification elements referencing different SpectrumIdentificationProtocols (different search parameters) and the same SIList? Like:
...
<SpectrumIdentificationList id="SIL_1"> ... </SpectrumIdentificationList>
...
<AnalysisCollection>
  <SpectrumIdentification spectrumIdentificationProtocol_ref="parameters_1" spectrumIdentificationList_ref="SIL_1"> ... </SpectrumIdentification>
  <SpectrumIdentification spectrumIdentificationProtocol_ref="parameters_2" spectrumIdentificationList_ref="SIL_1"> ... </SpectrumIdentification>
</AnalysisCollection>
...
In this case, we lose the ability to know which parameters a specific PSM was searched with, right? We know that SIL_1 is the combination of 2 searches, but we don't know which search each PSM comes from. Maybe that is not an important issue anyway...
That is right, this information would be lost in the merged file. To keep it, I think some kind of mapping, maybe something like the metafile suggested by @javizca, would be a good idea.
Regarding the SpectrumIdentificationProtocol: currently, when PIA is exporting a merged PSM list, an additional SpectrumIdentificationProtocol for the merging is generated. This, though, only contains the merging settings and not the original search settings, and does not solve the problem shown by @smdb21
Here is what the spec doc says about the SIProtocol:
The <SpectrumIdentification> element MUST reference a <SpectrumIdentificationProtocol> holding representative parameters used across all search engines (i.e. search tolerances, enzyme and modifications), since these are MANDATORY elements. If the same search parameters were not employed in all source searches, the parameters should be set with superset or widest values i.e. all modifications that have been searched, widest tolerances and so on. All search engines that have been employed SHOULD be represented within the <AnalysisSoftwareList>. It must also be highlighted that mzIdentML cannot be used to model the order in which the software was used (it does not support workflows).
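As an illustration of the "superset or widest values" rule quoted above, the following Python sketch combines the parameters of two source searches into one representative set. The field names (`precursor_tolerance_ppm`, `fragment_tolerance_da`, `modifications`, `enzymes`) and the `superset_protocol` function are hypothetical simplifications, not elements of the mzIdentML schema, and the sketch assumes all tolerances are already expressed in the same unit.

```python
def superset_protocol(protocols):
    """Combine per-search parameters into one representative set.

    Tolerances take the widest value; modifications and enzymes take the union.
    Expects a non-empty list of dicts with the (hypothetical) keys used below.
    """
    return {
        "precursor_tolerance_ppm": max(p["precursor_tolerance_ppm"] for p in protocols),
        "fragment_tolerance_da": max(p["fragment_tolerance_da"] for p in protocols),
        "modifications": sorted(set().union(*(p["modifications"] for p in protocols))),
        "enzymes": sorted(set().union(*(p["enzymes"] for p in protocols))),
    }


parameters_1 = {
    "precursor_tolerance_ppm": 10.0,
    "fragment_tolerance_da": 0.5,
    "modifications": {"Carbamidomethyl (C)", "Oxidation (M)"},
    "enzymes": {"Trypsin"},
}
parameters_2 = {
    "precursor_tolerance_ppm": 20.0,
    "fragment_tolerance_da": 0.02,
    "modifications": {"Carbamidomethyl (C)", "Phospho (STY)"},
    "enzymes": {"Trypsin"},
}

combined = superset_protocol([parameters_1, parameters_2])
```

The combined protocol no longer records which search used which value, which is the information loss raised earlier in the thread.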
I realise that there is some information loss here, but as noted above, there is no way in a standard to capture all meta-data of every stage in a way that any reading software could really process and understand it.
In terms of a wrapper format: this could be useful, but I think it is overkill to consider it a way to capture all intermediate files in a workflow. In my opinion, this would be for relating together quant, ident, peak list and raw files, e.g. based on the PX XML.
I would like to get agreement on this one soon if possible. Can you give this comment a thumbs up or down? If we get enough thumbs up, I will close it. If anyone still feels strongly, vote thumbs down and add another comment below. Thanks! Andy
I agree that this is the less risky option. So, if we agree on this, I will update my example file accordingly.
Maybe this should go to a new issue, but:
Do we have a recommendation on how to handle
Maybe we could also introduce a CvParam which indicates from which original runs each single
Agreed to use mechanism and closing
We decided in Ghent to get rid of the concept of "final PSM list" and "intermediate list" - only final results are allowed to make reading easier. @germa Please remove this from the mapping file and validator once agreed. Log any complaints here!
I have added a schema-level check that spectrumID + spectraData_ref are unique, to lock down this potential for error in results files.
Please check that this is okay on all your example files.
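For anyone who wants to test their own example files outside the schema, here is a minimal Python sketch of an equivalent check. It is not the schema rule itself: `duplicate_spectrum_keys` is a hypothetical helper, it assumes the attribute is spelled `spectraData_ref` as in the mzIdentML schema, and it checks uniqueness across the whole file rather than per SIL. The embedded XML is a stripped-down fragment for illustration, not a schema-valid document.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop any '{namespace}' prefix so the check works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def duplicate_spectrum_keys(xml_text):
    """Return (spectraData_ref, spectrumID) pairs that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        (el.get("spectraData_ref"), el.get("spectrumID"))
        for el in root.iter()
        if local_name(el.tag) == "SpectrumIdentificationResult"
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Simplified fragment: the same spectrum is reported in two different lists.
EXAMPLE = """
<DataCollection>
  <SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1" spectraData_ref="SD_1" spectrumID="index=0"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_2">
    <SpectrumIdentificationResult id="SIR_2" spectraData_ref="SD_1" spectrumID="index=0"/>
    <SpectrumIdentificationResult id="SIR_3" spectraData_ref="SD_1" spectrumID="index=1"/>
  </SpectrumIdentificationList>
</DataCollection>
"""

duplicates = duplicate_spectrum_keys(EXAMPLE)
```

An empty list means every spectrum is reported once; any entry points at a spectrum that would have to be merged into a single SIResult.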