HUPO-PSI / mzIdentML

Repository for mzIdentML and the corresponding examples

Multiple search engine encoding #5

Closed andrewrobertjones closed 8 years ago

andrewrobertjones commented 8 years ago

We decided in Ghent to get rid of the concept of "final PSM list" and "intermediate list" - only final results are allowed to make reading easier. @germa Please remove this from the mapping file and validator once agreed. Log any complaints here!

I have added a schema-level check that the combination of spectrumID + spectraData_ref is unique, to lock down this potential source of error in results files.
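For anyone wanting to pre-check example files, the uniqueness rule can be approximated outside the schema. This is only a rough sketch, not the actual schema constraint or validator rule: it assumes the pair lives in the `spectraData_ref` and `spectrumID` attributes of `SpectrumIdentificationResult`, ignores XML namespaces, and uses a made-up inline document (`EXAMPLE`) and a hypothetical helper name (`duplicate_spectrum_keys`).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_spectrum_keys(xml_text):
    """Return (spectraData_ref, spectrumID) pairs used by more than one result.

    The check is file-wide: a pair repeated in two different
    SpectrumIdentificationLists is reported as a duplicate.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        (el.get("spectraData_ref"), el.get("spectrumID"))
        for el in root.iter()
        if local_name(el.tag) == "SpectrumIdentificationResult"
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical, heavily trimmed document: the same spectrum appears in two lists.
EXAMPLE = """
<MzIdentML>
  <SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1" spectraData_ref="SD_1" spectrumID="index=0"/>
    <SpectrumIdentificationResult id="SIR_2" spectraData_ref="SD_1" spectrumID="index=1"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_2">
    <SpectrumIdentificationResult id="SIR_3" spectraData_ref="SD_1" spectrumID="index=0"/>
  </SpectrumIdentificationList>
</MzIdentML>
"""

if __name__ == "__main__":
    print(duplicate_spectrum_keys(EXAMPLE))
```

An empty list means no spectrum is reported twice anywhere in the file, which is the stricter, file-wide reading of the rule.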

Please check that this is okay on all your example files.

smdb21 commented 8 years ago

Hi, correct me if I am wrong, but I am thinking of a situation in which a set of spectra is searched with different parameters, and the user then wants to merge the results of the different searches into a single mzIdentML file. In that case, the same spectrumID + spectraData_ref pairs occur in both searches, but in different SpectrumIdentificationLists. So I guess that spectrumID + spectraData_ref should be unique within each SIL, shouldn't it? Is the schema-level check specific to each SIL? The validator has an object rule for that, but it is not checking for uniqueness within each SIL.

andrewrobertjones commented 8 years ago

Hi @smdb21 The point of the change is to prevent the encoding you are describing - although I realise some people may not like it, hence it needs discussion. I am pushing the concept that a "valid" mzIdentML 1.2 file contains only "final" results. This means that if there are multiple SILists, the parser MUST process them with no exceptions.

As further rationale - in our pipelines (and presumably many others), we create 6-10 intermediate mzIdentML files from processing via various steps. We discard all the intermediate files before sending to PRIDE or another data consumer. The results from individual search engines prior to combining are an example of an intermediate file. A piece of reading software should (in my opinion) not have to do the work of figuring out the ranked PSMs for a given spectrum. If different SILists contain the same spectrum, then it places work on the reading software to make a judgement. Given that mzid 1.2 has already got pretty complicated to implement, I am arguing that this is a simplification worth having. Happy to revert the decision if there is a general feeling that this is a crucial feature to support.

julianu commented 8 years ago

I would vote for removing the intermediate lists as well.

We could introduce an optional field (probably in Inputs?), which links to the intermediate mzId files and facilitates keeping track of the way the final list was calculated. Thus, uploading the final mzId and the intermediates to PRIDE etc. would still show the whole workflow. At least, given you know the software that did it.

javizca commented 8 years ago

I also support the change. Life for readers would be much easier, and if needed, people can keep track of all those intermediate results in different mzIdentML files. Something that has never been discussed in detail is whether there is a need to create a "wrapper" file that would connect all files belonging to the same experiment (this is not only applicable to mzIdentML; it could also be used to explicitly link identification and quantification results). The closest thing we have now is the ProteomeXchange XML format and/or the corresponding tab-delimited file that we use for handling submissions. They would need some extra "tags" to do this, but it would be possible if we think this is needed.

smdb21 commented 8 years ago

I am not sure about this change. I think it is quite common to run more than one search over the same dataset (same spectra) using different parameters (such as HEAVY fixed masses, or additional PTMs), and then to apply a statistical validation over all of them, resulting in a single protein and peptide list, which means a single resulting file. In this case there is no single final PSM list; all of them are final. The final protein list will contain PSMs from all of these searches. So I don't see the reason in this case to have to split into several mzIdentML files. What do you think?

andrewrobertjones commented 8 years ago

@smdb21 In this case, you can still represent the final results in a single file; you just have to combine the results into a single SIList, where each SIResult is unique. This makes it easier to work out what is being claimed to have been identified. If the lists are kept separate (in one file with 2 lists, or in 2 files with one list each), a reader might double count the number of spectra queried (and possibly over count PSMs), or would need its own logic for combining the same spectrum in different lists - which would probably not exist.
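To make the double-counting concern concrete, here is a minimal sketch of the kind of combining logic a writer would apply before export. It is an illustration only, not how any particular tool does it: the dictionary keys (`spectraData_ref`, `spectrumID`, `peptide`, `score`) and the function name `merge_results` are hypothetical, and it assumes the scores from the different searches are already on a comparable scale (in practice a combined score would be needed).

```python
from collections import defaultdict


def merge_results(*searches):
    """Merge per-search PSM lists into one entry per spectrum.

    Each search is a list of dicts with the hypothetical keys
    'spectraData_ref', 'spectrumID', 'peptide' and 'score'
    (assumed comparable across searches, higher is better).
    """
    merged = defaultdict(list)
    for psms in searches:
        for psm in psms:
            # One result per spectrum, keyed the same way as the uniqueness rule.
            key = (psm["spectraData_ref"], psm["spectrumID"])
            merged[key].append(psm)
    # Rank the PSMs within each spectrum, best score first.
    return {
        key: sorted(psms, key=lambda p: p["score"], reverse=True)
        for key, psms in merged.items()
    }


search_a = [
    {"spectraData_ref": "SD_1", "spectrumID": "index=0", "peptide": "PEPTIDEK", "score": 0.9},
    {"spectraData_ref": "SD_1", "spectrumID": "index=1", "peptide": "ELVISK", "score": 0.4},
]
search_b = [
    {"spectraData_ref": "SD_1", "spectrumID": "index=0", "peptide": "PEPTLDEK", "score": 0.7},
]

if __name__ == "__main__":
    merged = merge_results(search_a, search_b)
    naive_count = len(search_a) + len(search_b)  # counts spectrum index=0 twice
    print(naive_count, len(merged))
```

With two separate lists, a reader that simply sums the results sees three spectra; after merging there are two, and the spectrum searched twice carries both candidate PSMs in ranked order.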

mwalzer commented 8 years ago

"If the lists are kept separate (in one file with 2 lists) or 2 files (with one list), a reader might double count the number of spectra queried (and possibly over count PSMs), or would need its own logic for combining the same spectrum in different list - which would probably not exist."

I agree with Andy here: not having separate lists reduces the number of ways a reader could interpret the consumed mzid(s).

smdb21 commented 8 years ago

But then, what about the search parameters? If we have a single mzIdentML file with a single SIList merging PSMs from different searches with different parameters, can we have, in AnalysisCollection, two different SpectrumIdentification elements referencing different SpectrumIdentificationProtocols (different search parameters) and the same SIList? Like:

...
<SpectrumIdentificationList id="SIL_1"> ... </SpectrumIdentificationList>
...
<AnalysisCollection>
   <SpectrumIdentification spectrumIdentificationProtocol_ref="parameters_1" spectrumIdentificationList_ref="SIL_1"> ... </SpectrumIdentification>
   <SpectrumIdentification spectrumIdentificationProtocol_ref="parameters_2" spectrumIdentificationList_ref="SIL_1"> ... </SpectrumIdentification>
</AnalysisCollection>
...

In this case, we lose the ability to know with which parameters a specific PSM was searched, right? We know that SIL_1 is the combination of 2 searches, but for each PSM we don't know which search it came from. Maybe that is not an important issue anyway...

julianu commented 8 years ago

That is right, this information would be lost in the merged file. To keep it, I think some kind of mapping, maybe something like the metafile suggested by @javizca, would be a good idea.

Regarding the SpectrumIdentificationProtocol: currently, when PIA is exporting a merged PSM list, an additional SpectrumIdentificationProtocol for the merging is generated. This, though, only contains the merging settings and not the original search settings, and does not solve the problem shown by @smdb21.

andrewrobertjones commented 8 years ago

Here is what the spec doc says about the SIProtocol:

The <SpectrumIdentification> element MUST reference a <SpectrumIdentificationProtocol> holding representative parameters used across all search engines (i.e. search tolerances, enzyme and modifications), since these are MANDATORY elements. If the same search parameters were not employed in all source searches, the parameters should be set with superset or widest values i.e. all modifications that have been searched, widest tolerances and so on. All search engines that have been employed SHOULD be represented within the <AnalysisSoftwareList>. It must also be highlighted that mzIdentML cannot be used to model the order in which the software was used (it does not support workflows).
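As a rough illustration of the "superset or widest values" guidance, the sketch below folds the settings of several searches into one representative set. The dictionary layout, the field names (`modifications`, `precursor_tolerance_ppm`, `fragment_tolerance_da`) and the function name are all hypothetical, and it assumes the tolerances of the different searches are expressed in the same units.

```python
def representative_parameters(protocols):
    """Combine per-search settings into one superset/widest protocol.

    `protocols` is a non-empty list of dicts using hypothetical field names;
    tolerances are assumed to share the same unit across searches.
    """
    return {
        # Superset: every modification searched in at least one run.
        "modifications": sorted(set().union(*(p["modifications"] for p in protocols))),
        # Widest: the largest tolerance used by any run.
        "precursor_tolerance_ppm": max(p["precursor_tolerance_ppm"] for p in protocols),
        "fragment_tolerance_da": max(p["fragment_tolerance_da"] for p in protocols),
    }


parameters_1 = {
    "modifications": ["Carbamidomethyl (C)", "Oxidation (M)"],
    "precursor_tolerance_ppm": 10.0,
    "fragment_tolerance_da": 0.02,
}
parameters_2 = {
    "modifications": ["Carbamidomethyl (C)", "Phospho (STY)"],
    "precursor_tolerance_ppm": 20.0,
    "fragment_tolerance_da": 0.5,
}

if __name__ == "__main__":
    print(representative_parameters([parameters_1, parameters_2]))
```

The result describes a search space that covers every source search, but it no longer says which run used which setting.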

I realise that there is some information loss here, but as noted above, there is no way in a standard to capture all meta-data of every stage in a way that any reading software could really process and understand it.

As for a wrapper format: this could be useful, but I think it is overkill to consider it a way to capture all intermediate files in a workflow. In my opinion, it would be for relating together quant, ident, peak list and raw files - e.g. based on the PX XML.

I would like to get agreement on this one soon if possible. Can you give this comment a thumbs up or down? If I get enough thumbs up, I will close it. If anyone still feels strongly, vote thumbs down and add another comment below. Thanks! Andy

smdb21 commented 8 years ago

I agree that this is the less risky option. So, if we agree on this, I will update my example file accordingly.

julianu commented 8 years ago

Maybe this should go to a new issue, but: Do we have a recommendation on how to handle s with multiple instances of the same score (or any other Param)? For example if identifications from different search engines are merged using the FDRScore and CombinedFDRScore. So for each search engine run there is (optimally) one FDRScore in the combined , should these be reported? Otherwise, you could have multiple say Mascot Scores, if you searched with different modifications, etc.

Maybe we could also introduce a CvParam which indicates from which original runs each single gets its evidence, or is there one already which I missed? Alternatively something like a new optional evidence-element could be added to the combined s, which holds the original scores etc. but as that is a schema change I guess we don't want to do that now.

andrewrobertjones commented 8 years ago

Agreed to use mechanism and closing