HUPO-PSI / mzTab

mzTab Reporting MS-based Proteomics and Metabolomics Results
https://hupo-psi.github.io/mzTab

Some issues with file MTBLS263 #163

Closed andrewrobertjones closed 5 years ago

andrewrobertjones commented 5 years ago

Hi all, I am using some of the examples in the MTBLS263 file for the paper, but I think some of the logic for the SME rows is not as intended by the specs.

`SME 18 1022 CHEBI:25094 C6H14N2O2 null null Lysine null null [M+H]+ 147.112 1 147.113 ms_run[1]:index=856 | ms_run[1]:index=872 | ms_run[1]:index=886 | ms_run[2]:index=856 | ms_run[2]:index=872 | ms_run[2]:index=888 | ms_run[3]:index=862 | ms_run[3]:index=876 | ms_run[3]:index=890 | ms_run[4]:index=836 | ms_run[4]:index=850 | ms_run[4]:index=864 | ms_run[5]:index=836 | ms_run[5]:index=850 | ms_run[5]:index=864 | ms_run[6]:index=834 | ms_run[6]:index=848 | ms_run[6]:index=862 [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1000511,ms level,2] 53.8925 0 98.898 1 1282.28 1284
SME 19 1022 CHEBI:25094 C6H14N2O2 null null Lysine null null [M+Na]+ 169.094 1 169.095 ms_run[1]:index=882 | ms_run[2]:index=870 | ms_run[3]:index=872 | ms_run[4]:index=846 | ms_run[5]:index=846 | ms_run[6]:index=854 [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1000511,ms level,2] 53.8925 0 98.898 1 1279.2 1284
`

These rows share the same evidence_input_id, which is supposed to indicate that the same input evidence was used (for example, where the same evidence gave rise to different results). Here, however, the same evidence_input_id is used even though different inputs gave rise to IDs of the same compound (or at least of different adduct forms). This needs to be fixed, otherwise it will confuse readers.
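To make the problem concrete, here is a minimal sketch of a check that groups SME rows by evidence_input_id and flags any id that is reused for different input data. The function name `find_shared_evidence_ids`, the dictionary layout, and the choice of the (exp_mass_to_charge, spectra_ref) pair as a stand-in for "the input data" are all assumptions for illustration. The rows are shortened versions of SME 18 and SME 19 above.

```python
from collections import defaultdict


def find_shared_evidence_ids(sme_rows):
    """Return evidence_input_id values that are reused for different input data."""
    inputs_by_id = defaultdict(set)
    for row in sme_rows:
        # Assumption: the (exp m/z, spectra_ref) pair approximates "the input data".
        inputs_by_id[row["evidence_input_id"]].add(
            (row["exp_mass_to_charge"], row["spectra_ref"])
        )
    return sorted(eid for eid, inputs in inputs_by_id.items() if len(inputs) > 1)


# Hypothetical, shortened versions of SME 18 and SME 19.
rows = [
    {"SME_ID": "18", "evidence_input_id": "1022",
     "exp_mass_to_charge": "147.112", "spectra_ref": "ms_run[1]:index=856"},
    {"SME_ID": "19", "evidence_input_id": "1022",
     "exp_mass_to_charge": "169.094", "spectra_ref": "ms_run[1]:index=882"},
]

print(find_shared_evidence_ids(rows))
```

With these two rows the check reports id 1022, because one id covers two different m/z values and spectra.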

A second possible issue relates to how much data to compress into each row. As I recall, we discussed this previously: ideally, one SME row should be a single search event, unless the software aggregates multiple spectra (in this case) for a search. In this file, one row contains the combined results from searching lots of MS2 spectra, and presumably only the best score is reported. I think a preferred encoding would be to enumerate all the spectra that were searched across multiple rows, so that it is explicitly clear which fragmentation spectrum gave rise to the score reported.
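A rough sketch of what "one row per spectrum" could look like, assuming rows are held as dictionaries and spectra_ref entries are separated by `|`. The helper `expand_sme_row` and the new SME_ID numbering are hypothetical. The sketch copies the score columns unchanged, which is only a placeholder: real per-spectrum scores would have to come from the search software, which may not report them.

```python
def expand_sme_row(row, next_sme_id):
    """Split one SME row with several spectra_ref entries into one row per spectrum."""
    refs = [ref.strip() for ref in row["spectra_ref"].split("|")]
    expanded = []
    for offset, ref in enumerate(refs):
        new_row = dict(row)  # copy every other column as-is (scores are placeholders)
        new_row["SME_ID"] = str(next_sme_id + offset)  # hypothetical fresh ids
        new_row["spectra_ref"] = ref
        expanded.append(new_row)
    return expanded


# Hypothetical, shortened version of a compressed row.
compressed = {
    "SME_ID": "18",
    "evidence_input_id": "1022",
    "spectra_ref": "ms_run[1]:index=856 | ms_run[1]:index=872 | ms_run[2]:index=856",
}

for r in expand_sme_row(compressed, next_sme_id=100):
    print(r["SME_ID"], r["spectra_ref"])
```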

@jmrein any chance you could take a look?

kayrein commented 5 years ago

Hmmmmmm, I'm apparently having some issues grokking that column.

How about this: using the charged form of the Progenesis-style identifier in those columns (for Progenesis, and the purposes of this example). So SME18 would have evidence_input_id 21.37_147.1122m/z and SME19 would have evidence_input_id 21.37_169.0943m/z.

This has the properties that:

MetaScope scores fragmentation matches across the aggregate of fragmentation spectra (across both runs and adduct forms) so it will be naturally prone towards a compressed evidence table. That's relatively atypical so it would be worth having more example files, but in this case it's true to how the data was processed.
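For illustration, the identifier scheme proposed above could be generated along these lines. The function name `progenesis_style_id`, and the assumption that the format is a two-decimal retention time and a four-decimal m/z joined by an underscore, are inferred only from the two example ids in this comment, not from any Progenesis documentation.

```python
def progenesis_style_id(rt, mz):
    """Build an id like '21.37_147.1122m/z' from a retention time and an m/z value.

    Assumption: two decimals for the retention time, four for m/z.
    """
    return f"{rt:.2f}_{mz:.4f}m/z"


print(progenesis_style_id(21.37, 147.1122))  # id proposed for SME 18
print(progenesis_style_id(21.37, 169.0943))  # id proposed for SME 19
```

Since the two adduct forms have different m/z values, they get different evidence_input_id values even at the same retention time.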

andrewrobertjones commented 5 years ago

Hi Joel,

Yes, this sounds good. Can you also put CV terms into the files instead of user params, now that they've been added:

[MS,MS:1002879,Progenesis QI,2.4.6505.48857]

best_id_confidence_measure

[MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,]

etc

Thanks!
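As a small aid for checking which parameters are still user params, here is a sketch that splits an mzTab parameter of the form `[CV label, accession, name, value]` into its four fields. The names `CvParam` and `parse_cv_param` are made up for this example, and the parser assumes no field contains a comma (it does not handle quoted names).

```python
from typing import NamedTuple


class CvParam(NamedTuple):
    cv_label: str
    accession: str
    name: str
    value: str

    @property
    def is_user_param(self):
        # User params leave both the CV label and the accession empty.
        return not self.cv_label and not self.accession


def parse_cv_param(text):
    """Parse '[label,accession,name,value]'; assumes no commas inside fields."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"not a bracketed parameter: {text!r}")
    parts = [part.strip() for part in inner[1:-1].split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {text!r}")
    return CvParam(*parts)


score = parse_cv_param("[MS,MS:1002889,Progenesis MetaScope Score,]")
software = parse_cv_param("[MS,MS:1002879,Progenesis QI,2.4.6505.48857]")
print(score.accession, score.is_user_param)
print(software.name, software.value)
```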

kayrein commented 5 years ago

I have made a pull request to fix this (including the CV terms bit): https://github.com/HUPO-PSI/mzTab/pull/167

nilshoffmann commented 5 years ago

Closing for now, changes were merged in #167