Some issues with file MTBLS263

andrewrobertjones commented 5 years ago

Hi all, I am using some of the examples in the MTBLS263 file for the paper, but I think some of the logic for the SME rows is not as intended by the spec.

```
SME	18	1022	CHEBI:25094	C6H14N2O2	null	null	Lysine	null	null	[M+H]+	147.112	1	147.113	ms_run[1]:index=856 | ms_run[1]:index=872 | ms_run[1]:index=886 | ms_run[2]:index=856 | ms_run[2]:index=872 | ms_run[2]:index=888 | ms_run[3]:index=862 | ms_run[3]:index=876 | ms_run[3]:index=890 | ms_run[4]:index=836 | ms_run[4]:index=850 | ms_run[4]:index=864 | ms_run[5]:index=836 | ms_run[5]:index=850 | ms_run[5]:index=864 | ms_run[6]:index=834 | ms_run[6]:index=848 | ms_run[6]:index=862	[MS,MS:1002889,Progenesis MetaScope Score,]	[MS,MS:1000511,ms level,2]	53.8925	0	98.898	1	1282.28	1284
SME	19	1022	CHEBI:25094	C6H14N2O2	null	null	Lysine	null	null	[M+Na]+	169.094	1	169.095	ms_run[1]:index=882 | ms_run[2]:index=870 | ms_run[3]:index=872 | ms_run[4]:index=846 | ms_run[5]:index=846 | ms_run[6]:index=854	[MS,MS:1002889,Progenesis MetaScope Score,]	[MS,MS:1000511,ms level,2]	53.8925	0	98.898	1	1279.2	1284
```

These two rows share the same evidence_input_id, which is supposed to indicate, for example, that the same evidence gave rise to different results. Here, however, the same evidence_input_id is used where different inputs gave rise to IDs of the same compound (or at least of different adduct forms). This needs to be fixed, otherwise it will confuse readers.
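A minimal sketch of how this kind of reuse could be detected automatically. It is not part of any mzTab tooling: the rows are simplified dictionaries holding only the fields needed, the helper name `find_conflicting_inputs` is hypothetical, and "different inputs" is approximated here by a differing adduct or experimental m/z, which is an assumption rather than a rule from the spec.

```python
from collections import defaultdict

# Simplified stand-ins for the two SME rows quoted above; only the
# fields needed for the check are kept.
sme_rows = [
    {"SME_ID": "18", "evidence_input_id": "1022",
     "adduct_ion": "[M+H]+", "exp_mass_to_charge": "147.112"},
    {"SME_ID": "19", "evidence_input_id": "1022",
     "adduct_ion": "[M+Na]+", "exp_mass_to_charge": "169.094"},
]


def find_conflicting_inputs(rows):
    """Return evidence_input_ids shared by rows whose inputs differ.

    Assumption: two rows have different inputs when their adduct or
    experimental m/z differ.
    """
    inputs_by_id = defaultdict(set)
    for row in rows:
        key = (row["adduct_ion"], row["exp_mass_to_charge"])
        inputs_by_id[row["evidence_input_id"]].add(key)
    # Keep only the IDs that map to more than one distinct input.
    return {eid: sorted(keys)
            for eid, keys in inputs_by_id.items() if len(keys) > 1}


conflicts = find_conflicting_inputs(sme_rows)
print(conflicts)
```

For the two rows above, the ID 1022 is reported because it covers both the [M+H]+ and the [M+Na]+ feature.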

A second possible issue is how much data to compress into each row. As I recall, we discussed this previously: ideally one SME row should be a single search event, unless the software aggregates multiple spectra (in this case) for a search. In this file, one row contains the combined results from searching lots of MS2 spectra, and presumably only the best score is reported. I think a preferred encoding would be to enumerate all searched spectra across multiple rows, so that it is explicit which fragmentation spectrum gave rise to the reported score.
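As a rough illustration of the first step towards a one-spectrum-per-row encoding, the sketch below splits a compressed spectra_ref value into its individual references. The function name `split_spectra_ref` is hypothetical, and the input is a shortened copy of the SME 19 value. Note that this only recovers which spectra were listed; a per-spectrum score cannot be derived from the aggregate row.

```python
def split_spectra_ref(spectra_ref):
    """Split a '|'-separated spectra_ref value into (ms_run, index) pairs."""
    pairs = []
    for ref in spectra_ref.split("|"):
        # Each reference looks like 'ms_run[1]:index=882'.
        run, _, index = ref.strip().partition(":index=")
        pairs.append((run, int(index)))
    return pairs


# Shortened copy of the spectra_ref value from SME 19.
compressed = "ms_run[1]:index=882 | ms_run[2]:index=870 | ms_run[3]:index=872"
refs = split_spectra_ref(compressed)
for run, index in refs:
    print(run, index)
```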

@jmrein any chance you could take a look?

kayrein commented 5 years ago

Hmmmmmm, I'm apparently having some issues grokking that column.

How about this: using the charged form of the Progenesis-style identifier in those columns (for Progenesis, and the purposes of this example). So SME18 would have evidence_input_id 21.37_147.1122m/z and SME19 would have evidence_input_id 21.37_169.0943m/z.

This has the properties that:

- different adduct forms of the same compound have different evidence_input_id
- different identifications from the same inputs have the same evidence_input_id
- different identifications for a multiply-adducted compound (e.g. isomers) will share the same set of evidence_input_ids.
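A small sketch of the proposed ID scheme, to make the first two properties concrete. The helper `make_evidence_input_id` is hypothetical, and it assumes the first part of the Progenesis-style identifier is a retention time formatted to two decimals and the second part an m/z formatted to four decimals, as in the two examples given above.

```python
def make_evidence_input_id(retention_time, mz):
    """Build an ID such as '21.37_147.1122m/z' (format assumed from the examples)."""
    return f"{retention_time:.2f}_{mz:.4f}m/z"


# Two adduct forms of the same compound get different IDs.
id_h = make_evidence_input_id(21.37, 147.1122)
id_na = make_evidence_input_id(21.37, 169.0943)
print(id_h, id_na)

# Two candidate identifications from the same input feature share one ID.
candidate_a = make_evidence_input_id(21.37, 147.1122)
candidate_b = make_evidence_input_id(21.37, 147.1122)
print(candidate_a == candidate_b)
```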

MetaScope scores fragmentation matches across the aggregate of fragmentation spectra (across both runs and adduct forms) so it will be naturally prone towards a compressed evidence table. That's relatively atypical so it would be worth having more example files, but in this case it's true to how the data was processed.

andrewrobertjones commented 5 years ago

Hi Joel,

Yes, this sounds good. Can you also put CV terms into the files instead of user params, now that they've been added:

[MS,MS:1002879,Progenesis QI,2.4.6505.48857]

best_id_confidence_measure

[MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,] [MS,MS:1002889,Progenesis MetaScope Score,]

etc

Thanks!

kayrein commented 5 years ago

I have made a pull request to fix this (including the CV terms bit): https://github.com/HUPO-PSI/mzTab/pull/167

nilshoffmann commented 5 years ago

Closing for now, changes were merged in #167

HUPO-PSI / mzTab

Some issues with file MTBLS263 #163