HUPO-PSI / mzTab

mzTab Reporting MS-based Proteomics and Metabolomics Results
https://hupo-psi.github.io/mzTab

SME collated or separated over several MS-runs – coelution and varying levels of evidence #133

Closed hartler closed 6 years ago

hartler commented 6 years ago

SME_collated.mztab.txt SME_separated.mztab.txt

The primary question is whether “similar” SME information from several MS-runs should be collated in a single line or split into separate ones. To illustrate both approaches, I attached an example of each version (SME_collated.mztab.txt and SME_separated.mztab.txt), taken from a sample containing authentic lipid standards (with known sn-positions of the attached chains). The only thing not present in this data is a true coelution, but I assume the false positive identifications are sufficient as examples.

A small remark regarding lipid nomenclature: the annotation of the same lipid can differ depending on the information obtainable from MS/MS. Examples for the same species:
• PE 36:1 (sum composition): only the total number of carbon atoms and double bonds in the chains is known, with no details about individual chains
• PE 18:0_18:1: the chains are known; in this case, PE 36:1 has the two fatty acyl chains 18:0 and 18:1
• PE 18:0/18:1: the position of the chains is known; in this case, 18:0 is at sn-1 and 18:1 at sn-2
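The three annotation levels above can be told apart from the separator used in the chain part of the name. The following is a minimal sketch, assuming names of the simple form shown in this issue (class, space, chains); the function names `annotation_level` and `sum_composition` are hypothetical and not part of mzTab or LDA.

```python
import re


def annotation_level(name: str) -> str:
    """Classify a lipid name by how much chain detail it carries."""
    # Drop the lipid class (e.g. "PE") and keep only the chain part.
    _, chains = name.split(" ", 1)
    if "/" in chains:
        return "sn-position"        # e.g. PE 18:0/18:1
    if "_" in chains:
        return "chain composition"  # e.g. PE 18:0_18:1
    return "sum composition"        # e.g. PE 36:1


def sum_composition(name: str) -> str:
    """Reduce any of the three annotation levels to the sum composition."""
    lipid_class, chains = name.split(" ", 1)
    parts = re.split(r"[/_]", chains)
    carbons = sum(int(p.split(":")[0]) for p in parts)
    double_bonds = sum(int(p.split(":")[1]) for p in parts)
    return f"{lipid_class} {carbons}:{double_bonds}"


for example in ("PE 36:1", "PE 18:0_18:1", "PE 18:0/18:1"):
    print(example, "->", annotation_level(example), "|", sum_composition(example))
```

All three names reduce to the same sum composition, which is why they can end up grouped under one feature even though the evidence behind them differs.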

In order to clarify the obtained evidence, I added the column “opt_lda_identification” for both examples, and additionally the columns “opt_lda_id_ms_run[1…x]” in the collated version.

Identifications where chain information was not obtainable in all MS-runs: For the collated version, if chain information is not always present, I split the evidence into two lines (e.g. lines 428 and 429, which correspond to the SME_IDs 46 and 47). In this example, for PE 36:1 at 29.05 min, the chains (18:0/18:1) were identified in 4 out of 5 cases, while in MS-run 4 no chain identification was possible. This split might be slightly confusing for users, since the entries in the line “PE 36:1_29.05” would be empty for the 4 runs where chains were detected, and the entry in the line “PE 36:1_29.05 | 18:0/18:1” would be empty for the one run where no chains were assigned. For the separated version, the same information is split into five lines, and each line makes evident what was detected (see lines 562-566 in SME_separated.mztab.txt).

A further collation of lines 428 and 429 (SME_ID 5 and 6) would make no sense in my opinion, since the same peak might also contain coeluting species (several different chain combinations). You can see this in lines 387 and 388, where for PG 32:0 at 24.10 min, in addition to the correct 16:0/16:0, the false positive 6:0_26:0 was detected in MS-run 3. Here it is a false positive, but in biological data several coeluting species in one peak occur frequently.
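To make the difference between the two layouts concrete, here is a small sketch built on the PE 36:1 case described above (chains identified in MS-runs 1, 2, 3 and 5, sum composition only in MS-run 4). The dictionary keys and function names are illustrative assumptions, not the actual mzTab column layout of the attached files.

```python
from collections import defaultdict

# Identification obtained per MS-run for the peak at 29.05 min,
# as described in the text (run 4 yielded no chain information).
per_run_id = {
    1: "PE 18:0/18:1",
    2: "PE 18:0/18:1",
    3: "PE 18:0/18:1",
    4: "PE 36:1",
    5: "PE 18:0/18:1",
}


def separated_rows(per_run):
    """One row per MS-run: each row states exactly what that run showed."""
    return [{"ms_run": run, "identification": ident}
            for run, ident in sorted(per_run.items())]


def collated_rows(per_run):
    """One row per distinct identification, listing the supporting runs."""
    runs_by_id = defaultdict(list)
    for run, ident in sorted(per_run.items()):
        runs_by_id[ident].append(run)
    return [{"identification": ident, "ms_runs": runs}
            for ident, runs in runs_by_id.items()]


print(len(separated_rows(per_run_id)), "separated rows")
for row in collated_rows(per_run_id):
    print(row)
```

The separated layout gives five rows, while the collated layout gives two rows whose run lists are complementary, which mirrors the partly empty entries described above.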

Identifications where chain information is always present, but positions cannot be assigned in all cases: For the collated version, I use a single line here (see line 445 of SME_collated.mztab.txt). Cases where no position is assigned are typically close to the threshold at which a position would have been assigned; I deliberately designed the decision rules for position determination conservatively to avoid wrong deductions.

Advantages of the collated version:
• Consumes less disk space, since similar information is not repeated
• Information is more concise, but should be extended by optional columns to clarify the truly detected evidence (such as the additional optional columns mentioned above)

Advantages of the separated version:
• The ‘spectra_ref’ column is easier to read, since it contains only spectra references of one file; otherwise the entries can become enormous, which might be problematic (possibly for Excel?) when a hundred or more MS-runs are compared
• Each line presents the truly derived evidence without potential confusion
• It is advantageous for data post-processing methods, e.g. for detection of false positive identifications, since columns such as ‘exp_mass_to_charge’ can be assigned directly to an individual MS-run and are not a (weighted) mean of the individual runs
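The last point can be illustrated with a small sketch: a per-run mass deviation that would fail a tolerance check can be hidden once the runs are averaged into one collated value. All numbers below (theoretical m/z, observed values, weights, and the 10 ppm tolerance) are made up for illustration and do not come from the attached files.

```python
def ppm_error(observed: float, theoretical: float) -> float:
    """Mass deviation in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def weighted_mean(values, weights):
    """Weighted mean, as a collated 'exp_mass_to_charge' might be computed."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical values: three MS-runs, the third one deviating strongly.
theoretical_mz = 500.0000
per_run_mz = [500.0005, 500.0004, 500.0100]
weights = [1.0, 1.0, 1.0]
tolerance_ppm = 10.0

# Separated layout: each run can be checked on its own.
flagged_runs = [run for run, mz in enumerate(per_run_mz, start=1)
                if abs(ppm_error(mz, theoretical_mz)) > tolerance_ppm]

# Collated layout: only the averaged value is available.
collated_mz = weighted_mean(per_run_mz, weights)
collated_ok = abs(ppm_error(collated_mz, theoretical_mz)) <= tolerance_ppm

print("runs outside tolerance:", flagged_runs)
print("collated value within tolerance:", collated_ok)
```

With these made-up numbers the third run is flagged individually, while the averaged value still passes the same tolerance.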

In summary, I would say both versions have their pros and cons; advice is appreciated.

The MTBLS263.mztab.txt example looks rather like the collated version, since spectra_refs to several MS-runs are given.

andrewrobertjones commented 6 years ago

Discussed on the call today. For the collated example, we think you should reference only the most specific id, with optional columns for the extra levels.

We prefer the model in which you have multiple rows for different evidence streams.
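One possible reading of the suggestion for the collated case is sketched below: the row carries the most specific identification found in any run, and the per-run levels go into optional columns. This is an interpretation, not something specified in the thread. The `identification` key and the `specificity` and `collate` functions are hypothetical; only the `opt_lda_id_ms_run[n]` column naming is taken from the example files described above.

```python
def specificity(name: str) -> int:
    """Rank an annotation: 2 = sn-position, 1 = chains, 0 = sum composition."""
    chains = name.split(" ", 1)[1]
    if "/" in chains:
        return 2
    if "_" in chains:
        return 1
    return 0


def collate(per_run: dict) -> dict:
    """Build one row: most specific id plus one optional column per MS-run."""
    row = {"identification": max(per_run.values(), key=specificity)}
    for run, ident in sorted(per_run.items()):
        row[f"opt_lda_id_ms_run[{run}]"] = ident
    return row


row = collate({1: "PE 18:0/18:1", 2: "PE 18:0_18:1", 4: "PE 36:1"})
for column, value in row.items():
    print(column, "=", value)
```

Coeluting species with different chain combinations would still need separate rows, in line with the stated preference for multiple rows per evidence stream.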