HUPO-PSI / mzQC

Reporting and exchange format for mass spectrometry quality control data
https://hupo-psi.github.io/mzQC/
Creative Commons Attribution 4.0 International

[CV request] PSM detail table #156

Closed mwalzer closed 8 months ago

mwalzer commented 3 years ago

Purpose: I'd like a table that is not specifically a metric. It is often desirable to have identification information on PSMs in an easily accessible format at a point in a workflow where you cannot or would not go back to the source files (like mzid), say for visualisation, or just to be able to calculate some new metrics. In essence, most columns would be optional, to give the most flexibility in what might be considered useful for each use case and to accommodate mzQC file producers who are very conscious of file size. In the same manner, this table could be held in an internally used mzQC file and then be dropped from the final version of the mzQC, where all statistics and distributions deemed necessary are already calculated (e.g. MS2 ion collection time distribution QC:4000152 - QC:4000156).

Like this:

[Term]
id: QC:4000271
name: Precursor error (ppm) column
def: "Precursor error of identifications in ppm." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units NCIT:C48523 ! Part Per Million

[Term]
id: QC:4000272
name: Precursor error (Da) column
def: "Precursor error of identifications in Dalton." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units NCIT:C41127 ! Unified Atomic Mass Unit

[Term]
id: QC:4000273
name: Identification score column
def: "Search engine specific score of identifications." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units MS:1001143 ! PSM-level search engine specific statistic

[Term]
id: QC:4000274
name: Missed cleavages column
def: "The number of missed cleavages of identifications." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000189 ! count unit

[Term]
id: QC:4000275
name: Identified precursor intensity column
def: "The amount of identified precursor intensity in percent. This is the percentage of precursor intensity that can be explained by the summed intensity of MS2 peaks contributing to the identification." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000187 ! percent

[Term]
id: QC:4000276
name: PSM detail table
def: "Table of details on peptide-spectrum matches (PSMs). Note that at least one of the MS:1000767 or MS:1003063 columns is required." [PSI:QC]
comment: A table rather than a metric as such. It is often desirable to have this information in an easily accessible format at a point in a workflow where one cannot or would not go back to the source files (like mzid).
is_a: QC:4000006 ! table
is_a: QC:4000009 ! ID based
relationship: has_column QC:4000116 ! Peptide sequence column
relationship: has_optional_column MS:1000767 ! native spectrum identifier
relationship: has_optional_column MS:1003063 ! universal Spectrum Identifier
relationship: has_optional_column QC:4000111 ! Protein accession column
relationship: has_optional_column QC:4000112 ! Length value column
relationship: has_optional_column QC:4000113 ! Target or decoy designation column
relationship: has_optional_column QC:4000271 ! Precursor error (ppm) column
relationship: has_optional_column QC:4000272 ! Precursor error (Da) column
relationship: has_optional_column QC:4000273 ! Identification score column
relationship: has_optional_column QC:4000274 ! Missed cleavages column
relationship: has_optional_column QC:4000275 ! Identified precursor intensity column
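
To make the intent concrete, here is a minimal Python sketch of how such a table could look as a column-oriented mzQC value, with one required column and a few optional ones. The accession is the one proposed above (not an accepted CV term), the use of the column names as keys is an assumption, and the rows are made-up illustration data.

```python
import json

# Hypothetical PSM detail table: one list per column, all of equal length.
psm_table = {
    "accession": "QC:4000276",  # proposed in this issue, not an accepted term
    "name": "PSM detail table",
    "value": {
        "Peptide sequence column": ["PEPTIDEK", "SAMPLER"],  # required
        "Precursor error (ppm) column": [1.8, -3.2],          # optional
        "Missed cleavages column": [0, 1],                    # optional
    },
}


def check_table(metric, required=("Peptide sequence column",)):
    """Return the row count if required columns exist and all columns align."""
    columns = metric["value"]
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("columns differ in length")
    return lengths.pop()


n_rows = check_table(psm_table)
print(json.dumps(psm_table, indent=2))
```

A producer that is conscious of file size would simply leave out the optional keys; the check only insists on the required column and on equal column lengths.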

mwalzer commented 3 years ago

I don't think the Identification score column solution is very nice. Any ideas on how to make it more robust (multiple scores, etc.)?

mwalzer commented 3 years ago

Do you think it would be better to split the table into different aspects of the identifications? A table for the scoring, one for the accuracy, one for the sequences, maybe hydrophobicity, and so on. Otherwise, the list of optional columns would need to be amended every now and then and, as a result, would get even longer than it is now.

cbielow commented 3 years ago

I don't think the Identification score column solution is very nice. Any ideas on how to make it more robust (multiple scores, etc.)?

Tricky... One could use a column like FDR q-value, which is somewhat universal but leaves room for interpretation as to HOW the FDR was calculated (many possible algorithms exist for the same thing). Alternatively, maybe a 'primary score' column and a 'primary score type' column, the latter containing a CV term that states which score it is. This column would be highly redundant (containing only a single value for all rows), so not ideal. But I cannot think of anything more elegant when we restrict ourselves to a single independent table...
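
A rough Python sketch of the two-column idea, to show where the redundancy comes from. The column names are hypothetical, the score values are made up, and MS:1001143 is simply the accession already used in the proposal above.

```python
# Hypothetical columns: every row repeats the same score-type accession.
score_table = {
    "Primary score column": [0.001, 0.004, 0.012],
    "Primary score type column": ["MS:1001143"] * 3,
}


def score_types(table):
    """Collect the distinct CV accessions found in the score-type column."""
    return set(table["Primary score type column"])


def is_redundant(table):
    """True if the type column carries a single value for all rows."""
    return len(score_types(table)) == 1


print(score_types(score_table), is_redundant(score_table))
```

When `is_redundant` holds, the type column adds nothing per row, which is the drawback described above for a single independent table.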

cbielow commented 3 years ago

Do you think it would be better to split the table into different aspects of the identifications? A table for the scoring, one for the accuracy, one for the sequences, maybe hydrophobicity, and so on. Otherwise, the list of optional columns would need to be amended every now and then and, as a result, would get even longer than it is now.

That's fine IMHO.

mwalzer commented 3 years ago

For completeness' sake, I should note that we already have another entry covering parts of the request's purpose:

id: QC:4000244
name: QC2 sample mass accuracies
...

mwalzer commented 3 years ago

There also is already:

[Term]
id: QC:4000243
name: Observed mass accuracies column
def: "Observed mass accuracy calculated by 1E6 x (observed mz - theoretical mz)/theoretical mz of selected peptides." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy

Which IMO needs its name refined (reflecting the ppm element), so it would be:

[Term]
id: QC:4000243
name: Observed mass accuracies (ppm) column
def: "Observed mass accuracy calculated by 1E6 x (observed mz - theoretical mz)/theoretical mz of selected peptides." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy
relationship: has_units NCIT:C48523 ! Part Per Million
synonym: "Precursor error (ppm) column" EXACT []

[Term]
id: QC:4000271
name: Observed mass accuracies (Da) column
def: "Observed mass accuracy calculated by subtracting the theoretical mz from the observed mz." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy
relationship: has_units NCIT:C41127 ! Unified Atomic Mass Unit
synonym: "Precursor error (Da) column" EXACT []

[Term]
id: QC:4000272
name: Identification score column
def: "Search engine specific score of identifications." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units MS:1001143 ! PSM-level search engine specific statistic

[Term]
id: QC:4000273
name: Missed cleavages column
def: "The number of missed cleavages of identifications." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000189 ! count unit

[Term]
id: QC:4000274
name: Identified precursor intensity column
def: "The amount of identified precursor intensity in percent. This is the percentage of precursor intensity that can be explained by the summed intensity of MS2 peaks contributing to the identification." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000187 ! percent

[Term]
id: QC:4000275
name: PSM detail table
def: "Table of details on peptide-spectrum matches (PSMs). Note that at least one of the MS:1000767 or MS:1003063 columns is required." [PSI:QC]
comment: A table rather than a metric as such. It is often desirable to have this information in an easily accessible format at a point in a workflow where one cannot or would not go back to the source files (like mzid).
is_a: QC:4000006 ! table
is_a: QC:4000009 ! ID based
relationship: has_column QC:4000116 ! Peptide sequence column
relationship: has_optional_column MS:1000767 ! native spectrum identifier
relationship: has_optional_column MS:1003063 ! universal Spectrum Identifier
relationship: has_optional_column QC:4000111 ! Protein accession column
relationship: has_optional_column QC:4000112 ! Length value column
relationship: has_optional_column QC:4000113 ! Target or decoy designation column
relationship: has_optional_column QC:4000243 ! Observed mass accuracies (ppm) column
relationship: has_optional_column QC:4000271 ! Observed mass accuracies (Da) column
relationship: has_optional_column QC:4000272 ! Identification score column
relationship: has_optional_column QC:4000273 ! Missed cleavages column
relationship: has_optional_column QC:4000274 ! Identified precursor intensity column
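
The two mass accuracy columns above can be sketched in Python as follows. The ppm formula follows the QC:4000243 definition; scaling the m/z difference by the charge to obtain a mass difference in Da is an assumption that the term definitions do not spell out, and the example values are made up.

```python
def mz_error_ppm(observed_mz, theoretical_mz):
    """Relative m/z error in ppm: 1E6 x (observed - theoretical) / theoretical."""
    return 1e6 * (observed_mz - theoretical_mz) / theoretical_mz


def mz_error_da(observed_mz, theoretical_mz, charge=1):
    """Absolute error; multiplying by charge (an assumption) turns the
    m/z difference into a mass difference."""
    return (observed_mz - theoretical_mz) * charge


# Made-up precursor: observed and theoretical m/z of a doubly charged ion.
observed, theoretical, z = 500.0005, 500.0, 2
print(f"{mz_error_ppm(observed, theoretical):.3f} ppm")
print(f"{mz_error_da(observed, theoretical, z):.4f} Da")
```

Both functions use the same sign convention (observed minus theoretical), so a positive value in one column always goes with a positive value in the other.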

cbielow commented 3 years ago

thoughts from the TC:

bittremieux commented 3 years ago

With the columns you have outlined, this is basically recreating the PSM section of mzTab. That seems a bit silly to me. Can we use mzTab for the "raw" data and then list that as an input file? The mzQC would then just include the higher-level data.
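
As a rough illustration of that route, the PSM section of an mzTab file can be read with a few lines of Python. This is only a sketch: it assumes the tab-separated `PSH` header and `PSM` row prefixes of mzTab, the inline sample and its values are made up, and `read_psm_rows` is a hypothetical helper rather than part of any library.

```python
import csv
import io

# Made-up mzTab fragment with only a handful of PSM columns.
SAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "PSH\tsequence\tPSM_ID\tcharge\texp_mass_to_charge\tcalc_mass_to_charge",
    "PSM\tPEPTIDEK\t1\t2\t464.7375\t464.7371",
    "PSM\tSAMPLER\t2\t2\t402.2110\t402.2116",
])


def read_psm_rows(handle):
    """Return the PSM rows as dicts keyed by the PSH column names."""
    header = None
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip blank lines
        if fields[0] == "PSH":
            header = fields[1:]
        elif fields[0] == "PSM":
            if header is None:
                raise ValueError("PSM row found before PSH header")
            rows.append(dict(zip(header, fields[1:])))
    return rows


psms = read_psm_rows(io.StringIO(SAMPLE))
print(len(psms), psms[0]["sequence"])
```

Higher-level metrics would then be computed from rows like these and stored in the mzQC, while the mzTab file itself is only referenced.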

bittremieux commented 8 months ago

This is (partially) covered by current metrics MS:4000078 | QC2 sample mass accuracies and MS:4000079 | QC2 sample intensities. Other information is present in alternative identification files (mzTab, mzIdentML) and doesn't need to be fully reproduced in mzQC.