Status: Closed (mwalzer closed 8 months ago)
The Identification score column solution is not so nice, I think. Any ideas on how to solve this in a more stable way? (multiple scores, etc.)
Do you think it would be better to split the table into different aspects of the identifications? A table for the scoring, one for the accuracy, one for the sequences, maybe hydrophobicity, and so on. Otherwise, the list of optional columns would need to be amended every now and then and, as a result, would get even longer than it is now.
> The Identification score column solution is not so nice, I think. Any ideas on how to solve this in a more stable way? (multiple scores, etc.)
Tricky...
One could use a column like FDR q-value, which is somewhat universal but leaves room for interpretation of HOW the FDR was calculated (many possible algorithms for the same thing exist)...
Alternatively, maybe a 'primary score' column and a 'primary score type' column which contains a CV term stating which score it is. This column would be highly redundant (containing only a single value for all rows), so not ideal. But I cannot think of anything more elegant when we restrict ourselves to a single independent table...
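To make the trade-off concrete, here is a minimal Python sketch of the 'primary score' plus 'primary score type' idea. Everything in it is an assumption for illustration: the column-oriented dictionary layout, the column names, the peptide sequences and score values, and the helper `score_type`. The accession MS:1001143 is only a stand-in for whichever CV term would name the actual score.

```python
# Hypothetical column-oriented table; names and values are illustrative only.
psm_table = {
    "peptide sequence": ["PEPTIDEK", "SAMPLER", "ELVISK"],
    "primary score": [0.001, 0.02, 0.15],
    # The same CV accession is repeated on every row: this is the
    # redundancy that makes the approach less than ideal.
    "primary score type": ["MS:1001143"] * 3,
}


def score_type(table):
    """Return the single score type shared by all rows, or raise if mixed."""
    types = set(table["primary score type"])
    if len(types) != 1:
        raise ValueError(f"expected one score type, found {sorted(types)}")
    return types.pop()


print(score_type(psm_table))
```

A consumer would call `score_type` once to learn how to interpret the whole 'primary score' column, which shows that the type column carries table-level rather than row-level information.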
> Do you think it would be better to split the table into different aspects of the identifications? A table for the scoring, one for the accuracy, one for the sequences, maybe hydrophobicity, and so on. Otherwise, the list of optional columns would need to be amended every now and then and, as a result, would get even longer than it is now.
That's fine IMHO.
I think I should note, for completeness' sake, that we have another entry covering part of this request's purpose:
id: QC:4000244
name: QC2 sample mass accuracies
...
There is also already:
[Term]
id: QC:4000243
name: Observed mass accuracies column
def: "Observed mass accuracy calculated by 1E6 x (observed mz - theoretical mz)/theoretical mz of selected peptides." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy
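As a quick check of the formula in the definition above, the following Python sketch computes the ppm error for a few observed/theoretical m/z pairs. The function name `mass_error_ppm` and the example values are assumptions made for illustration, not part of any mzQC tooling.

```python
def mass_error_ppm(observed_mz, theoretical_mz):
    """Relative error in ppm: 1E6 x (observed mz - theoretical mz) / theoretical mz."""
    return 1e6 * (observed_mz - theoretical_mz) / theoretical_mz


# Illustrative (observed, theoretical) m/z pairs; not real measurements.
pairs = [(500.001, 500.0), (750.3745, 750.3760), (1000.0, 1000.0)]

# One value per selected peptide, as the column would hold them.
column = [mass_error_ppm(obs, theo) for obs, theo in pairs]
for (obs, theo), ppm in zip(pairs, column):
    print(f"observed={obs} theoretical={theo} error={ppm:.2f} ppm")
```

With this sign convention, a positive value means the observed m/z is higher than the theoretical one.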
IMO QC:4000243 needs its name refined (to reflect the ppm unit), so it would be:
[Term]
id: QC:4000243
name: Observed mass accuracies (ppm) column
def: "Observed mass accuracy calculated by 1E6 x (observed mz - theoretical mz)/theoretical mz of selected peptides." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy
relationship: has_units NCIT:C48523 ! Part Per Million
synonym: "Precursor error (ppm) column" EXACT []
[Term]
id: QC:4000271
name: Observed mass accuracies (Da) column
def: "Observed mass accuracy calculated by subtracting the theoretical mz from the observed mz." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_type MS:1000014 ! accuracy
relationship: has_units NCIT:C41127 ! Unified Atomic Mass Unit
synonym: "Precursor error (Da) column" EXACT []
[Term]
id: QC:4000272
name: Identification score column
def: "The search engine score of identifications, given as a PSM-level search engine specific statistic." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units MS:1001143 ! PSM-level search engine specific statistic
[Term]
id: QC:4000273
name: Missed cleavages column
def: "The number of missed cleavages of identifications." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000189 ! count unit
[Term]
id: QC:4000274
name: Identified precursor intensity column
def: "The amount of identified precursor intensity in percent. This is the percentage of precursor intensity that can be explained by the summed intensity of MS2 peaks contributing to the identification." [PSI:QC]
is_a: QC:4000107 ! Table column type
relationship: has_units UO:0000187 ! percent
[Term]
id: QC:4000275
name: PSM detail table
def: "A table of identification details on PSMs. Note that at least one of the MS:1000767 or MS:1003063 columns is required." [PSI:QC]
comment: A table, not specifically a metric, but it is still often desirable to have that information in an easily accessible format at a point in a workflow where you would not or cannot go back to the source files (like mzid).
is_a: QC:4000006 ! table
is_a: QC:4000009 ! ID based
relationship: has_column QC:4000116 ! Peptide sequence column
relationship: has_optional_column MS:1000767 ! native spectrum identifier
relationship: has_optional_column MS:1003063 ! universal Spectrum Identifier
relationship: has_optional_column QC:4000111 ! Protein accession column
relationship: has_optional_column QC:4000112 ! Length value column
relationship: has_optional_column QC:4000113 ! Target or decoy designation column
relationship: has_optional_column QC:4000243 ! Observed mass accuracies (ppm) column
relationship: has_optional_column QC:4000271 ! Observed mass accuracies (Da) column
relationship: has_optional_column QC:4000272 ! Identification score column
relationship: has_optional_column QC:4000273 ! Missed cleavages column
relationship: has_optional_column QC:4000274 ! Identified precursor intensity column
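The column rules of the proposed PSM detail table can be sketched as a small Python check. This is only an illustration under stated assumptions: the table is represented as a plain dictionary mapping column accessions to value lists, and `check_psm_table` is a hypothetical helper, not an existing mzQC validator. The accessions are the ones listed in the term above.

```python
REQUIRED = {"QC:4000116"}                    # Peptide sequence column
SPECTRUM_IDS = {"MS:1000767", "MS:1003063"}  # at least one must be present
OPTIONAL = SPECTRUM_IDS | {
    "QC:4000111", "QC:4000112", "QC:4000113", "QC:4000243",
    "QC:4000271", "QC:4000272", "QC:4000273", "QC:4000274",
}


def check_psm_table(columns):
    """Return a list of problems found in a {accession: values} table."""
    problems = []
    present = set(columns)
    for acc in sorted(REQUIRED - present):
        problems.append(f"missing required column {acc}")
    if not (SPECTRUM_IDS & present):
        problems.append("need MS:1000767 or MS:1003063")
    for acc in sorted(present - REQUIRED - OPTIONAL):
        problems.append(f"unexpected column {acc}")
    # All columns must describe the same rows, so lengths must agree.
    if len({len(values) for values in columns.values()}) > 1:
        problems.append("columns differ in length")
    return problems


# Illustrative table with made-up values.
table = {
    "QC:4000116": ["PEPTIDEK", "ELVISK"],
    "MS:1000767": ["scan=1", "scan=2"],
    "QC:4000273": [0, 1],
}
print(check_psm_table(table))
```

If any column type were allowed in the table, the "unexpected column" check and the `OPTIONAL` set could simply be dropped, leaving only the required-column and length checks.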
thoughts from the TC:
any column type is fine? So simply adding a new column CV term would implicitly allow using it in the PSM detail table.
With the columns you have outlined, this is basically recreating the PSM section of mzTab. That seems a bit silly to me. Can we use mzTab for the "raw" data and then list that as input file? And then the mzQC would just include the higher-level data.
This is (partially) covered by current metrics MS:4000078 | QC2 sample mass accuracies and MS:4000079 | QC2 sample intensities. Other information is present in alternative identification files (mzTab, mzIdentML) and doesn't need to be fully reproduced in mzQC.
Purpose: I'd like a table that is not specifically a metric. It is still often desirable to have the identification information on PSMs in an easily accessible format at a point in a workflow where you would not or cannot go back to the source files (like mzid), say, for visualisation, or just to be able to calculate some new metrics. In essence, most columns would be optional, to give the most flexibility in what is considered useful for each use case and in case an mzQC file producer is very conscious of file size. In the same manner, this table could be held in an internally used mzQC file and then be dropped from the final version of the mzQC, where all statistics and distributions deemed necessary are already calculated (e.g. MS2 ion collection time distribution QC:4000152 - QC:4000156).
Like this: