bigbio / quantms.io

The proteomics quantification format, extending mzTab for large scale datasets.

Create the first PSM example #59

Open ypriverol opened 1 month ago

ypriverol commented 1 month ago

The format is now at a solid 1.0 beta, so we should create the first example for the PSM format.

mobiusklein commented 1 month ago

While implementing a FragPipe reader, I noticed a few things:

  1. scan_number is defined as a string, but the name implies a number. Is this intended to be a spectrum nativeID? I think the intent is for it to be a number, but this precludes the ability to denote "regions" of IM-MS frames/cycles directly.
  2. The distinction between modifications and modification_details isn't obvious. The latter may be referring to MS:1001471|peptide modification details, but I doubt this.
  3. Is peptidoform formally in ProForma 2 notation, or some other format?
  4. Going back to modifications, we can store this as a sub-structure instead of storing it as a string.
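
To make point 4 concrete, here is a minimal sketch of what a structured modification record could look like. The class and field names (`Modification`, `name`, `accession`, `positions`) are hypothetical and are not taken from the quantms.io specification; they only illustrate the idea of replacing a string with typed fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Modification:
    """Hypothetical structured form of one modification on a peptide."""
    name: str                                            # human-readable name, e.g. "Phospho"
    accession: Optional[str] = None                      # controlled-vocabulary accession, if known
    positions: List[int] = field(default_factory=list)   # 1-based residue positions


# A PSM would then carry a list of such records instead of one encoded string.
mods = [Modification(name="Phospho", accession="UNIMOD:21", positions=[3, 7])]
```

With typed fields, a reader can filter by accession or position without parsing a custom string encoding.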

zprobot commented 1 month ago

  1. In some DIANN outputs, the exported spectrum ID has the form sample=1 period=1 cycle=1 experiment=1, which carries no scan number.
  2. In the future, modification_details will hold the score of each phospho site: when you compute phospho localization scores, you need to store the corresponding value for each site.
  3. Yes, peptidoform is officially represented using ProForma 2 notation.

ypriverol commented 1 month ago

@zprobot @mobiusklein scan_number is intended to follow the USI standard: it can be an index, a scan, or a nativeId.

This is the specification:

3.6.4 Use of nativeId instead of scan as an indexType

As mentioned above, the goal of the USI is to refer to an original scan event that generated a spectrum, and using the indexType “scan” is preferred. However, for some instrument types (most vendors other than Thermo Scientific), a single scan number cannot uniquely identify a spectrum, and instead a set of integers is required to identify a scan. This issue was solved in mzML (5) via the use of the nativeId mechanism. As an example, one scan event is identified in an mzML file converted from a SCIEX WIFF file with:

sample=1 period=1 cycle=2740 experiment=10

In this scenario, where reference to the original scan event is desired but a single scan number is not sufficient, the USI must be formed with a compact form of the nativeId mechanism: the tag “nativeId” MUST be placed in the indexType field, followed by a comma-separated set of integers that correspond to the full-length nativeId as indexNumber. Therefore, a USI employing this mechanism might look like:

mzspec:PXD001464:CL_1hRP_rep3:nativeId:1,1,2740,10

The number and order of the values is vendor specific and is defined by the nativeId controlled vocabulary terms in the PSI-MS controlled vocabulary as children of term MS:1000767 (http://purl.obolibrary.org/obo/MS_1000767). A few examples are provided below. See the CV for the full set:

SCIEX WIFF format (MS:1000770): sample=1 period=1 cycle=2740 experiment=10 → nativeId:1,1,2740,10

Waters nativeId format (MS:1000769): function=10 process=1 scan=345 → nativeId:10,1,345

Bruker TDF format (MS:1002818): frame=120 scan=475 → nativeId:120,475

Thermo nativeId format (MS:1000768) SHOULD NOT be expressed as a nativeId, but rather as a scan:

controllerType=0 controllerNumber=1 scan=43920 → scan:43920

since the controllerType and controllerNumber are always 0 and 1 for mass spectra. In rare cases, if either controllerType is not 0 or controllerNumber is not 1 (e.g., a PDA spectrum is being referenced), then the nativeId form MUST be used:

controllerType=5 controllerNumber=1 scan=7 → nativeId:5,1,7

The use of the scan:43920 form means that controllerType=0 controllerNumber=1.
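
The conversion rules above can be sketched in a few lines of Python. This is an illustration only, not part of the USI specification or of quantms.io: the function name `native_id_to_usi_index` is hypothetical, and the sketch assumes the nativeId is a space-separated list of `key=integer` pairs that are already in the vendor's CV-defined order (it does not check that order against the CV).

```python
import re

# Key order of a Thermo nativeId, as shown in the examples above.
THERMO_KEYS = ("controllerType", "controllerNumber", "scan")


def native_id_to_usi_index(native_id: str) -> str:
    """Return the 'indexType:indexNumber' part of a USI for a full-length nativeId."""
    pairs = re.findall(r"(\w+)=(\d+)", native_id)
    if not pairs:
        raise ValueError(f"no key=value pairs found in {native_id!r}")
    keys = tuple(key for key, _ in pairs)
    values = [value for _, value in pairs]
    # Thermo mass spectra (controllerType=0, controllerNumber=1) use the scan form.
    if keys == THERMO_KEYS and values[:2] == ["0", "1"]:
        return f"scan:{values[2]}"
    # Everything else keeps all integers, in the order given.
    return "nativeId:" + ",".join(values)


print(native_id_to_usi_index("sample=1 period=1 cycle=2740 experiment=10"))
print(native_id_to_usi_index("controllerType=0 controllerNumber=1 scan=43920"))
```

The two calls print `nativeId:1,1,2740,10` and `scan:43920`, matching the examples in the specification text.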

The order of the keys is crucial: the values must appear in the order defined by the PSI-MS CV nativeId format. For example, the following USI can be resolved by nativeId:

mzspec:PXD001587:18302_REP2_500ng_HumanLysate_SWATH_2.mzML:nativeId:1,1,2,2:HAVSEGTK
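
For a reader that stores such a value in scan_number, splitting a USI back into its fields might look like the sketch below. The names `ParsedUSI` and `parse_usi` are hypothetical, and the sketch assumes the run name contains no colon; the split is limited to five cuts so that colons inside the interpretation (for example in ProForma modification tags) are preserved.

```python
from typing import NamedTuple, Optional


class ParsedUSI(NamedTuple):
    collection: str
    ms_run: str
    index_type: str
    index_number: str
    interpretation: Optional[str]


def parse_usi(usi: str) -> ParsedUSI:
    """Split a USI into its colon-separated fields (simplified, see assumptions)."""
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a recognised USI: {usi!r}")
    _, collection, ms_run, index_type, index_number = parts[:5]
    if index_type not in {"scan", "index", "nativeId"}:
        raise ValueError(f"unknown index type: {index_type!r}")
    # The interpretation (peptidoform) is optional.
    interpretation = parts[5] if len(parts) == 6 else None
    return ParsedUSI(collection, ms_run, index_type, index_number, interpretation)


usi = "mzspec:PXD001587:18302_REP2_500ng_HumanLysate_SWATH_2.mzML:nativeId:1,1,2,2:HAVSEGTK"
parsed = parse_usi(usi)
print(parsed.index_type, parsed.index_number, parsed.interpretation)
```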

@zprobot can you update the documentation?

Regarding peptidoform, can you also update the documentation to state that it is in ProForma notation?