MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

How to get sample-level peptide identification data? #126

Closed rkimoakbioinformatics closed 2 years ago

rkimoakbioinformatics commented 2 years ago

Describe the question or problem

How to get sample-level peptide identification data?

Details

I am following the Common Data Analysis Pipeline (CDAP) description at https://cptac-data-portal.georgetown.edu/cptac/documents/CDAP_description_20140225.pdf to reproduce CPTAC datasets. I am using https://cptac-data-portal.georgetown.edu/study-summary/S038 as a test dataset. I downloaded the mzML file 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzML and ran

java -Xmx3500M -jar ./MSGFPlus.jar -d ./proteins.fasta -t 20ppm -e 1 -m 3 -inst 3 -protocol 4 -ntt 1 -tda 1 -ti 0,1 -n 1 -maxLength 50 -mod $modpath -s 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzML.mzML -o 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzid

It produced the mzid file. I ran

mono ./MzidToTsvConverter/MzidToTsvConverter.exe 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzid

and it produced the file 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.tsv.

Neither the mzid file nor the tsv file contains sample-level peptide identification data. However, the mzid and psm files downloaded from https://cptac-data-portal.georgetown.edu/study-summary/S038 do contain sample-level data under SpectrumIdentificationItem.

I'm new to MS-GF+. Can anyone help with this?

Useful extras

java -Xmx3500M -jar ./MSGFPlus.jar -d ./proteins.fasta -t 20ppm -e 1 -m 3 -inst 3 -protocol 4 -ntt 1 -tda 1 -ti 0,1 -n 1 -maxLength 50 -mod $modpath -s 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzML.mzML -o 01CPTAC_OVprospective_W_JHUZ_20161209_QE_f01.mzid

java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2, mixed mode, sharing)

alchemistmatt commented 2 years ago

Thank you for clearly explaining where you obtained your data files and how you analyzed them. Please clarify what you mean by the .mzid files from the CPTAC website having "sample-level peptide identification data" that is not in the .mzid files created by MS-GF+.

Do you mean the TMT reporter ion abundances? For example, this in the .mzid file:

            <userParam name="CPTAC-CDAP:TMT10-126" value="36232.4/-0.20"/>
            <userParam name="CPTAC-CDAP:TMT10-127N" value="30440/-0.42"/>
            <userParam name="CPTAC-CDAP:TMT10-127C" value="29381.2/-0.42"/>
            <userParam name="CPTAC-CDAP:TMT10-128N" value="7019.51/-0.40"/>
            <userParam name="CPTAC-CDAP:TMT10-128C" value="24031.9/-0.22"/>
            <userParam name="CPTAC-CDAP:TMT10-129N" value="13974/-0.17"/>
            <userParam name="CPTAC-CDAP:TMT10-129C" value="8465.62/-0.44"/>
            <userParam name="CPTAC-CDAP:TMT10-130N" value="37748.8/-0.42"/>
            <userParam name="CPTAC-CDAP:TMT10-130C" value="33107.2/-0.45"/>
            <userParam name="CPTAC-CDAP:TMT10-131" value="43514.9/-0.22"/>
            <userParam name="CPTAC-CDAP:TMT10-FractionOfTotalAb" value="0.0403006"/>
            <userParam name="CPTAC-CDAP:TMT10-TotalAb" value="263916"/>
and the equivalent info in the TSV file (which has extension .cap.psm):

TMT10-126      TMT10-127N     ...  TMT10-131      TMTFlags  TMT10-TotalAb  TMT10-FractionOfTotalAb
11727.1/-0.20  6852.88/-0.16  ...  5462.02/-0.32  I         51338.6        0.0311579
1156.89/-0.30  1504.69/-0.32  ...  2789.99/-0.33  MI        13834.8        0.0955993
17484.2/-0.22  12126.1/-0.32  ...  11460.1/-0.25  I         92237.6        0.139702

If this is what you're referring to, that information is not something that MS-GF+ extracts. MS-GF+ only analyzes the MS/MS spectra to identify peptides. The Common Data Analysis Pipeline (CDAP) uses MS-GF+ along with other tools to extract all of this information, then packages it into .mzid files and .cap.psm files.
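
For anyone who wants to check whether a given .mzid file carries these values, the following is a minimal Python sketch that reads the userParam fragment quoted above. It is an illustration only, not part of MS-GF+ or CDAP. The `<item>` wrapper element and the function name `read_tmt_params` are made up for the example, and the second number after the slash is kept as-is because the thread does not say what it represents.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the thread. The <item> root is a hypothetical wrapper
# added only so that the fragment parses as a standalone document.
SNIPPET = """<item>
  <userParam name="CPTAC-CDAP:TMT10-126" value="36232.4/-0.20"/>
  <userParam name="CPTAC-CDAP:TMT10-127N" value="30440/-0.42"/>
  <userParam name="CPTAC-CDAP:TMT10-127C" value="29381.2/-0.42"/>
  <userParam name="CPTAC-CDAP:TMT10-128N" value="7019.51/-0.40"/>
  <userParam name="CPTAC-CDAP:TMT10-128C" value="24031.9/-0.22"/>
  <userParam name="CPTAC-CDAP:TMT10-129N" value="13974/-0.17"/>
  <userParam name="CPTAC-CDAP:TMT10-129C" value="8465.62/-0.44"/>
  <userParam name="CPTAC-CDAP:TMT10-130N" value="37748.8/-0.42"/>
  <userParam name="CPTAC-CDAP:TMT10-130C" value="33107.2/-0.45"/>
  <userParam name="CPTAC-CDAP:TMT10-131" value="43514.9/-0.22"/>
  <userParam name="CPTAC-CDAP:TMT10-FractionOfTotalAb" value="0.0403006"/>
  <userParam name="CPTAC-CDAP:TMT10-TotalAb" value="263916"/>
</item>"""

PREFIX = "CPTAC-CDAP:TMT10-"


def read_tmt_params(xml_text):
    """Split TMT userParams into per-channel pairs and single-valued extras."""
    channels, extras = {}, {}
    for elem in ET.fromstring(xml_text).iter():
        # Compare the local tag name so the check also works when the
        # document declares an XML namespace (tags then look like "{ns}userParam").
        if elem.tag.rsplit("}", 1)[-1] != "userParam":
            continue
        name = elem.get("name", "")
        if not name.startswith(PREFIX):
            continue
        label = name[len(PREFIX):]
        value = elem.get("value", "")
        if "/" in value:
            # "abundance/second": the second number is stored uninterpreted.
            abundance, second = value.split("/", 1)
            channels[label] = (float(abundance), float(second))
        else:
            extras[label] = float(value)
    return channels, extras


channels, extras = read_tmt_params(SNIPPET)
for label, (abundance, second) in channels.items():
    print(f"{label:>5}  {abundance:>10.2f}  {second:+.2f}")
print("sum of channel abundances:", round(sum(a for a, _ in channels.values()), 2))
print("TMT10-TotalAb:", extras["TotalAb"])
```

In this one example the ten channel abundances add up to roughly the TMT10-TotalAb value, which is at least consistent with that field being a per-spectrum total. A file written directly by MS-GF+ should yield empty results from this function, since MS-GF+ does not extract these values.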

The equivalent open-source tool that we have for extracting reporter ion information is MASIC.

The program we have for merging the MS-GF+ results with MASIC results is the MASIC Results Merger.
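
To make the idea of merging concrete, here is a rough Python sketch of joining a PSM table to a reporter ion table on scan number. This is not the MASIC Results Merger and makes no claim about how that tool works. The column names `ScanNum`, `ScanNumber`, `Ion_126` and `Ion_127N`, the function name `merge_by_scan`, and all data values are assumptions made up for the example; check the headers of your own output files before adapting it.

```python
import csv
import io

# Made-up stand-ins for a PSM table and a reporter ion table.
# Column names and values are assumptions for illustration only.
PSM_TSV = "ScanNum\tPeptide\n1001\tPEPTIDEK\n1002\tSAMPLER\n"
REPORTER_TSV = "ScanNumber\tIon_126\tIon_127N\n1001\t100.0\t200.0\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_by_scan(psm_rows, reporter_rows,
                  psm_key="ScanNum", reporter_key="ScanNumber"):
    """Left-join reporter ion columns onto PSM rows by scan number."""
    # Index the reporter rows by integer scan number for constant-time lookup.
    reporters = {int(row[reporter_key]): row for row in reporter_rows}
    merged = []
    for psm in psm_rows:
        combined = dict(psm)
        match = reporters.get(int(psm[psm_key]), {})
        # Copy every reporter column except the duplicate scan number key.
        combined.update({k: v for k, v in match.items() if k != reporter_key})
        merged.append(combined)
    return merged


merged = merge_by_scan(read_tsv(PSM_TSV), read_tsv(REPORTER_TSV))
for row in merged:
    print(row)
```

PSMs without a matching scan keep their original columns and simply gain no reporter ion columns, so no identifications are dropped by the join.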

rkimoakbioinformatics commented 2 years ago

Thanks for your reply. Yes, I meant the TMT reporter ion abundances. I'll check out MASIC.

rkimoakbioinformatics commented 2 years ago

@alchemistmatt oh and thank you so much for the detailed explanation! It helped a lot.