bzhanglab / PepQuery

PepQuery: a targeted peptide search engine
http://pepquery.org
GNU General Public License v3.0
9 stars 0 forks source link

isobaric substitutions in PepQuery #7

Closed Seb-Leb closed 5 years ago

Seb-Leb commented 5 years ago

This is a question about how PepQuery handles isobaric substitutions. We search two different spectra files for the same peptide with the same protein reference database. PepQuery indicates that a spectrum from on of the files fits better to our peptide than any proteins in the reference DB with unrestricted modification search. However in the other file, only a spectrum scored higher on a protein in the reference DB with a deamidation of N (isobaric to D) is assigned. We expected to see the PSM with the modified reference peptide in the first analysis as well but it doesn't show.

The Genome Reasearch Publication clearly describes how the scores are calculated but I have trouble understanding how the score between an isobaric PTM peptide and the queried peptide are different. Since the deamidation of N is the exact mass of D should both PSMs not be assigned the same score?

My mgf files with PepQuery results can be downloaded from here: https://drive.google.com/file/d/145Ne9qpGQTu_fpFg0jXiS3-DVHXDpJm6/view?usp=sharing (Fr4S31 is the one where our peptide is most confident and Fr2S31 is the one where a better matched N[deamidated]/D reference protein)

thank you!

wenbostar commented 5 years ago

Could you send me your command line and the reference database?

wenbostar commented 5 years ago

By the way, it looks your data is not centroid as shown below: image

Seb-Leb commented 5 years ago

Here is the command I run:

java -jar pepquery.jar -um -cpu 2 -pep Fr2S31.txt -db refprots.fasta -ms Fr2S31.mgf -o pepq_isobar_test/Fr2S31/

Do I have to pre-process the data? (we just converted it from raw with msconvert..)

wenbostar commented 5 years ago

What's the command line you used to convert raw to mgf?

Seb-Leb commented 5 years ago

My colleague did it using the GUI is it because of peak picking?

wenbostar commented 5 years ago

From raw MS/MS data to MGF data:

msconvert --filter "peakPicking true 1-2" --mgf *.raw

From mzML format data to MGF data:

msconvert --filter "peakPicking true 1-2" --mgf *.mzML

Seb-Leb commented 5 years ago

Ok thanks. I will try that.

wenbostar commented 5 years ago

My colleague did it using the GUI is it because of peak picking?

Yes. If you use msconvert GUI, make sure you select peak picking option: image

Seb-Leb commented 5 years ago

This seems to have solved the issue. thank you

Seb-Leb commented 5 years ago

Hello, I am sorry to re-open this issue but it seems that we observed the same behavior with files that are centroided. I added our reference dataset as well as two mgf files (H203.mgf if the file where PepQuery assings the best PSM to our peptide and Fr2S31.mgf is the file where PepQuery assigns a better (or equal) association score to a spectrum with an isobaric substitution on a reference protein).

gdrive share link

The command used is as mentioned before.

Thank you

wenbostar commented 5 years ago

It looks you used the default setting for most of the parameters. What type of MS machine was used to generate your data?

Seb-Leb commented 5 years ago

QExactive

wenbostar commented 5 years ago

Then you should set -itol 0.05 not the default value 0.6.

wenbostar commented 5 years ago

In addition, you can add parameter -hc with more stringent filtering.

Seb-Leb commented 5 years ago

How should we use the -hc falg with PepQuery 1.0? It dosen't appear in the options when calling pepquery.jar help

2019-08-23 13:55:26 [ INFO] Start analysis 2019-08-23 13:55:26 [DEBUG] -h java -Xmx2G -jar pepquery.jar Version 1.0.0 usage: Options -anno Annotation files folder for VCF/BED/GTF -c The max missed cleavages, default is 2 -cpu The number of cpus used, default is 1 -db Fasta format database file -e 1:Trypsin (default), 2:Trypsin (no P rule), 3:Arg-C, 4:Arg-C (no P rule), 5:Arg-N, 6:Glu-C, 7:Lys-C -f The frame to translate DNA sequence to protein. The right format is like this: "1,2,3,4,5,6","1,2,3","1". "0" means to keep the longest frame. In default, for each frame only the longest protein is used. -fixMod Fixed modification, the format is like : 1,2,3. Default is 6 (Carbamidomethylation(C)[57.02]) -fragmentMethod 1: CID/HCD (default), 2: ETD -h Help -i Take protein, DNA or VCF as input -itol Fragment ion m/z tolerance, default is 0.6da -m Scoring method: 1=HyperScore (default), 2=MVH -maxCharge The maximum charge to consider if the charge state is not available, default is 3 -maxVar Max number of variable modifications, default is 3 -minCharge The minimum charge to consider if the charge state is not available, default is 2 -minPeaks Min peaks in spectrum, default is 10 -minScore Minimum score to consider for peptide searching, default is 5 -ms Spectrum file used for identification, mgf format -n The number of random peptides, default is 1000 -o Output dir -pep Peptide sequence which you want to search -prefix Output file prefix -printPTM Print PTMs -t Input type: 1=>protein,2=>DNA,3=>VCF,4=>BED,5=>GTF -tol Precursor ion m/z tolerance, default is 10 -tolu The unit of precursor ion m/z tolerance, default is ppm -um Validation with unrestricted modification searching -varMod Variable modification, the format is like : 1,2,3. Default is 107 (Oxidation(M)[15.99])

wenbostar commented 5 years ago

Please try the new version which has this option.

Seb-Leb commented 5 years ago

perfect, thank you