MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.
Other
77 stars 36 forks source link

After removal of some sequences from db, no more matching _other_ sequences. #78

Open glormph opened 5 years ago

glormph commented 5 years ago

I have a pipeline where I search some "novel" peptides from a genomic translation, and concatenate to this DB a "normal" ENSEMBL DB to catch non-novel peptides. The strange thing is, I recently (MSGFPlus v.2019.07.03) encountered a situation where I found a spectrum that, when searched with this DB, matched to some translated and some known peptides. Fine, until I also searched with a slightly modified DB (removed some translated peptides, making the search space smaller, but kept all the ENSEMBL entries). This led for the mentioned spectrum to a PSM with the same peptide sequence, but now it did not match to any ENSEMBL protein anymore.

The only explanation I could guess was that:

I don't know enough about the inner workings of MSGF+ to be sure if this is the expected behaviour. I uploaded two minimial example DBs. Hope this is useful or otherwise easy to explain and close :). normaldb.fa.txt smalldb.fa.txt

glormph commented 5 years ago

I encountered a similar situation in another dataset, did some more looking and now wonder about this: if the search is run with -ntt 2, how does MSGF create peptides which are: <PeptideEvidence dBSequence_ref="DBSeq578011221" peptide_ref="Pep_PGGGGEG" start="3" end="9" pre="R" post="G" isDecoy="false" id="PepEv_578011223_PGGGGEG_3"/> where post=G?

I thought it merely included the sequence because it is also contained as a fully tryptic sequence in the same DB somewhere else. But I can't fit this assumption with the absence of other sequence-matching peptides in the result (which are also semi or non-tryptic in the fasta database).

alchemistmatt commented 5 years ago

I don't see PGGGGEG in smalldb.fa.txt or normaldb.fa.txt If your FASTA file has something like this

>decoy_chr1_212897586_0000000_- chr1_212897586_0000000_-
ARPGGGGEG

I could possibly understand why MS-GF+ would identify PGGGGEG even when searching only for fully tryptic peptides using -ntt 2, since PGGGGEG appears at the end of the protein entry.

Another possibility is that two peptides got the same score for a spectrum, and one was truly fully tryptic. MS-GF+ will still (I believe) write out both peptides since they got the same score. To diagnose this fully, I'd need, at a minimum, the full .mzid file. Even better, the full input file, full FASTA file, and full set of command line args you used.

Finally, I will add that MS-GF+ works better with larger FASTA files, with a minimum of a few hundred proteins. The population statistics start to get wonky when you have very small FASTA files like the smalldb.fa.txt or normaldb.fa.txt files that you attached.

glormph commented 5 years ago

Thank you for the information, and sorry for the confusion, indeed the PGGGGEG peptide (which I will now focus on) was from a second run, and earlier I had added those databases as some kind of illustration/test case but should've realized that wasn't enough for any conclusion.

The real DB used is large, a 2GB target/decoy concatenated .fa file, the .mzML is almost 3GB, and they generate a 200MB .mzid result. I cannot upload them here (10MB limit) but I can see if I can put them somewhere else so you could download them. In the meantime, I attach an excerpt of the .mzid and .fa with the relevant matches (just in case, I know you asked for the full one).

I used version 2019.07.03, the command line was (bioconda wrapper, inside it is the normal java -jar MSGFPlus.jar):

msgf_plus -Xmx31335M -d td_concat.fa -s spectra.mzML -o "result.mzid" -thread 30 -mod TMT10_oxid_carba.txt -tda 0 -t 10.0ppm -ti -1,2 -m 0 -inst 3 -e 1 -protocol 4 -ntt 2 -minLength 7 -maxLength 50 -minCharge 2 -maxCharge 6 -n 1 -addFeatures 1

Here are the fasta and mzid files, the mzid only contains only one match (the PGGGGEG peptide) but in the fasta I have simply grepped for all occurrences of that peptide. One is fully-tryptic if you cut before proline (RPGGGGEG), most of those are completely non-tryptic except for three others which are either KPGGGGEGVMDWG, APGGGGEG, or in the case I dont understand why it is matched, RRPGGGGEGGACEE...etc PGGGGEG.fa.txt PGGGGEG.mzid.txt

glormph commented 1 year ago

I don't quite remember how I worked around this last time, but this or a similar issue popped up recently for me:

In two raw files, each with its own genomics-derived-peptide database, and canonical ENSEMBL concatenated, I got a peptide match: VTIAQGGVLPNIQAVLLP

The matched peptide in run 1:

>novpep_run1
LLGKVTIAQGGVLPNIQAVLLP

In run nr. 2:

>novpep_run2
VTIAQGGVLPNIQAVLLP

Now, in run nr. 1, but NOT in nr. 2 , it also reports matching to two canonical proteins, both of which contain the peptide in this area ...LLGKVTIAQGGVLPNIQAVLLPKKT... (full seq below). I do not understand why it doesn't also match in run nr.2, and since we use --ntt2 I'd almost expect that it wouldnt match canonical at all (since there the C-term of the peptide would be a K). At the moment I handle this by just removing the peptide, but I'd like to understand what is going on :)

My guess is that it has something to do with the run nr.1 peptide having the same N-terminal residue (pre=K) as the canonical protein, where in run nr.2 there is no residue (pre='') but I'm not sure and I don't know why this'd be important for matching.

MSGF was run as below:

msgf_plus -d db.fa -s spectra.mzML -o spectra.mzid -mod mods.txt   -tda 0 -t 10.0ppm -ti -1,2 -m 0 -inst 3 -e 1 -protocol 0     -minLength 8 -maxLength 40     -ntt 2 -minCharge 2 -maxCharge 6 -n 1 -addFeatures 1

Canonical proteins:

>ENSMUSP00000088288.3
MSGRGKQGGKARAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYMAAVLEYLT
AEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKK
TESHKAKSK
>ENSMUSP00000077814.5
MSGRGKQGGKARAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYMAAVLEYLT
AEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKK
TESHHKAKGK
glormph commented 1 year ago

I did some testing and found that it somehow depends on the preceding amino acid:

Since the canonical proteins are not tryptically matched (they have a lysine at peptide C-term), I experimented with the number of tolerable termini, instead of using 2.

more hits...



- Running the first case with `--ntt 0` gives the exact same result, where I expected maybe also to hit an existing canonical protein which has `pre='G'`.

My theory is that in the first case, this peptide isn't found tryptically in the canonical DB (correctly, in canonical it would end in a lysine). But somehow the addition of an N-term K removes this rule? Because it's trypsinized? I hope someone can explain what is going on here :)