After removal of some sequences from db, no more matching _other_ sequences.

I have a pipeline where I search some "novel" peptides from a genomic translation, and concatenate to this DB a "normal" ENSEMBL DB to catch non-novel peptides. The strange thing is, I recently (MSGFPlus v.2019.07.03) encountered a situation where I found a spectrum that, when searched with this DB, matched to some translated and some known peptides. Fine, until I also searched with a slightly modified DB (removed some translated peptides, making the search space smaller, but kept all the ENSEMBL entries). This led for the mentioned spectrum to a PSM with the same peptide sequence, but now it did not match to any ENSEMBL protein anymore.

The only explanation I could guess was that:

the exact matching peptide was DGEEEGE, some novel peptide entries had pre=K, others pre=R
none of the ENSP proteins contained this tryptically, they all had DGEEEGETESGS... etc, and the ENSP proteins all had pre=K for their matching peptides.
in the smaller DB, the remaining novel peptides all were pre=R
Now without matches to pre=R, this did for some reason not match to any of the peptides with pre=K from ENSEMBL.

I don't know enough about the inner workings of MSGF+ to be sure if this is the expected behaviour. I uploaded two minimial example DBs. Hope this is useful or otherwise easy to explain and close :). normaldb.fa.txt smalldb.fa.txt

I encountered a similar situation in another dataset, did some more looking and now wonder about this: if the search is run with -ntt 2, how does MSGF create peptides which are: <PeptideEvidence dBSequence_ref="DBSeq578011221" peptide_ref="Pep_PGGGGEG" start="3" end="9" pre="R" post="G" isDecoy="false" id="PepEv_578011223_PGGGGEG_3"/> where post=G?

I thought it merely included the sequence because it is also contained as a fully tryptic sequence in the same DB somewhere else. But I can't fit this assumption with the absence of other sequence-matching peptides in the result (which are also semi or non-tryptic in the fasta database).

I don't see PGGGGEG in smalldb.fa.txt or normaldb.fa.txt If your FASTA file has something like this

>decoy_chr1_212897586_0000000_- chr1_212897586_0000000_-
ARPGGGGEG

I could possibly understand why MS-GF+ would identify PGGGGEG even when searching only for fully tryptic peptides using -ntt 2, since PGGGGEG appears at the end of the protein entry.

Another possibility is that two peptides got the same score for a spectrum, and one was truly fully tryptic. MS-GF+ will still (I believe) write out both peptides since they got the same score. To diagnose this fully, I'd need, at a minimum, the full .mzid file. Even better, the full input file, full FASTA file, and full set of command line args you used.

Finally, I will add that MS-GF+ works better with larger FASTA files, with a minimum of a few hundred proteins. The population statistics start to get wonky when you have very small FASTA files like the smalldb.fa.txt or normaldb.fa.txt files that you attached.

Thank you for the information, and sorry for the confusion, indeed the PGGGGEG peptide (which I will now focus on) was from a second run, and earlier I had added those databases as some kind of illustration/test case but should've realized that wasn't enough for any conclusion.

The real DB used is large, a 2GB target/decoy concatenated .fa file, the .mzML is almost 3GB, and they generate a 200MB .mzid result. I cannot upload them here (10MB limit) but I can see if I can put them somewhere else so you could download them. In the meantime, I attach an excerpt of the .mzid and .fa with the relevant matches (just in case, I know you asked for the full one).

I used version 2019.07.03, the command line was (bioconda wrapper, inside it is the normal java -jar MSGFPlus.jar):

msgf_plus -Xmx31335M -d td_concat.fa -s spectra.mzML -o "result.mzid" -thread 30 -mod TMT10_oxid_carba.txt -tda 0 -t 10.0ppm -ti -1,2 -m 0 -inst 3 -e 1 -protocol 4 -ntt 2 -minLength 7 -maxLength 50 -minCharge 2 -maxCharge 6 -n 1 -addFeatures 1

Here are the fasta and mzid files, the mzid only contains only one match (the PGGGGEG peptide) but in the fasta I have simply grepped for all occurrences of that peptide. One is fully-tryptic if you cut before proline (RPGGGGEG), most of those are completely non-tryptic except for three others which are either KPGGGGEGVMDWG, APGGGGEG, or in the case I dont understand why it is matched, RRPGGGGEGGACEE...etc PGGGGEG.fa.txt PGGGGEG.mzid.txt

I don't quite remember how I worked around this last time, but this or a similar issue popped up recently for me:

In two raw files, each with its own genomics-derived-peptide database, and canonical ENSEMBL concatenated, I got a peptide match: VTIAQGGVLPNIQAVLLP

The matched peptide in run 1:

>novpep_run1
LLGKVTIAQGGVLPNIQAVLLP

In run nr. 2:

>novpep_run2
VTIAQGGVLPNIQAVLLP

Now, in run nr. 1, but NOT in nr. 2 , it also reports matching to two canonical proteins, both of which contain the peptide in this area ...LLGKVTIAQGGVLPNIQAVLLPKKT... (full seq below). I do not understand why it doesn't also match in run nr.2, and since we use --ntt2 I'd almost expect that it wouldnt match canonical at all (since there the C-term of the peptide would be a K). At the moment I handle this by just removing the peptide, but I'd like to understand what is going on :)

My guess is that it has something to do with the run nr.1 peptide having the same N-terminal residue (pre=K) as the canonical protein, where in run nr.2 there is no residue (pre='') but I'm not sure and I don't know why this'd be important for matching.

MSGF was run as below:

msgf_plus -d db.fa -s spectra.mzML -o spectra.mzid -mod mods.txt   -tda 0 -t 10.0ppm -ti -1,2 -m 0 -inst 3 -e 1 -protocol 0     -minLength 8 -maxLength 40     -ntt 2 -minCharge 2 -maxCharge 6 -n 1 -addFeatures 1

Canonical proteins:

>ENSMUSP00000088288.3
MSGRGKQGGKARAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYMAAVLEYLT
AEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKK
TESHKAKSK
>ENSMUSP00000077814.5
MSGRGKQGGKARAKAKSRSSRAGLQFPVGRVHRLLRKGNYAERVGAGAPVYMAAVLEYLT
AEILELAGNAARDNKKTRIIPRHLQLAIRNDEELNKLLGKVTIAQGGVLPNIQAVLLPKK
TESHHKAKGK

I did some testing and found that it somehow depends on the preceding amino acid:

VTIAQGGVLPNIQAVLLP in a DB with ENSEMBL canonical proteins: only the peptide itself is found:

<PeptideEvidence dBSequence_ref="DBSeq1" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="1" end="18" pre="-" post="-" isDecoy="false" id="PepEv_1_VTIAQGGVLPNIQAVLLP_1"/>

KVTIAQGGVLPNIQAVLLP (so add a K on N-term) in the DB instead - both peptide and 2 canonical proteins hit:

<PeptideEvidence dBSequence_ref="DBSeq1" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="2" end="19" pre="K" post="-" isDecoy="false" id="PepEv_2_VTIAQGGVLPNIQAVLLP_2"/>
<PeptideEvidence dBSequence_ref="DBSeq14360981" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="K" post="K" isDecoy="false" id="PepEv_14361081_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq14446351" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="K" post="K" isDecoy="false" id="PepEv_14446451_VTIAQGGVLPNIQAVLLP_101"/>

Both VTIAQGGVLPNIQAVLLP and KVTIAQGGVLPNIQAVLLP in the DB: all 4 (2 canonical) hits are reported to the scan.

Since the canonical proteins are not tryptically matched (they have a lysine at peptide C-term), I experimented with the number of tolerable termini, instead of using 2.

Running the first case with --ntt 1 - now also canonical proteins that have pre=R are matched, which makes sense, AND it reports more canonical proteins that have pre='K' (which I don't understand why they did not pop up in the earlier runs):


<PeptideEvidence dBSequence_ref="DBSeq1" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="1" end="18" pre="-" post="-" isDecoy="false" id="PepEv_1_VTIAQGGVLPNIQAVLLP_1"/>
<PeptideEvidence dBSequence_ref="DBSeq1320902" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="R" post="K" isDecoy="false" id="PepEv_1321002_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq10644463" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="R" post="K" isDecoy="false" id="PepEv_10644563_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq11675312" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="R" post="K" isDecoy="false" id="PepEv_11675412_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq11675442" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="R" post="K" isDecoy="false" id="PepEv_11675542_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq14360980" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="K" post="K" isDecoy="false" id="PepEv_14361080_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq14374552" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="K" post="K" isDecoy="false" id="PepEv_14374652_VTIAQGGVLPNIQAVLLP_101"/>
<PeptideEvidence dBSequence_ref="DBSeq14446350" peptide_ref="Pep_VTIAQGGVLPNIQAVLLP" start="101" end="118" pre="K" post="K" isDecoy="false" id="PepEv_14446450_VTIAQGGVLPNIQAVLLP_101"/>

more hits...



- Running the first case with `--ntt 0` gives the exact same result, where I expected maybe also to hit an existing canonical protein which has `pre='G'`.

My theory is that in the first case, this peptide isn't found tryptically in the canonical DB (correctly, in canonical it would end in a lysine). But somehow the addition of an N-term K removes this rule? Because it's trypsinized? I hope someone can explain what is going on here :)

MSGFPlus / msgfplus

After removal of some sequences from db, no more matching _other_ sequences. #78