MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

How to reduce MS-GF+ search time #108

Open Jokendo-collab opened 3 years ago

Jokendo-collab commented 3 years ago

I am running MS-GF+ and it takes days to finish. How can I shorten the sequence database search time? I have increased the number of threads to 32 and the RAM to 32 GB, but the search time has not dropped as I expected. Could you kindly help me figure out how to speed it up? I have 70 raw files, which have taken five days to run on an HPC cluster.

alchemistmatt commented 3 years ago

Search times are dependent on three things:

I suspect you are performing a partially tryptic search on a large FASTA (200 MB or larger) and using several dynamic mods. I suggest you change your search to be a fully tryptic search (ntt = 2) and run a test search on one of your 70 .raw files that already finished. Compare the results: did the partially tryptic search reveal more than ~3% additional identifications?
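
The comparison suggested above can be sketched in a few lines of Python. This is only an illustration: the function name, the peptide lists, and the idea of comparing unique peptides that passed your FDR cutoff are assumptions, not part of MS-GF+.

```python
def additional_id_percent(fully_tryptic_ids, partially_tryptic_ids):
    """Percent of extra identifications gained by the partially tryptic search,
    relative to the fully tryptic search of the same file."""
    baseline = set(fully_tryptic_ids)
    if not baseline:
        raise ValueError("fully tryptic search returned no identifications")
    extra = set(partially_tryptic_ids) - baseline
    return 100.0 * len(extra) / len(baseline)


# Hypothetical peptide lists, e.g. unique peptides passing a 1% FDR cutoff.
full = ["PEPTIDEK", "SAMPLER", "TESTK", "ANOTHERK"]
partial = full + ["EMITRYPTIC"]

gain = additional_id_percent(full, partial)
print(f"{gain:.1f}% additional identifications")
```

If the gain on a representative file stays below roughly 3%, the slower partially tryptic search is probably not worth its cost.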

I am, of course, just guessing here. You'll need to tell us:

  1. The number of MS/MS scans in one of your representative .raw files
  2. The size of your FASTA file, in MB
  3. What arguments you're using for searching, especially NTT
  4. Which dynamic modifications you're searching for (mod name and affected residues)
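
The first two numbers can be gathered with a short script. The sketch below assumes the spectra are in mzML and that each `spectrum` element carries its "ms level" cvParam (PSI-MS accession MS:1000511) as a direct child; files that put it in a referenceable param group would need extra handling. The function names are made up for this example.

```python
import io
import os
import xml.etree.ElementTree as ET

MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS term "ms level"


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_msms_scans(mzml_source):
    """Count spectra whose 'ms level' cvParam equals 2, streaming the file."""
    count = 0
    for _event, elem in ET.iterparse(mzml_source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        for child in elem:
            if (local_name(child.tag) == "cvParam"
                    and child.get("accession") == MS_LEVEL_ACCESSION
                    and child.get("value") == "2"):
                count += 1
                break
        elem.clear()  # drop the spectrum's children (peak data) once counted
    return count


def file_size_mb(path):
    """Size of a file (e.g. the FASTA) in MB."""
    return os.path.getsize(path) / (1024 * 1024)


# Tiny in-memory stand-in for a real .mzML file: one MS1 and two MS2 spectra.
demo = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>
<spectrum id="s1"><cvParam accession="MS:1000511" name="ms level" value="1"/></spectrum>
<spectrum id="s2"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
<spectrum id="s3"><cvParam accession="MS:1000511" name="ms level" value="2"/></spectrum>
</spectrumList></run></mzML>"""
print(count_msms_scans(io.BytesIO(demo)))
```

For a real file, pass the path instead of the `BytesIO` object, and call `file_size_mb` on the FASTA path.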

Jokendo-collab commented 3 years ago

Hi. Below is my code. I am using only two modifications (fixed and dynamic). The FASTA file is 1.5 GB in size.

```bash
msgfplus=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus.jar
mods=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus_Mods1.txt
fastadb=/scratch/oknjav001/bal_mzML_raw_files/humanDatabase/fullmicribiome.fasta

# ============ baseline ============
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/baseline
for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done

# ============ bcg ============
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/bcg
for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done
```

alchemistmatt commented 3 years ago

You are using -ntt 2, so that's good. Please paste the contents of searchEngine/MSGFPlus_Mods1.txt here.

The big problem is that 1.5 GB FASTA file. I'm not sure that 16 GB is enough for it; hopefully it is. Provided Java does not report an out-of-memory exception, there really isn't much that can be done to speed up the search: a 1.5 GB FASTA file is very large and will take time to search. The only option would be to remove any dynamic mods in MSGFPlus_Mods1.txt (which is why I'm curious what it has).

Splitting the 1.5 GB FASTA file into smaller chunks (using https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter ) is an option, but that won't speed up the overall search time; it's really only useful if either Java is running out of memory, or if you're able to run multiple copies of MS-GF+ simultaneously, ideally on different systems.
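
To show what splitting a FASTA file amounts to, here is a minimal Python sketch. It is not the Fasta-File-Splitter and makes no claim to match its behavior: it simply deals records out round-robin, and the function names are invented for the example. It also keeps every chunk in memory, which is fine for a demonstration but not for a 1.5 GB file, where each chunk should be written to its own output file as records are read.

```python
def read_fasta(lines):
    """Yield (header, sequence_lines) records from an iterable of text lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, seq


def split_fasta(lines, n_parts):
    """Distribute records round-robin and return n_parts chunks of FASTA text."""
    parts = [[] for _ in range(n_parts)]
    for i, (header, seq) in enumerate(read_fasta(lines)):
        parts[i % n_parts].append("\n".join([header, *seq]))
    return ["\n".join(p) + "\n" if p else "" for p in parts]


# Three made-up protein records split into two parts.
demo = """>P1
MKTAYIAK
>P2
GSHMLEDP
QQRT
>P3
AAAA
""".splitlines()

for i, chunk in enumerate(split_fasta(demo, 2), start=1):
    print(f"--- part {i} ---")
    print(chunk, end="")
```

Each part would then be searched separately against every spectrum file, so the total work does not shrink; only the memory needed per search and the opportunity to run searches in parallel change.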

alchemistmatt commented 3 years ago

Ah, I just noticed in #10 that the software is, in fact, crashing, and you need a copy of the Fasta-File-Splitter binary (which does work on Linux via Mono -- I just tested it).

Here you go:

Note that the Fasta-File-Splitter is a VB.NET program (while most of our software is C#). Thus, you need a version of Mono new enough to support VB.NET. Mono has supported it for 6+ years, but package managers for older Linux distros might ship an older version. See https://www.mono-project.com/download/stable/

You will split the FASTA file (probably into 10 parts), then run MS-GF+ 10 times for each .mzML file. Once you have the .mzid files from all of the searches, you will need to re-combine them and re-compute EValues. For that, use the MzIdMerger:

Jokendo-collab commented 3 years ago

@alchemistmatt this is the information in my modification file:

```
NumMods=2

C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C

# Variable Modifications (default: none)
O1,M,opt,any,Oxidation # Oxidation M
15.994915,M,opt,any,Oxidation # Oxidation M (mass is used instead of CompositionStr)
H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.
C2H3NO,*,opt,N-term,Carbamidomethyl # Variable Carbamidomethyl N-term
H-2O-1,E,opt,N-term,Glu->pyro-Glu # Pyro-glu from E
H-3N-1,Q,opt,N-term,Gln->pyro-Glu # Pyro-glu from Q
C2H2O,*,opt,Prot-N-term,Acetyl # Acetylation Protein N-term
C2H2O1,K,opt,any,Acetyl # Acetylation K
CH2,K,opt,any,Methyl # Methylation K
HO3P,STY,opt,any,Phospho # Phosphorylation STY
```
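
A quick way to see how many dynamic modifications a file like this actually enables is to parse it. The Python sketch below assumes the comma-separated layout visible in the entries above (mass or composition, residues, `fix` or `opt`, position, name, with `#` starting a comment); the function name is hypothetical and the parser is deliberately lenient.

```python
def summarize_mods(text):
    """Return (NumMods value, fixed mods, dynamic mods) from mods-file text."""
    num_mods = None
    fixed, dynamic = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.lower().startswith("nummods="):
            num_mods = int(line.split("=", 1)[1])
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            continue  # not a modification entry in the assumed layout
        _mass, residues, mod_type, position, name = fields
        if mod_type.lower() == "fix":
            fixed.append((name, residues, position))
        elif mod_type.lower() == "opt":
            dynamic.append((name, residues, position))
    return num_mods, fixed, dynamic


# A shortened excerpt of the file posted above.
demo = """NumMods=2
C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C
O1,M,opt,any,Oxidation # Oxidation M
H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.
HO3P,STY,opt,any,Phospho # Phosphorylation STY
"""

num_mods, fixed, dynamic = summarize_mods(demo)
print(f"NumMods={num_mods}, fixed={len(fixed)}, dynamic={len(dynamic)}")
for name, residues, position in dynamic:
    print(f"  dynamic: {name} on {residues} ({position})")
```

Run against the full file, this lists every active `opt` line, which makes it easy to spot dynamic modifications that were left enabled unintentionally.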

ATPs commented 3 years ago

Comet can run fast after indexing the database, and the indexed database includes the modifications. I guess MS-GF+ could be much faster if modifications were included in the index step and sorted properly.

FarmGeek4Life commented 3 years ago

@ATPs Implementing such an idea would be a significant amount of work.