Jokendo-collab commented 3 years ago

I am running MS-GF+ and it takes days to finish. How can I shorten the sequence database search time? I have increased the number of threads to 32 and the RAM to 32 GB, but the search time has not dropped as I expected. Could you kindly help me figure out how to achieve this? I have 70 raw files, which have taken five days to run on an HPC cluster.

alchemistmatt commented 3 years ago

Search times are dependent on three things:

Fully tryptic vs. partially tryptic search
FASTA file size
Number of dynamic modifications

I suspect you are performing a partially tryptic search on a large FASTA (200 MB or larger) and using several dynamic mods. I suggest you change your search to be a fully tryptic search (ntt = 2) and run a test search on one of your 70 .raw files that already finished. Compare the results: did the partially tryptic search reveal more than ~3% additional identifications?
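To make the ~3% comparison concrete, here is a minimal sketch that computes the percentage of additional identifications between two searches. The function name `pct_gain` and the example counts are hypothetical; how you count identifications (for example, PSMs passing your FDR cutoff) is up to you.

```shell
#!/usr/bin/env bash
# pct_gain FULL PARTIAL
# Prints the percent increase in identifications when going from a fully
# tryptic search (FULL) to a partially tryptic search (PARTIAL).
pct_gain() {
    awk -v full="$1" -v partial="$2" 'BEGIN {
        if (full <= 0) { print "full count must be > 0" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", (partial - full) / full * 100
    }'
}

# Hypothetical counts, for illustration only
pct_gain 10000 10250
```

If the printed value is below roughly 3, the partially tryptic search is probably not worth its extra run time.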

I am, of course, just guessing here. You'll need to tell us:

The number of MS/MS scans in one of your representative .raw files
- Use MSFileInfoScanner to help determine scan counts: https://github.com/PNNL-Comp-Mass-Spec/MS-File-Info-Scanner/releases
- When running, use options /SS /QC /DI
- /LC is also useful
The size of your FASTA file, in MB
What arguments you're using for searching, especially NTT
Which dynamic modifications you're searching for (mod name and affected residues)
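If MSFileInfoScanner is not at hand, a rough version of the first two numbers can be pulled from the files with standard tools. This is only a sketch: the function names are hypothetical, and the scan count assumes the mzML writer puts one `cvParam` element per line (as converted files commonly do) and tags MS level with accession MS:1000511.

```shell
#!/usr/bin/env bash
# Rough MS/MS scan count for an mzML file: counts lines carrying the
# "ms level" cvParam (MS:1000511) with value 2. Assumes one cvParam per line.
count_msms_scans() {
    grep -c 'accession="MS:1000511".*value="2"' "$1"
}

# Number of protein entries in a FASTA file (header lines start with ">").
count_fasta_entries() {
    grep -c '^>' "$1"
}

# FASTA size in MB (bytes / 1024^2), printed with one decimal place.
fasta_size_mb() {
    wc -c < "$1" | awk '{ printf "%.1f\n", $1 / (1024 * 1024) }'
}
```

For example, `count_msms_scans sample.mzML` and `fasta_size_mb database.fasta`, where both file names are placeholders for your own files.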

Jokendo-collab commented 3 years ago

Hi, below is my code. I am using only two modifications (fixed and dynamic). The FASTA file is 1.5 GB in size.

`msgfplus=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus.jar
mods=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus_Mods1.txt
fastadb=/scratch/oknjav001/bal_mzML_raw_files/humanDatabase/fullmicribiome.fasta

#============baseline==================================
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/baseline
for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done;

#============================bcg=================================
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/bcg
for mzml in *.mzML
do
    java -Xmx16G -jar "$msgfplus" -s "$mzml" -d "$fastadb" -mod "$mods" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done;`

alchemistmatt commented 3 years ago

You are using -ntt 2, so that's good. Please paste the contents of searchEngine/MSGFPlus_Mods1.txt here.

The big problem is that 1.5 GB FASTA file. I'm not sure that 16 GB is enough for it; hopefully it is. Provided Java does not report an out-of-memory exception, there really isn't much that can be done to speed up the search: a 1.5 GB FASTA file is very large and will take time to search. The only option would be to remove any dynamic mods in MSGFPlus_Mods1.txt (which is why I'm curious what it has).

Splitting the 1.5 GB FASTA file into smaller chunks (using https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter ) is an option, but that won't speed up the overall search time; it's really only useful if Java is running out of memory, or if you're able to run multiple copies of MS-GF+ simultaneously, ideally on different systems.

alchemistmatt commented 3 years ago

Ah, I just noticed in #10 that the software is, in fact, crashing, and you need a copy of the Fasta-File-Splitter binary (which does work on Linux via Mono -- I just tested it).

Here you go:

https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter/releases/

Note that the Fasta-File-Splitter is a VB.NET program (while most of our software is C#). Thus, you need a version of Mono new enough to support VB.NET (it has had support for 6+ years, but package managers for older Linux distros might have an old version of Mono). See https://www.mono-project.com/download/stable/

You will split the FASTA file (probably into 10 parts), then run MS-GF+ 10 times for each .mzML file. Once you have the .mzid files from all of the searches, you will need to re-combine them and re-compute EValues. For that, use the MzidMerger:

https://github.com/PNNL-Comp-Mass-Spec/MzidMerger/releases/
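The middle step above (one MS-GF+ run per .mzML file and FASTA part) can be sketched as a small command generator. This is an illustration under assumptions, not a tested pipeline: the function name, the directory layout, and the `<sample>_<part>.mzid` output naming are hypothetical, and it assumes the split parts end in `.fasta`. The search settings are copied from the loop posted earlier in this thread, with `-o` added so each run writes to its own file. The splitting and merging steps are left out on purpose; see the Fasta-File-Splitter and MzidMerger documentation for their options.

```shell
#!/usr/bin/env bash
# Print one MS-GF+ command per (mzML file, FASTA part) pair.
# Usage: print_split_searches MSGFPLUS_JAR MODS_FILE PARTS_DIR MZML_DIR OUT_DIR
print_split_searches() {
    jar=$1; mods=$2; parts_dir=$3; mzml_dir=$4; out_dir=$5
    for mzml in "$mzml_dir"/*.mzML; do
        [ -e "$mzml" ] || continue          # skip if the glob matched nothing
        base=$(basename "$mzml" .mzML)
        for part in "$parts_dir"/*.fasta; do
            [ -e "$part" ] || continue
            part_name=$(basename "$part" .fasta)
            # -o gives each (sample, part) pair a distinct output file so
            # results from different FASTA parts do not overwrite each other.
            printf 'java -Xmx16G -jar "%s" -s "%s" -d "%s" -mod "%s" -o "%s" -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1\n' \
                "$jar" "$mzml" "$part" "$mods" "$out_dir/${base}_${part_name}.mzid"
        done
    done
}
```

Printing the commands rather than running them lets you review them first, then pipe them to `bash` or submit each line as a separate HPC job so that several searches run at the same time.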

Jokendo-collab commented 3 years ago

@alchemistmatt this is the information in my modification file:

NumMods=2
C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C
# Variable Modifications (default: none)
O1,M,opt,any,Oxidation # Oxidation M
#15.994915,M,opt,any,Oxidation # Oxidation M (mass is used instead of CompositionStr)
#H-1N-1O1,NQ,opt,any,Deamidated # Negative numbers are allowed.
#C2H3NO,*,opt,N-term,Carbamidomethyl # Variable Carbamidomethyl N-term
#H-2O-1,E,opt,N-term,Glu->pyro-Glu # Pyro-glu from E
#H-3N-1,Q,opt,N-term,Gln->pyro-Glu # Pyro-glu from Q
#C2H2O,*,opt,Prot-N-term,Acetyl # Acetylation Protein N-term
#C2H2O1,K,opt,any,Acetyl # Acetylation K
#CH2,K,opt,any,Methyl # Methylation K
#HO3P,STY,opt,any,Phospho # Phosphorylation STY

ATPs commented 3 years ago

Comet runs fast after indexing the database, and the indexed database includes the modifications. I think MS-GF+ could be much faster if modifications were included in the indexing step and sorted properly, I guess...

FarmGeek4Life commented 3 years ago

@ATPs Implementing such an idea would be a significant amount of work.

How to reduce MS-GF+ search time #108
