Open Jokendo-collab opened 4 years ago
Search times are dependent on three things:
I suspect you are performing a partially tryptic search on a large FASTA (200 MB or larger) and using several dynamic mods. I suggest you change your search to be a fully tryptic search (ntt = 2) and run a test search on one of your 70 .raw files that already finished. Compare the results: did the partially tryptic search reveal more than ~3% additional identifications?
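The ~3% comparison above is plain arithmetic once you have the two identification counts. A minimal sketch, assuming you have already counted identifications from each search at the same FDR threshold (the helper name `percent_gain` and the example counts are made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent increase of a candidate count over a baseline.
#   $1 = identifications from the fully tryptic search (baseline)
#   $2 = identifications from the partially tryptic search (candidate)
percent_gain() {
    local baseline=$1 candidate=$2
    if [ "$baseline" -le 0 ]; then
        echo "baseline must be a positive integer" >&2
        return 1
    fi
    awk -v b="$baseline" -v c="$candidate" \
        'BEGIN { printf "%.1f\n", (c - b) / b * 100 }'
}

# Made-up counts: 10000 fully tryptic vs. 10250 partially tryptic
percent_gain 10000 10250   # prints 2.5
```

In this made-up case the gain is below ~3%, so the slower partially tryptic search would probably not be worth the extra time.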
I am, of course, just guessing here. You'll need to tell us:
/SS /QC /DI
/LC
is also useful
Hi, below is my code. I am using only two modifications (one fixed and one dynamic). The FASTA file is 1.5 GB in size. `msgfplus=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus.jar
mods=/scratch/oknjav001/bal_mzML_raw_files/databaseComparisonProject/msgfplus/searchEngine/MSGFPlus_Mods1.txt
fastadb=/scratch/oknjav001/bal_mzML_raw_files/humanDatabase/fullmicribiome.fasta
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/baseline
for mzml in *.mzML
do
java -Xmx16G -jar $msgfplus -s $mzml -d $fastadb -mod $mods -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done;
cd /scratch/oknjav001/bal_mzML_raw_files/completeBAL_mzMLfiles/mzMLfile/SIM/bcg
for mzml in *.mzML
do
java -Xmx16G -jar $msgfplus -s $mzml -d $fastadb -mod $mods -inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
done; `
You are using `-ntt 2`, so that's good. Please paste the contents of `searchEngine/MSGFPlus_Mods1.txt` here.
The big problem is that 1.5 GB FASTA file. I'm not sure that 16 GB is enough for it; hopefully it is. Provided Java does not report an out-of-memory exception, there really isn't much that can be done to speed up the search time: a 1.5 GB FASTA file is very large and will take time to search. The only option would be to remove any dynamic mods in MSGFPlus_Mods1.txt (which is why I'm curious what it has).
Splitting the 1.5 GB FASTA file into smaller chunks (using https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter ) is an option, but that won't speed up the overall search time; it's really only useful if either Java is running out of memory, or if you're able to run multiple copies of MS-GF+ simultaneously, ideally on different systems.
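To check whether Java actually ran out of memory, you can scan the console output of each search for the standard JVM error name. A minimal sketch, assuming each MS-GF+ run had its stdout/stderr redirected to a `.log` file in one directory (the loop posted above does not do this, so the log directory and the helper name `find_oom_logs` are assumptions):

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the log files that contain a Java out-of-memory error.
#   $1 = directory holding one .log file per MS-GF+ run (assumed layout)
find_oom_logs() {
    local log_dir=$1
    # grep -l prints only the names of files that contain a match;
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -l "java.lang.OutOfMemoryError" "$log_dir"/*.log 2>/dev/null || true
}
```

For example, `find_oom_logs ./logs` prints nothing if every search stayed within its `-Xmx` limit.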
Ah, I just noticed in #10 that the software is, in fact, crashing, and you need a copy of the Fasta-File-Splitter binary (which does work on Linux via Mono -- I just tested it).
Here you go:
Note that the Fasta-File-Splitter is a VB.NET program (while most of our software is C#). Thus, you need a new enough version of Mono that supports VB.NET (it has had support for 6+ years, but package managers for older Linux distros might have an old version of Mono). See https://www.mono-project.com/download/stable/
You will split the FASTA file (probably into 10 parts), then run MS-GF+ 10 times for each .mzML file. Once you have the .mzid files from all of the searches, you will need to re-combine them and re-compute EValues. For that, use the MzIdMerger:
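The "10 searches per .mzML file" step can be sketched as a nested loop. This is only an illustration: it reuses the MS-GF+ options from the script posted earlier in this thread and adds `-o` to give each result a distinct name. The function name, the directory layout, and the assumption that the split parts are `*.fasta` files in a single directory are mine; check the names the Fasta-File-Splitter actually produces. The function only prints the commands so they can be reviewed first. Paths are assumed to contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one MS-GF+ command per (FASTA part, mzML file) pair.
#   $1 = path to MSGFPlus.jar      $2 = path to the mods file
#   $3 = directory of split FASTA parts (assumed to be *.fasta)
#   $4 = directory of .mzML files  $5 = directory for the .mzid results
build_split_search_commands() {
    local jar=$1 mods=$2 parts_dir=$3 mzml_dir=$4 out_dir=$5
    local part mzml part_name sample
    for part in "$parts_dir"/*.fasta; do
        [ -e "$part" ] || continue          # skip if the glob matched nothing
        part_name=$(basename "$part" .fasta)
        for mzml in "$mzml_dir"/*.mzML; do
            [ -e "$mzml" ] || continue
            sample=$(basename "$mzml" .mzML)
            # One .mzid per sample and part, so results are not overwritten
            echo "java -Xmx16G -jar $jar -s $mzml -d $part -mod $mods" \
                 "-o $out_dir/${sample}_${part_name}.mzid" \
                 "-inst 3 -maxMissedCleavages 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1"
        done
    done
}
```

Once the printed commands look right, they can be piped to `bash` or handed to a job scheduler, which is also where running several copies at once on different nodes would come in. Merging the per-part .mzid files and re-computing EValues is a separate step and is not shown here.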
@alchemistmatt this is the information in my modification file:
NumMods=2
C2H3N1O1,C,fix,any,Carbamidomethyl # Fixed Carbamidomethyl C
O1,M,opt,any,Oxidation # Oxidation M
Comet can run fast after indexing the database, and the indexed database includes the modifications. I think MS-GF+ could be much faster if modifications were included in the indexing step and sorted properly, I guess...
@ATPs Implementing such an idea would be a significant amount of work.
I am running MS-GF+ and it takes days to finish. How can I shorten the sequence database search time? I have increased the number of threads to 32 and the RAM to 32 GB, but the search time has not dropped as I expected. Could you kindly help me figure out how to speed it up? I have 70 raw files, which have taken five days to run on an HPC cluster.