MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

MSGFPlus throws java.io.EOFException #104

Closed VladimirRoudko closed 4 years ago

VladimirRoudko commented 4 years ago

Hello,

I have run into an issue where the search pipeline throws java.io.EOFException when searching with MS-GF+. My configuration: Ubuntu, command-line version of the search/PeptideShaker installation, 10 threads per run, 256 GB total RAM, search against the human proteome (+ decoys), prec_tol 0.01 Da, frag_tol 0.5 Da.

Just FYI:

The error is inconsistent: some *.mgf files are processed by MS-GF+ without problems, while others trigger this exception. I am puzzled as to what could be causing it.

Here is the full STDOUT of the command:

ms-gf+ command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -Xmx20G -jar /nfs/scratch/vova/compomics/pipeline/protein/compomics_result/04CPTAC_COprospective_W_VU_20150915_05CO006/SearchGUI-3.3.20/resources/MS-GF+/MSGFPlus.jar -s /nfs/scratch/vova/compomics/pipeline/protein/compomics_result/04CPTAC_COprospective_W_VU_20150915_05CO006/mgf_input/04CPTAC_COprospective_W_VU_20150915_05CO006_f01.mgf -d /nfs/scratch/vova/compomics/pipeline/protein/peptide_database/human.proteome.10_concatenated_target_decoy.fasta -o /nfs/scratch/vova/compomics/pipeline/protein/compomics_result/04CPTAC_COprospective_W_VU_20150915_05CO006/search_output/.SearchGUI_temp/04CPTAC_COprospective_W_VU_20150915_05CO006_f01.msgf.mzid -t 0.01Da -tda 0 -mod /nfs/scratch/vova/compomics/pipeline/protein/compomics_result/04CPTAC_COprospective_W_VU_20150915_05CO006/SearchGUI-3.3.20/resources/MS-GF+/params/Mods.txt -minCharge 2 -maxCharge 4 -inst 3 -thread 10 -m 3 -e 1 -ntt 2 -protocol 0 -minLength 8 -maxLength 30 -n 10 -addFeatures 0 -ti 0,1

Mon Jun 22 09:47:57 EDT 2020 Processing 04CPTAC_COprospective_W_VU_20150915_05CO006_f01.mgf with MS-GF+.

MS-GF+ Release (v2018.04.09) (9 April 2018)
Loading database files...
Warning: Sequence database contains 164 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 28 counts of letter 'X', which does not correspond to an amino acid.
Creating the suffix array indexed file... Size: 149002343
AlphabetSize: 28
Suffix creation: 0.00% complete.
Suffix creation: 6.71% complete.
Suffix creation: 13.42% complete.
Suffix creation: 20.13% complete.
Suffix creation: 26.85% complete.
Suffix creation: 33.56% complete.
Suffix creation: 40.27% complete.
Suffix creation: 46.98% complete.
Suffix creation: 53.69% complete.
Suffix creation: 60.40% complete.
Suffix creation: 67.11% complete.
Suffix creation: 73.82% complete.
Suffix creation: 80.54% complete.
Suffix creation: 87.25% complete.
Suffix creation: 93.96% complete.
Sorting 0.00% complete.
Sorting 5.81% complete.
Sorting 11.62% complete.
Sorting 17.43% complete.
Sorting 23.24% complete.
Sorting 29.05% complete.
Sorting 34.86% complete.
Sorting 40.67% complete.
Sorting 46.48% complete.
Sorting 52.29% complete.
Sorting 58.10% complete.
Sorting 63.91% complete.
Sorting 69.73% complete.
Sorting 75.54% complete.
Sorting 81.35% complete.
Sorting 87.16% complete.
Sorting 92.97% complete.
Sorting 98.78% complete.
java.io.EOFException
    at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
    at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.computeNumDistinctPeptides(CompactSuffixArray.java:195)
    at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:112)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:207)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

Mon Jun 22 09:49:57 EDT 2020 MS-GF+ finished for /nfs/scratch/vova/compomics/pipeline/protein/compomics_result/04CPTAC_COprospective_W_VU_20150915_05CO006/mgf_input/04CPTAC_COprospective_W_VU_20150915_05CO006_f01.mgf (2 minutes 586.0 milliseconds).

VladimirRoudko commented 4 years ago

Also, I am allocating 20 GB of RAM via JAVA_OPTS="-Xmx20G".

alchemistmatt commented 4 years ago

MS-GF+ indexes the FASTA file before it starts searching spectra. Note that once a FASTA file has been indexed, the next time you run MS-GF+ and use the same FASTA file, the existing index files will be re-used, provided you didn't delete them.

My guess is that you're running multiple copies of MS-GF+ simultaneously and multiple copies are generating index files in the same location on disk. The exception details indicate that the code is reading an index file and a nlcp file and counting peptides. The error occurs when it tries to read an integer, but the read pointer has reached the end of the file. This could happen if one instance creates the file, then tries to re-read it, while another instance is re-creating the file.
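To make the failure mode concrete, here is a minimal sketch of how `DataInputStream.readInt` behaves when a reader expects more integers than the file currently holds. It does not use any MS-GF+ code: the class `TruncatedIndexDemo`, its methods, and the temp file are hypothetical stand-ins for an index file that another process has only partly written.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; it is not part of MS-GF+.
final class TruncatedIndexDemo {

    // Write `count` 4-byte integers, standing in for entries of an index file.
    static void writeInts(Path file, int count) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (int i = 0; i < count; i++) {
                out.writeInt(i);
            }
        }
    }

    // Try to read `expected` integers; return how many were read before EOF.
    static int readInts(Path file, int expected) throws IOException {
        int read = 0;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            for (int i = 0; i < expected; i++) {
                in.readInt(); // throws EOFException when fewer than 4 bytes remain
                read++;
            }
        } catch (EOFException e) {
            System.out.println("EOFException after " + read + " of " + expected + " integers");
        }
        return read;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo-index", ".bin");
        try {
            writeInts(file, 5);  // the file on disk holds only 5 entries so far
            readInts(file, 10);  // the reader assumes the file is complete
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The same exception appears whether the file was truncated, is still being written, or was just re-created by another process, which would fit the inconsistent failures across runs.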

Also, if possible, you should be using the latest release of MS-GF+, available at https://github.com/MSGFPlus/msgfplus/releases

Note that you can create the index files without running MS-GF+; see: https://msgfplus.github.io/msgfplus/BuildSA.html
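One way to use this in a pipeline is to build the index once, wait for it to finish, and only then start the parallel searches. The sketch below assembles and runs such a command from Java. The helper class `PrebuildIndex` is hypothetical, `-Xmx4G` is an assumed heap size, and the class name `edu.ucsd.msjava.msdbsearch.BuildSA` with its `-d` flag reflects my reading of the BuildSA page, so please check it against that page for your release.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; it is not part of MS-GF+ or SearchGUI.
final class PrebuildIndex {

    // Assemble the BuildSA command line. The class name and the -d flag are
    // assumptions based on the BuildSA page; verify them for your release.
    static List<String> buildSaCommand(Path jar, Path fasta) {
        return List.of(
                "java", "-Xmx4G",               // assumed heap size
                "-cp", jar.toString(),
                "edu.ucsd.msjava.msdbsearch.BuildSA",
                "-d", fasta.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: PrebuildIndex <MSGFPlus.jar> <database.fasta>");
            return;
        }
        List<String> cmd = buildSaCommand(Paths.get(args[0]), Paths.get(args[1]));
        // Run once, to completion, before any parallel searches are started.
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("BuildSA exited with code " + exit);
    }
}
```

Because the index already exists when the searches start, each instance should only need to read it rather than create it.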

Regarding memory requirements, as the FASTA file size increases, MS-GF+ needs more memory. For the human proteome, you could probably get by with 4 GB and definitely 8 GB. You're using 20 GB, which is far more than needed, but doesn't hurt anything either.

VladimirRoudko commented 4 years ago

Hi! Thanks for the quick response! Indeed, I do run multiple independently installed copies of MS-GF+ simultaneously, all linked to the same copy of the reference database, which may well create the conflicts. I will test the run with a separate copy of the reference database for each instance and share the results.

Thank you! Vladimir

VladimirRoudko commented 4 years ago

Hello Matt,

Indeed, the problem was that the index files were being produced in the same place by multiple independent copies of MS-GF+. Running the pipeline with independent copies of the same reference proteome solves the issue. Thank you for the help; I am closing the issue.

Vladimir
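For anyone hitting the same problem, the workaround above can be sketched as follows: each run gets its own directory with a private copy of the FASTA file, so the index files are written next to that copy and no two MS-GF+ instances share them. The class `PerRunDatabase` and the directory layout are hypothetical, and the sketch assumes that the index files are created alongside the FASTA file passed to `-d`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; it is not part of MS-GF+ or SearchGUI.
final class PerRunDatabase {

    // Copy the FASTA into a directory owned by one run and return the copy's path.
    static Path copyForRun(Path fasta, Path runDir) throws IOException {
        Files.createDirectories(runDir);
        Path copy = runDir.resolve(fasta.getFileName());
        return Files.copy(fasta, copy, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PerRunDatabase <database.fasta> <run-directory>");
            return;
        }
        Path copy = copyForRun(Paths.get(args[0]), Paths.get(args[1]));
        // Pass this path to the -d option of the MS-GF+ instance for this run.
        System.out.println(copy);
    }
}
```

This trades disk space and repeated indexing time for isolation; building the index once before starting the parallel runs is the cheaper alternative when the runs can share a database.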