MS-GF+ crashes for large FASTA files

sampie commented 7 years ago

MS-GF+ seems to crash while analyzing data and gives the following error message:

MS-GF+ Release (v2017.01.27) (27 Jan 2017)
Loading database files...
Warning: Sequence database contains 74 counts of letter 'U', which does not correspond to an amino acid.
Loading database finished (elapsed time: 353.19 sec)
Reading spectra...
com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix "xsi" (for attribute "nil")
 at [row,col {unknown-source}]: [21,35]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:614)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:487)
    at com.ctc.wstx.sr.AttributeCollector.resolveNamespaces(AttributeCollector.java:950)
    at com.ctc.wstx.sr.InputElementStack.resolveAndValidateElement(InputElementStack.java:511)
    at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2975)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2835)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
    at org.systemsbiology.jrap.stax.ScanAndHeaderParser.parseScanAndHeader(ScanAndHeaderParser.java:114)
    at org.systemsbiology.jrap.stax.ScanAndHeaderParser.parseScanAndHeader(ScanAndHeaderParser.java:73)
    at org.systemsbiology.jrap.stax.MSXMLParser.rap(MSXMLParser.java:274)
    at edu.ucsd.msjava.parser.MzXMLSpectraMap.getSpectrumByScanNum(MzXMLSpectraMap.java:63)
    at edu.ucsd.msjava.parser.MzXMLSpectraMap.getSpectrumBySpecIndex(MzXMLSpectraMap.java:138)
    at edu.ucsd.msjava.parser.MzXMLSpectraIterator.parseNextSpectrum(MzXMLSpectraIterator.java:70)
    at edu.ucsd.msjava.parser.MzXMLSpectraIterator.next(MzXMLSpectraIterator.java:51)
    at edu.ucsd.msjava.parser.MzXMLSpectraIterator.next(MzXMLSpectraIterator.java:14)
    at edu.ucsd.msjava.msutil.SpecKey.getSpecKeyList(SpecKey.java:82)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:221)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

FarmGeek4Life commented 7 years ago

This has recently been caused by mzXML files that were created by ProteoWizard, where there is no peak data for a scan. If you use an older version of ProteoWizard (like 3.0.6xxx), it doesn't use the "xsi:nil" attribute that is causing this issue. The code needs to be updated, but that involves changing the library used to read mzXML.

(Sent as an email reply on Apr 13, 2017 to Sami Pietilä's report above.)

sampie commented 7 years ago

Hi,

I did try to reconvert with the latest ProteoWizard. The mzXML files say that the ProteoWizard version is 3.0.10738. I am still getting the same error.

Is there a workaround for this problem? I did try to use mzML, but with that I also get error messages:

Search progress: 59 / 60 tasks, 100.00% 26.43 minutes elapsed
Search progress: 60 / 60 tasks, 100.00% 26.43 minutes elapsed
Writing results...
java.lang.NullPointerException
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getDBSequence(MZIdentMLGen.java:661)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getPeptideEvidenceList(MZIdentMLGen.java:619)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.addSpectrumIdentificationResults(MZIdentMLGen.java:347)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:385)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

Search progress: 59 / 60 tasks, 100.00% 20.56 minutes elapsed
Search progress: 60 / 60 tasks, 100.00% 20.56 minutes elapsed
Writing results...
java.lang.ArrayIndexOutOfBoundsException: -108
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getSubsequence(CompactFastaSequence.java:170)
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getMatchingEntry(CompactFastaSequence.java:236)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getDBSequence(MZIdentMLGen.java:658)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getPeptideEvidenceList(MZIdentMLGen.java:619)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.addSpectrumIdentificationResults(MZIdentMLGen.java:347)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:385)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

cctsou commented 7 years ago

I am having the same issue too. Could you please point me to where I can find an old version of ProteoWizard?

FarmGeek4Life commented 7 years ago

ProteoWizard version v3.0.9134 is when the change in MSConvert was introduced, so v3.0.9133 or earlier is currently needed to produce an mzXML file with empty scans that will not produce this error. Another workaround is to convert the file to mzML using ProteoWizard and run the MS-GF+ search on the mzML file; however, that may cause problems further down whatever pipeline you are using. I don't know where you could find an old enough version of ProteoWizard online; their official archive source doesn't go back far enough anymore (it only keeps something like the latest 200 versions).

FarmGeek4Life commented 7 years ago

Mentioned by Matt Chambers on the ProteoWizard support mailing list:

The easy fix here is to filter out empty spectra (--filter "defaultArrayLength 1-"). Or use mzML. :)
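As a rough sketch of how that filter might be applied in a scripted conversion, the helper below only assembles an msconvert command line. The function name and the file paths are hypothetical; `--mzXML`, `--filter`, and `-o` are regular msconvert options, and the filter string is the one quoted above.

```python
import shlex


def build_msconvert_command(raw_file, out_dir):
    """Assemble an msconvert call that writes mzXML and drops empty spectra.

    Hypothetical helper: only the filter string comes from the thread.
    """
    return [
        "msconvert",
        raw_file,
        "--mzXML",
        "--filter", "defaultArrayLength 1-",  # keep spectra with at least one data point
        "-o", out_dir,
    ]


if __name__ == "__main__":
    cmd = build_msconvert_command("sample.raw", "converted")
    # Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list separately from running it makes it easy to check what will be executed before converting a whole batch of files.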

FarmGeek4Life commented 7 years ago

Quoting sampie: "Is there a workaround for this problem? I did try to use mzML, but with that I get also error messages:" (followed by the NullPointerException and ArrayIndexOutOfBoundsException traces shown above).

This is probably an issue with the FASTA database used. I would first try deleting all of the index files MS-GF+ creates (*.canno, *.cnlcp, *.csarr, *.cseq, *.revCat.fasta) and then re-running MS-GF+.
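A minimal sketch of that cleanup step, assuming the index files sit in the same directory as the FASTA file and can be recognized by the suffixes listed above. The function names are made up for illustration, and the default is a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# Suffixes of the index files named in this thread.
INDEX_SUFFIXES = (".canno", ".cnlcp", ".csarr", ".cseq", ".revCat.fasta")


def find_index_files(db_dir):
    """Return the MS-GF+ index files in db_dir, sorted by path."""
    return sorted(
        p for p in Path(db_dir).iterdir()
        if p.is_file() and p.name.endswith(INDEX_SUFFIXES)
    )


def delete_index_files(db_dir, dry_run=True):
    """Delete the index files; with dry_run=True only report them."""
    targets = find_index_files(db_dir)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    for path in delete_index_files(".", dry_run=True):
        print("would delete", path)
```

The original FASTA file itself is left alone because its name does not end in any of the listed suffixes; MS-GF+ rebuilds the index files on the next run.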

FarmGeek4Life commented 7 years ago

Looking closer at this issue (testing with a file that has the same problem), I noticed that it occurs during the "reading spectra" portion of the search, when MS-GF+ pre-reads all spectra in the file for filtering purposes (and to properly divide the spectra among the threads/tasks). What should occur in this case is that those spectra are not read later in the search, and the search should complete afterwards without issue. I have checked switching over to a different mzXML reader, and it exhibits the same issue, while requiring more code to read the file for multithreaded searches (the currently used mzXML parser code is inherently thread-safe; the one I tested is not). If it is the case that this only happens during the mzXML preprocessing steps, then we can "solve" this issue by suppressing the output of these errors.

sampie commented 5 years ago

Hi

This issue still seems to be present in the latest MS-GF+ Release (v2018.09.12) (12 September 2018).

sampie commented 5 years ago

Hi

Also with mzML the crash still happens.

java.lang.ArrayIndexOutOfBoundsException: -80
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getSubsequence(CompactFastaSequence.java:235)
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getMatchingEntry(CompactFastaSequence.java:301)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getDBSequence(MZIdentMLGen.java:701)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.getPeptideEvidenceList(MZIdentMLGen.java:651)
    at edu.ucsd.msjava.mzid.MZIdentMLGen.addSpectrumIdentificationResults(MZIdentMLGen.java:369)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:396)
    at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:106)
    at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:57)

Thanks

FarmGeek4Life commented 5 years ago

What version of Java are you using, and is it 32-bit or 64-bit? Also, how big is your FASTA file?

sampie commented 5 years ago

Hi,

I am using 64 bit java.

openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-0ubuntu0.18.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

The fasta file size is 6.5G

Thanks

alchemistmatt commented 5 years ago

That FASTA file size is enormous; our typical upper bound is 500 MB. When we have FASTA files over 500 MB in size, we typically split them into separate files, run MS-GF+ against each FASTA file, then combine the results by keeping the highest-scoring peptide for each scan number. The problem with large FASTA files is that MS-GF+ caches the protein sequence data and the index files in memory, and a 5 GB file might require 30 GB of RAM for all of the cached data (likely more). Thus, given your FASTA file size, I'm not surprised by the errors you're seeing. The tool we use to split FASTA files is https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter. If you think that would be useful for you, I could make a binary of that console app available.
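The "keep the highest-scoring peptide for each scan number" step could look roughly like the sketch below. It assumes each per-part result has already been parsed into (scan number, peptide, score) tuples and that a higher score is better; that tuple layout is made up for illustration and is not an MS-GF+ output format. For an E-value style score, the comparison would need to be flipped.

```python
def merge_best_per_scan(result_sets):
    """Combine per-FASTA-part results, keeping the top-scoring hit per scan.

    Each result set is an iterable of (scan_number, peptide, score) tuples.
    The tuple layout is an assumption for illustration, not an MS-GF+ format.
    """
    best = {}
    for results in result_sets:
        for scan, peptide, score in results:
            current = best.get(scan)
            # On a tie the hit seen first is kept.
            if current is None or score > current[1]:
                best[scan] = (peptide, score)
    return best


if __name__ == "__main__":
    part1 = [(1, "PEPTIDEK", 42.0), (2, "SAMPLER", 10.0)]
    part2 = [(1, "OTHERK", 17.0), (2, "BETTERR", 55.0), (3, "ONLYHEREK", 8.0)]
    merged = merge_best_per_scan([part1, part2])
    for scan, (peptide, score) in sorted(merged.items()):
        print(scan, peptide, score)
```

This only covers picking one hit per scan. Any statistics that were computed separately for each part would presumably need to be recomputed on the merged list.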

All of that being said, I see that this bug has been affecting you since April 2017. Were you using a large FASTA file back then too?

sampie commented 5 years ago

Hi,

I think memory usage is not anywhere near the max. I have about 500GB of ram.

If MSGF+ limits the RAM usage for some reason, is there a parameter that I can tune to say that there is no need to limit RAM usage?

Yes, I have been using the same FASTA file all this time. I have been using X!Tandem and Comet, and I am hoping to also add MSGF+ search results into the mix.

Thanks

FarmGeek4Life commented 5 years ago

Looking at the internal implementation of how the FASTA file is handled, the errors are most likely caused by the way MS-GF+ reads it: it appears that MS-GF+ reads the entire FASTA file into memory as a single byte array. However, Java arrays are indexed with a 32-bit int, so an array can hold at most 2^31 - 1 elements, or approximately 2 GB for a byte array.
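To illustrate why an oversized file could surface as a negative array index, the sketch below mimics what a Java (int) cast does to a 64-bit offset: only the low 32 bits are kept and read as a signed number. The offset used in the example is hypothetical; the thread does not show which offsets MS-GF+ actually computed, so this is one plausible mechanism rather than a confirmed diagnosis.

```python
INT_MAX = 2**31 - 1  # Integer.MAX_VALUE in Java; array lengths cannot exceed it


def to_int32(value):
    """Reduce an integer to a signed 32-bit value, as a Java (int) cast of a long does."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    fasta_bytes = int(6.5 * 10**9)  # reported file size, taken here as 6.5e9 bytes
    print(fasta_bytes > INT_MAX)    # True: the file cannot fit in one byte[]
    # A hypothetical offset just below 2**32 wraps around to a small negative index.
    print(to_int32(2**32 - 80))     # -80
```

Offsets up to INT_MAX pass through unchanged, which would be consistent with smaller FASTA files not hitting this problem.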

If you are just doing a target (or decoy) search, then you probably need to split that 6.5G fasta file into 4 parts; if you are doing a target+decoy search, then it should be 8. After performing the split searches, you can then merge the result files back together using MzidMerger, which will run on Linux using Mono.
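A minimal sketch of such a split, assuming it is enough to cut the file at record boundaries (lines starting with ">") so that each part stays at or below a byte limit. The function name, the output naming scheme, and the choice of limit are assumptions for illustration; this is not how the Fasta-File-Splitter tool mentioned earlier is implemented.

```python
from pathlib import Path


def split_fasta(src_path, out_prefix, max_bytes):
    """Write src_path into numbered part files of at most max_bytes each.

    Records are kept whole, so a single record larger than max_bytes still
    gets a (too large) part of its own. Returns the list of part paths.
    """
    part_paths = []
    out = None
    written = 0

    def write_record(record):
        nonlocal out, written
        size = sum(len(line) for line in record)
        # Start a new part for the first record, or when this one would not fit.
        if out is None or (written and written + size > max_bytes):
            if out is not None:
                out.close()
            path = Path(f"{out_prefix}_part{len(part_paths) + 1}.fasta")
            part_paths.append(path)
            out = path.open("wb")
            written = 0
        out.writelines(record)
        written += size

    record = []
    with open(src_path, "rb") as src:
        for line in src:
            if line.startswith(b">") and record:
                write_record(record)  # a header line closes the previous record
                record = []
            record.append(line)
    if record:
        write_record(record)
    if out is not None:
        out.close()
    return part_paths
```

The file is streamed one record at a time, so memory use stays small even for a multi-gigabyte input. Going by the comment above, a target+decoy search roughly doubles what has to fit in the array, so halving `max_bytes` in that case would be the cautious choice.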

sampie commented 5 years ago

I am wondering what might happen if this byte array were replaced with a 64-bit implementation. I found this library, which seems to support big structures: https://github.com/vigna/fastutil.

Where in the MS-GF+ source code is this FASTA array?

FarmGeek4Life commented 5 years ago

As the stack trace says, at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getSubsequence(CompactFastaSequence.java:235)

alchemistmatt commented 5 years ago

It's here: https://github.com/MSGFPlus/msgfplus/blob/master/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java#L235

Actually, notice those are casting to (int). That looks like the bug; we need to cast to (long).

alchemistmatt commented 5 years ago

Ah, right, but we can't, due to Java limitations. So the alternative is a 64-bit byte array implementation as suggested by @sampie

FarmGeek4Life commented 5 years ago

So, "sequence" needs to be a big array, and "size" probably needs to be a long.
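The general idea behind a "big array" can be sketched as one large index mapped onto a list of fixed-size chunks. The class below is illustrative only and is not the fastutil API or MS-GF+ code; Python has no 2^31 limit of its own, so it just shows the addressing scheme a Java version would apply with a long index over an array of byte[].

```python
class BigByteArray:
    """Byte storage addressed by one large index but held in fixed-size chunks.

    Illustrative only: the class name and default chunk size are assumptions.
    """

    def __init__(self, size, chunk_size=1 << 27):
        if size < 0 or chunk_size <= 0:
            raise ValueError("size must be >= 0 and chunk_size > 0")
        self.size = size
        self.chunk_size = chunk_size
        full, rest = divmod(size, chunk_size)
        self.chunks = [bytearray(chunk_size) for _ in range(full)]
        if rest:
            self.chunks.append(bytearray(rest))  # shorter final chunk

    def _locate(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return divmod(index, self.chunk_size)  # (chunk number, offset in chunk)

    def __getitem__(self, index):
        chunk, offset = self._locate(index)
        return self.chunks[chunk][offset]

    def __setitem__(self, index, value):
        chunk, offset = self._locate(index)
        self.chunks[chunk][offset] = value


if __name__ == "__main__":
    seq = BigByteArray(10, chunk_size=4)  # tiny chunks to make the layout visible
    seq[9] = 65
    print(len(seq.chunks), seq[9])
```

Each chunk stays well within the single-array limit, while the total size is bounded only by available memory, which is why every index and the stored size would need to be 64-bit in a Java port.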

Jokendo-collab commented 4 years ago

I have a 1.5 GB FASTA file, and the search fails when I run it. Could you help with a binary of the FASTA file splitter? Or how else can I get around this problem, @FarmGeek4Life and @alchemistmatt? This is for a metaproteomic analysis that I am running on my data.

alchemistmatt commented 4 years ago

I have added the FASTA File Splitter executable as a new release, see:

https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter/releases/

MSGFPlus / msgfplus

MS-GF+ crashes for large FASTA files #10