java.lang.ArrayIndexOutOfBoundsException: 0

lydiayliu commented 4 years ago

Describe the bug I'm searching some mzMLs against the RefSeq bacterial database, and all searches have completed except for 2 faa files in the database; the following is one of them. The error seems to be that the suffix array indexed file cannot be created.

Error Message root@87703248a43d:/data# java -Xmx192g -jar /usr/local/bin/MSGFPlus/MSGFPlus.jar -conf /params/conf.txt -d /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa -thread 32 -s /raw/sample.mzML -o sample.bacteria.394.mzid

MS-GF+ Release (v2020.07.02) (5 August 2020)
Java 1.8.0_265 (Private Build)
Linux (amd64, version 3.10.0-1127.el7.x86_64)
Loading database files...
Warning: Sequence database contains 22 counts of letter 'B', which does not correspond to an amino acid.
Warning: Sequence database contains 10 counts of letter 'J', which does not correspond to an amino acid.
Warning: Sequence database contains 292 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 906 counts of letter 'X', which does not correspond to an amino acid.
Warning: Sequence database contains 4 counts of letter 'Z', which does not correspond to an amino acid.
Creating the suffix array indexed file... Size: 0
AlphabetSize: 28
java.lang.ArrayIndexOutOfBoundsException: 0
        at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getByteAt(CompactFastaSequence.java:261)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.createSuffixArrayFiles(CompactSuffixArray.java:340)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:97)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:117)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:253)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:113)
        at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:61)

To Reproduce This is using bacteria.nonredundant_protein.394.protein.faa from RefSeq (from here: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/), with decoys appended to the end of the fasta file, each with the header prefix rev.

Additional context The database has more sequences than those with an invalid letter:

root@87703248a43d:/data# grep '>' /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa | wc -l
218808

And the database is not astronomical in size:

root@87703248a43d:/data# ls -lh /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa
-rw-r--r--. 1 1000 1001 104M Sep  1 20:11 /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa
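
The same sanity check can be done in one pass over the file. The following is a minimal Python sketch, not part of MS-GF+: the function name `summarize_fasta` is hypothetical, and it assumes a plain-text FASTA where every header line starts with `>`. It counts records, total residues, and letters outside the 20 standard amino acids, which should roughly match the `grep` count and the warnings printed above.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def summarize_fasta(lines):
    """Return (record_count, residue_count, Counter of non-standard letters).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    records = 0
    residues = 0
    nonstandard = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records += 1  # one header line per protein entry
            continue
        residues += len(line)
        for ch in line.upper():
            if ch not in STANDARD_AA:
                nonstandard[ch] += 1
    return records, residues, nonstandard
```

To use it, pass an open file handle, for example `summarize_fasta(open(path))`. A residue count far above zero here, next to `Size: 0` in the MS-GF+ log, would suggest the sequences are present in the file but are not being read by the indexer.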

Jokendo-collab commented 4 years ago

@foreverwander what is the database size? Is it more than 500 MB? If yes, then you need to split the database, because MSGF+ does not support databases larger than 500 MB. du -h filename.fasta will show you the size of the database.

lydiayliu commented 4 years ago

The database size is 104M and there are 218808 protein sequences, as I have put in additional context section...

Jokendo-collab commented 4 years ago

@foreverwander that database size is okay. How did you generate the mzML files? Did you select peak picking?

lydiayliu commented 4 years ago

The mzML files were generated using msconvert. In fact I've searched the same mzML file with all 1091 bacteria faa files from refseq, and this is one of the 2 fasta files that didn't run (the other 1089 completed successfully). That's why I believe the problem is with how msgfplus is dealing with the fasta file itself.

lydiayliu commented 4 years ago

And yes peakpicking was selected, full command below: chambm/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert sample.raw --mzML --zlib --filter "peakPicking true 1-" --filter "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState>"

Jokendo-collab commented 4 years ago

@foreverwander if these raw files worked with the other analyses, then the problem may be with the fasta file. What I know is that MSGF+ appends the decoy sequences on the fly, and if your fasta file already has some decoy sequences, then that could be the problem.

lydiayliu commented 4 years ago

I have set the following parameter:

#Target/Decoy search mode
#  0 means don't search decoy database (default)
#  1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
TDA=0

This means that MSGF+ is not trying to append the decoy sequences; in fact, it would give an error, since the database already has decoy sequences. I've searched with other databases where I had pre-appended the decoy sequences and it was fine: MSGF+ didn't try to reverse the sequences again.
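
One way to confirm that a pre-appended database really contains one decoy per target is to count headers by prefix. This is a minimal Python sketch under stated assumptions: `count_targets_and_decoys` is a hypothetical helper, and the default prefix `rev` is taken from the `-decoy rev` setting used in this thread; the exact header format that a given decoy generator writes should be checked against the file itself.

```python
def count_targets_and_decoys(lines, decoy_prefix="rev"):
    """Count FASTA headers that do and do not start with the decoy prefix.

    `lines` is any iterable of text lines; `decoy_prefix` is an assumption
    and must match whatever prefix the decoy entries actually carry.
    """
    targets = 0
    decoys = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # only header lines matter here
        if line[1:].startswith(decoy_prefix):
            decoys += 1
        else:
            targets += 1
    return targets, decoys
```

For a concatenated target-decoy database the two counts would be expected to be equal; a mismatch would point at the decoy generation step rather than at MSGF+.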

lydiayliu commented 4 years ago

Update The problem must be with how MSGF+ is parsing / reading the database file with decoys appended.

root@87703248a43d:/data# java -Xmx48G -cp /usr/local/bin/MSGFPlus/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa -tda 0 -decoy rev
Building suffix array: /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa
Creating the suffix array indexed file... Size: 0
AlphabetSize: 28
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getByteAt(CompactFastaSequence.java:261)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.createSuffixArrayFiles(CompactSuffixArray.java:340)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:97)
        at edu.ucsd.msjava.msdbsearch.BuildSA.buildSAFiles(BuildSA.java:184)
        at edu.ucsd.msjava.msdbsearch.BuildSA.buildSA(BuildSA.java:96)
        at edu.ucsd.msjava.msdbsearch.BuildSA.main(BuildSA.java:56)

Interestingly, there is no error if I do java -Xmx48G -cp /usr/local/bin/MSGFPlus/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d /reference/bacteria.nonredundant_protein.394.protein.faa -decoy rev on the original fasta file without decoys. MSGF+ is able to parse the file, create the revCat.fasta and suffix index.

But there are no sequence description differences between the decoy file that I was originally using and the one created by MSGF+, so I'm really wondering where the error could be. My fasta file with decoys appended was created using the Philosopher database function: $philosopherPath database --annotate $fastaPath --prefix $decoyPrefix
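
Since the headers look identical, the difference may be at the byte level rather than in the visible text. The sketch below is a hedged Python example, not a diagnosis: `find_fasta_anomalies` is a hypothetical helper, and whether MS-GF+ is sensitive to any of these conditions is not established in this thread. It reports Windows line endings, blank lines, records with no sequence, non-ASCII bytes, and content before the first header, so the two files can be compared.

```python
def find_fasta_anomalies(data: bytes) -> dict:
    """Count byte-level oddities in raw FASTA content."""
    report = {
        "crlf_lines": 0,                 # lines ending in \r\n
        "blank_lines": 0,                # lines with only whitespace
        "empty_records": 0,              # headers followed by no residues
        "non_ascii_lines": 0,            # lines with bytes above 0x7F
        "lines_before_first_header": 0,  # sequence text before any '>'
    }
    seen_header = False
    residues_in_record = 0
    for raw in data.splitlines(keepends=True):
        if raw.endswith(b"\r\n"):
            report["crlf_lines"] += 1
        line = raw.strip()
        if any(byte > 127 for byte in line):
            report["non_ascii_lines"] += 1
        if not line:
            report["blank_lines"] += 1
            continue
        if line.startswith(b">"):
            # Close the previous record before starting a new one.
            if seen_header and residues_in_record == 0:
                report["empty_records"] += 1
            seen_header = True
            residues_in_record = 0
        elif not seen_header:
            report["lines_before_first_header"] += 1
        else:
            residues_in_record += len(line)
    # Check the final record as well.
    if seen_header and residues_in_record == 0:
        report["empty_records"] += 1
    return report
```

Running it on both the Philosopher output and the revCat.fasta written by MSGF+ (for example with `find_fasta_anomalies(open(path, "rb").read())`) and diffing the two reports is one way to narrow down where the files differ.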
