Closed lydiayliu closed 2 years ago
@foreverwander what is the database size? If it is more than 500 MB, you need to split the database, because MSGF+ does not support databases larger than 500 MB. Running du -h filename.fasta will show you the size of the database.
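If du is not at hand, the same check can be sketched in a few lines of Python. This is only an illustration: the function name fasta_summary is made up here, and it assumes every record starts with a ">" header line.

```python
import os


def fasta_summary(path):
    """Return (size_in_bytes, number_of_records) for a FASTA file.

    Hypothetical helper for illustration; a record is counted for every
    line that starts with '>'.
    """
    size_bytes = os.path.getsize(path)
    n_records = 0
    # Read as bytes so unusual encodings in the headers cannot raise errors.
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                n_records += 1
    return size_bytes, n_records
```

Calling fasta_summary("filename.fasta") and dividing the first value by 1e6 gives the size in MB next to the protein count.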
The database size is 104M and there are 218808 protein sequences, as I have put in additional context section...
@foreverwander that database size is okay. How did you generate the mzML files? Did you select peak picking?
The mzML files were generated using msconvert. In fact I've searched the same mzML file with all 1091 bacteria faa files from refseq, and this is one of the 2 fasta files that didn't run (the other 1089 completed successfully). That's why I believe the problem is with how msgfplus is dealing with the fasta file itself.
And yes peakpicking was selected, full command below:
chambm/pwiz-skyline-i-agree-to-the-vendor-licenses wine msconvert sample.raw --mzML --zlib --filter "peakPicking true 1-" --filter "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState>"
@foreverwander if these raw files worked with the other analyses, then the problem may be with the fasta file. As far as I know, MSGF+ appends the decoy sequences on the fly, so if your fasta file already contains decoy sequences, that could be the problem.
I have set the following parameter:
#Target/Decoy search mode
# 0 means don't search decoy database (default)
# 1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
TDA=0
This means that MSGF+ is not trying to append the decoy sequences; if it did, it would give an error, since the database already has decoy sequences. I've searched with other databases where I had pre-appended the decoy sequences and it was fine; MSGF+ didn't try to reverse the sequences again.
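One way to confirm what a database actually contains before searching is to count target and decoy headers. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the default prefix "rev_" is a guess at the decoy prefix, so pass whatever prefix your decoys really carry.

```python
def count_target_decoy(path, decoy_prefix="rev_"):
    """Count target and decoy entries in a FASTA file by header prefix.

    Hypothetical helper; decoy_prefix="rev_" is an assumed default and
    should be replaced with the prefix used when the decoys were generated.
    """
    targets = 0
    decoys = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line, not a header
            if line[1:].startswith(decoy_prefix):
                decoys += 1
            else:
                targets += 1
    return targets, decoys
```

For a concatenated target/decoy file, the two counts would normally be expected to match; a decoy count of zero suggests the prefix passed in does not match the file.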
Update The problem must be with how MSGF+ is parsing / reading the database file with decoys appended.
root@87703248a43d:/data# java -Xmx48G -cp /usr/local/bin/MSGFPlus/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa -tda 0 -decoy rev
Building suffix array: /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa
Creating the suffix array indexed file... Size: 0
AlphabetSize: 28
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.getByteAt(CompactFastaSequence.java:261)
at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.createSuffixArrayFiles(CompactSuffixArray.java:340)
at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:97)
at edu.ucsd.msjava.msdbsearch.BuildSA.buildSAFiles(BuildSA.java:184)
at edu.ucsd.msjava.msdbsearch.BuildSA.buildSA(BuildSA.java:96)
at edu.ucsd.msjava.msdbsearch.BuildSA.main(BuildSA.java:56)
Interestingly, there is no error if I do
java -Xmx48G -cp /usr/local/bin/MSGFPlus/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d /reference/bacteria.nonredundant_protein.394.protein.faa -decoy rev
on the original fasta file without decoys. MSGF+ is able to parse the file, create the revCat.fasta and suffix index.
But there are no sequence description differences between the decoy file that I was originally using and the one created by MSGF+, so I'm really wondering where the error could be. My fasta file with decoys appended was created using the philosopher database function:
$philosopherPath database --annotate $fastaPath --prefix $decoyPrefix
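To compare the two decoy databases more systematically than by eye, something along these lines could be used. It is a rough sketch, not a statement about what MSGF+ or philosopher do internally: read_fasta and compare_fasta are hypothetical names, headers are compared verbatim (so different decoy prefixes will show up as differences), and duplicate headers within one file collapse to the last occurrence.

```python
def read_fasta(path):
    """Parse a FASTA file into {header_without_'>': sequence}.

    Hypothetical helper; duplicate headers keep only the last record.
    """
    records = {}
    header = None
    chunks = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def compare_fasta(path_a, path_b):
    """Report header and sequence differences between two FASTA files."""
    a = read_fasta(path_a)
    b = read_fasta(path_b)
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "sequence_differs": sorted(h for h in shared if a[h] != b[h]),
        # Records with a header but no residues are worth a closer look.
        "empty_in_a": sorted(h for h, s in a.items() if not s),
        "empty_in_b": sorted(h for h, s in b.items() if not s),
    }
```

Running compare_fasta on the philosopher-generated file and the revCat.fasta written by MSGF+ would list any entries that exist in only one of them, differ in sequence, or have an empty sequence.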
Describe the bug I'm searching some mzMLs against the RefSeq bacterial database and all searches have completed except for 2 faa files in the database; the following is one of them. The error seems to be that a suffix array indexed file cannot be created.
Error Message
root@87703248a43d:/data# java -Xmx192g -jar /usr/local/bin/MSGFPlus/MSGFPlus.jar -conf /params/conf.txt -d /reference/2020-09-01-decoys-bacteria.nonredundant_protein.394.protein.faa -thread 32 -s /raw/sample.mzML -o sample.bacteria.394.mzid
To Reproduce This is using the bacteria.nonredundant_protein.394.protein.faa from RefSeq, with decoys appended to the end of the fasta file with heading rev. From here: https://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
Additional context The database has more sequences than those with an invalid letter:
And the database is not astronomical in size: