MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Error occurred while building the index - Exception in thread "main" java.lang.NegativeArraySizeException #43

Open lanyuchunmo opened 6 years ago

lanyuchunmo commented 6 years ago

I tried to build an index for a database using this command:

java -Xmx30000M -cp MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d CosmicMutantExport.fasta -tda 1

but I got the following error message:

Exception in thread "main" java.lang.NegativeArraySizeException
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.readSequence(CompactFastaSequence.java:423)
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.<init>(CompactFastaSequence.java:98)
    at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.<init>(CompactFastaSequence.java:63)
    at edu.ucsd.msjava.msdbsearch.BuildSA.buildSAFiles(BuildSA.java:110)
    at edu.ucsd.msjava.msdbsearch.BuildSA.buildSA(BuildSA.java:76)
    at edu.ucsd.msjava.msdbsearch.BuildSA.main(BuildSA.java:48)

At first I thought there was something wrong with the database file, so I switched to a different database file, but I still got this error. Could anyone help me figure out how to solve this problem?

alchemistmatt commented 6 years ago

I don't recognize that error. How large is your FASTA file (CosmicMutantExport.fasta)? The NegativeArraySizeException error implies some sort of overflow. Perhaps the file has some extra long protein sequences (or extra long protein descriptions). You could try processing the file with Protein Digestion Simulator using the validate FASTA file option to check for errors. Next, go to the Fixed Fasta Options tab and enable "Generate fixed Fasta file" and create a new FASTA file, then try to index that new file.

Also, I've never run MSGF+ with 30 GB of memory; perhaps try just 10 GB (10000M). As for diagnosing, you might have to send us that FASTA file so that we can run MS-GF+ in debug mode to pinpoint the error.
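
If Protein Digestion Simulator is not an option, a rough stand-in for the size check can be sketched in plain Java. This is not part of MS-GF+ and does not validate or fix the file; the class name FastaStats and its fields are made up for illustration. It only reports the number of entries, the total residue count, the longest sequence, and the longest header line, which is enough to see whether anything is unusually large.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not MS-GF+ code: summarizes the size of a FASTA file.
final class FastaStats {
    long entries;          // number of header lines (lines starting with '>')
    long totalResidues;    // residues summed over all entries
    long longestSequence;  // residues in the longest single entry
    int longestHeader;     // characters in the longest header, without '>'

    /** Streams the FASTA text line by line so large files are never held in memory. */
    static FastaStats scan(Reader source) {
        FastaStats stats = new FastaStats();
        long currentLength = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // A new header closes the previous entry.
                    stats.longestSequence = Math.max(stats.longestSequence, currentLength);
                    currentLength = 0;
                    stats.entries++;
                    stats.longestHeader = Math.max(stats.longestHeader, line.length() - 1);
                } else {
                    int residues = line.trim().length();
                    currentLength += residues;
                    stats.totalResidues += residues;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Account for the last entry, which has no header after it.
        stats.longestSequence = Math.max(stats.longestSequence, currentLength);
        return stats;
    }

    public static void main(String[] args) throws IOException {
        // With no argument, a tiny in-memory example is scanned instead of a file.
        Reader source = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)
                : new StringReader(">P1 demo\nMKV\nLA\n>P2 demo two\nGG\n");
        FastaStats s = scan(source);
        System.out.printf("entries=%d residues=%d longestSequence=%d longestHeader=%d%n",
                s.entries, s.totalResidues, s.longestSequence, s.longestHeader);
    }
}
```

The counters are long on purpose, so the report itself cannot overflow on a very large file.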

tivdnbos commented 6 years ago

Hi,

I'm having the same problem. I'm working with a big FASTA file (3.7 million bacterial protein sequences in the revCat.fasta file), with plenty of memory. I'm working on Linux, but Wine won't open the Protein Digestion Simulator. Has the problem described above been solved yet?

Cheers, Tim

lanyuchunmo commented 6 years ago

Hi, I have the same problem again. This time my database file is 1.6 GB and contains 13,087,546 protein/peptide sequences, derived from the antibody repertoire of an immunized animal sample, plus common contaminant proteins. I use the frequency of each sequence in the repertoire as its name (sequence ID), so many sequences have the same name. Could the naming be the problem?

The following is the command: java -Xmx10000M -cp MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d antibodyRepertoire.targetDatabase.fa -tda 0

I have now shared my compressed database file online. This is the download link: https://pan.genomics.cn/ucdisk/s/EJra2i

FarmGeek4Life commented 5 years ago

This is a problem with the implementation of the search in MS-GF+, and limitations of Java. Java uses a signed 32-bit integer as the index for an array, which limits an array to ~2.147 billion (2^31 - 1) entries; MS-GF+ accesses all peptides in the FASTA file in a way that means each residue is one entry in an array. Your database file, at 1.6 GB, is not big enough to have this problem for just a target or decoy search; however, when creating the concatenated target/decoy files for a combined target and decoy search, the number of residues is doubled, which puts it over this limit.
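
To make the arithmetic concrete, here is a minimal sketch of how doubling a residue count in int arithmetic wraps around to a negative number, and how allocating an array of that size raises NegativeArraySizeException. It is illustrative only and is not taken from the MS-GF+ source; the class name, the method names, and the figure of 1.6 billion residues (roughly one residue per byte of a 1.6 GB file) are assumptions.

```java
// Illustrative only, not MS-GF+ source code.
final class ArrayIndexOverflowDemo {

    /** Doubles the count with int arithmetic, which overflows silently. */
    static int concatenatedSizeAsInt(int targetResidues) {
        return targetResidues * 2;
    }

    /** Doubles the count with long arithmetic, which does not overflow at this scale. */
    static long concatenatedSizeAsLong(int targetResidues) {
        return 2L * targetResidues;
    }

    /** True if an array with this many elements can be addressed by a Java int index. */
    static boolean fitsInJavaArray(long elements) {
        return elements >= 0 && elements <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int targetResidues = 1_600_000_000; // assumed count for a ~1.6 GB FASTA file
        int wrapped = concatenatedSizeAsInt(targetResidues);
        long exact = concatenatedSizeAsLong(targetResidues);

        System.out.println("target + decoy as int:  " + wrapped);
        System.out.println("target + decoy as long: " + exact);
        System.out.println("fits in one array:      " + fitsInJavaArray(exact));

        try {
            // A negative length is rejected before any memory is allocated,
            // so raising -Xmx cannot help.
            byte[] sequence = new byte[wrapped];
            System.out.println("allocated " + sequence.length + " bytes");
        } catch (NegativeArraySizeException e) {
            System.out.println("caught: " + e);
        }
    }
}
```

Because the failure is in the size computation rather than in available heap, changing the -Xmx value does not affect it.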

bernt-matthias commented 5 years ago

Same problem here.

jspmccain commented 5 years ago

Any ideas for tackling this issue? Maybe running the target and decoy searches separately and then aggregating the output?

FarmGeek4Life commented 5 years ago

One option is splitting up the fasta file into several smaller fasta files, then running the search on each of those. When the searches all complete you can use the MzidMerger to re-combine the results.
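
A minimal sketch of the splitting step, assuming a plain-text FASTA file and a residue budget per part that you choose yourself. The class FastaSplitter, the part file names, and the budget parameter are all hypothetical; nothing here comes from MS-GF+ or MzidMerger. Entries are never cut in half: a new part starts at the first header seen after the budget is reached, so a part can overshoot the budget by at most one entry.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MS-GF+: splits a FASTA file into smaller files.
final class FastaSplitter {

    /** Streams the input and writes part1.fasta, part2.fasta, ... into outputDir. */
    static List<Path> split(Path input, Path outputDir, long residuesPerPart) {
        List<Path> parts = new ArrayList<>();
        BufferedWriter out = null;
        long residuesInPart = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.ISO_8859_1)) {
            Files.createDirectories(outputDir);
            String line;
            while ((line = in.readLine()) != null) {
                boolean header = line.startsWith(">");
                // Roll over only at a header, so each entry stays in one file.
                if (header && (out == null || residuesInPart >= residuesPerPart)) {
                    if (out != null) {
                        out.close();
                    }
                    Path part = outputDir.resolve("part" + (parts.size() + 1) + ".fasta");
                    out = Files.newBufferedWriter(part, StandardCharsets.ISO_8859_1);
                    parts.add(part);
                    residuesInPart = 0;
                }
                if (out == null) {
                    continue; // ignore any text before the first header
                }
                out.write(line);
                out.newLine();
                if (!header) {
                    residuesInPart += line.trim().length();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (out != null) {
                try {
                    out.close(); // closing an already closed writer is a no-op
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: FastaSplitter <input.fasta> <outputDir> <residuesPerPart>");
            return;
        }
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), Long.parseLong(args[2]));
        parts.forEach(System.out::println);
    }
}
```

Since a combined target/decoy index doubles the residue count, it seems prudent to pick a budget well under half of the ~2.147 billion limit. Each part can then be indexed and searched on its own before the results are recombined.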

jspmccain commented 5 years ago

Cool, thanks so much!

bernt-matthias commented 5 years ago

Splitting the fasta definitely helps for the computational problems. But I'm wondering how FDRs are treated?

bernt-matthias commented 5 years ago

Could OpenMS' IDMerger + FalseDiscoveryRate work here? We are using MSGFPlus via OpenMS' adapter anyway.