MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Java.lang.NegativeArraySizeException #140

Open Luxxii opened 2 years ago

Luxxii commented 2 years ago

Describe the bug

Hello everyone, I am currently trying to index a large peptide FASTA file (~50 GB) for peptide searches. The file contains 85748938 entries of short peptides (all of them unique). I am using the BuildSA class and call it as follows:

java -Xmx256000M -cp <PATH>/MSGFPLUS_v20220418/MSGFPlus.jar edu.ucsd.msjava.msdbsearch.BuildSA -d peptides.fasta -tda 1 -decoy XXX

and I get the following error from MS-GF+:

Creating peptides.revCat.fasta.
Building suffix array: /mntc/<PATH>/work/f5/71e50c34429da341c0ad240e4f40ed/peptides.revCat.fasta
Exception in thread "main" java.lang.NegativeArraySizeException: -541141435
        at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.readSequence(CompactFastaSequence.java:542)
        at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.<init>(CompactFastaSequence.java:139)
        at edu.ucsd.msjava.msdbsearch.CompactFastaSequence.<init>(CompactFastaSequence.java:89)
        at edu.ucsd.msjava.msdbsearch.BuildSA.buildSAFiles(BuildSA.java:144)
        at edu.ucsd.msjava.msdbsearch.BuildSA.buildSA(BuildSA.java:96)
        at edu.ucsd.msjava.msdbsearch.BuildSA.main(BuildSA.java:56)

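The trace points to these lines: https://github.com/MSGFPlus/msgfplus/blob/11b6e2e5a1caac0f429a0bd3bffda6672853abae/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java#L531-L552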

I was wondering whether this error could be fixed quickly, since I would like to use MS-GF+ for identification even with FASTA files as large as the one I am using here. Maybe it is simply a matter of using long instead of int, because an overflow may be happening here. But I cannot judge whether other places would need to be adjusted as well.

FarmGeek4Life commented 2 years ago

This issue is caused by a limitation of the current Java implementation of MS-GF+, and fixing it is neither simple nor quick. The problem is that array sizes overflow, and fixing it would involve changing arrays in many places to an array type that supports indexing with long instead of int.
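
To illustrate the overflow described above, here is a minimal, standalone Java sketch. It is not MS-GF+ code: the class `ArraySizeOverflowDemo` and its methods are hypothetical, and the value 3753825861 is just one of many long values that narrow to the reported -541141435, picked for illustration (the thread does not say what the actual computed size was).

```java
public class ArraySizeOverflowDemo {

    // Casting long to int keeps only the low 32 bits, so a large
    // positive size can silently become a negative int.
    static int narrow(long size) {
        return (int) size;
    }

    // Tries to allocate a byte array using the narrowed size and reports
    // whether Java rejected it with NegativeArraySizeException.
    static boolean allocationFails(long size) {
        try {
            byte[] buffer = new byte[narrow(size)];
            return false;
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    // Math.toIntExact throws ArithmeticException instead of wrapping,
    // which turns a silent overflow into an explicit error.
    static boolean fitsInInt(long size) {
        try {
            Math.toIntExact(size);
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long illustrativeSize = 3_753_825_861L; // hypothetical value, not taken from the failing run
        System.out.println(narrow(illustrativeSize));          // prints -541141435
        System.out.println(allocationFails(illustrativeSize)); // prints true
        System.out.println(fitsInInt(illustrativeSize));       // prints false
    }
}
```

Java arrays are always indexed by int, so a checked conversion like this only makes the failure explicit; it does not lift the size limit, which is why the fix would need a different array type.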

We do have other tools that we use for splitting fasta files into small enough sizes, then searching the data files with each fasta file, and then merging all results for a single data file back into one mzid file.
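
The splitting step can be sketched roughly as below. This is not the tool mentioned above: `FastaSplitter`, its `split` method, the `part_NNNN.fasta` naming, and the per-part entry limit are all assumptions made for illustration. The sketch only cuts at header lines (those starting with `>`), so no entry is divided between two parts. Searching each part and merging the results into one mzid file are not covered here.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FastaSplitter {

    // Writes the entries of `input` into numbered part files in `outDir`,
    // each holding at most `maxEntriesPerPart` entries. Returns the part paths.
    static List<Path> split(Path input, Path outDir, long maxEntriesPerPart) throws IOException {
        if (maxEntriesPerPart < 1) {
            throw new IllegalArgumentException("maxEntriesPerPart must be at least 1");
        }
        Files.createDirectories(outDir);
        List<Path> parts = new ArrayList<>();
        BufferedWriter writer = null;
        long entriesInPart = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Start a new part at the first header, or when the current part is full.
                    if (writer == null || entriesInPart >= maxEntriesPerPart) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path part = outDir.resolve(String.format("part_%04d.fasta", parts.size() + 1));
                        parts.add(part);
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        entriesInPart = 0;
                    }
                    entriesInPart++;
                }
                // Lines that appear before the first header are skipped.
                if (writer != null) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: FastaSplitter <input.fasta> <outDir> <maxEntriesPerPart>");
            return;
        }
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), Long.parseLong(args[2]));
        System.out.println("Wrote " + parts.size() + " part(s)");
    }
}
```

Limiting by entry count keeps the sketch short. For a real database you would probably limit by total sequence length instead, since the overflow depends on the size of the sequence data rather than on the number of entries.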



Luxxii commented 2 years ago

Thanks for the quick answer and the clarification! Yes, splitting FASTA files is always an option. However, I would look forward to being able to run a search against a single large FASTA file.

If this is not a priority or not planned, then you can close this issue.