"[ProteinList_FASTA::createIndex] duplicate protein id" and "Java heap space"

iquasere commented 3 years ago

I am trying to perform peptide-spectrum matching (PSM) with SearchCLI. The input is a set of peak-picked MGF files, and I am trying to use MyriMatch, X!Tandem and MS-GF+. Both MyriMatch and MS-GF+ fail, while X!Tandem manages to finish the matching.

MyriMatch claims there is a duplicated protein id in the database, which doesn't make sense: grep 'WP_100909616.1' /mnt/HDDStorage/jsequeira/metaproteomics/database_concatenated_target_decoy.fasta gives

>WP_100909616.1 sulfurtransferase complex subunit TusB [Methanobacterium subterraneum]
>WP_100909616.1 sulfurtransferase complex subunit TusB [Methanobacterium subterraneum]_REVERSED

MS-GF+ claims to run out of memory when creating suffixes:

Suffix creation: 24.09% complete.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOf(Arrays.java:3793)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray$1Bucket.add(CompactSuffixArray.java:275)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.createSuffixArrayFiles(CompactSuffixArray.java:337)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:90)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:110)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:207)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
        at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

SearchGUI version: 3.3.9; Java version: openjdk 11.0.1

This is the full log for one file.

Tue Dec 01 11:57:25 UTC 2020 Validating MGF file: /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp1a_3.mgf
10% 20% 30% 40% 50% 60% 70% 80% 90%
Tue Dec 01 11:57:25 UTC 2020 Warning: The file 'pp1a_3.mgf' contains zero intensity peaks. It is highly recommended to apply peak picking before starting a search!
Reindexing: database_concatenated_target_decoy.fasta.

Tue Dec 01 11:58:04 UTC 2020 Indexing spectrum files.
Tue Dec 01 11:58:04 UTC 2020 Extracting search settings.

Processing: pp0.01a_3.mgf (1/90)

xtandem command:
/home/jsequeira/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/resources/XTandem/linux/linux_64bit/tandem /home/jsequeira/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/resources/XTandem/linux/linux_64bit/input_searchGUI.xml

Tue Dec 01 11:58:04 UTC 2020 Processing pp0.01a_3.mgf with X!Tandem.

X! TANDEM Vengeance (2015.12.15.2)

Loading spectra (mgf). loaded.
Spectra matching criteria = 23
Starting threads .............. started.
Computing models:
        Spectrum-to-sequence matching process in progress | 50 ks
        Spectrum-to-sequence matching process in progress | 100 ks
        Spectrum-to-sequence matching process in progress | 150 ks
        Spectrum-to-sequence matching process in progress | 200 ks
        Spectrum-to-sequence matching process in progress | 250 ks
        Spectrum-to-sequence matching process in progress | 300 ks
        Spectrum-to-sequence matching process in progress | 350 ks
        Spectrum-to-sequence matching process in progress | 400 ks
        Spectrum-to-sequence matching process in progress | 450 ks
        Spectrum-to-sequence matching process in progress | 500 ks
        Spectrum-to-sequence matching process in progress | 550 ks
        Spectrum-to-sequence matching process in progress | 600 ks
        Spectrum-to-sequence matching process in progress | 650 ks
        Spectrum-to-sequence matching process in progress | 700 ks
        Spectrum-to-sequence matching process in progress | 750 ks
        Spectrum-to-sequence matching process in progress | 800 ks
        Spectrum-to-sequence matching process in progress | 850 ks
        Spectrum-to-sequence matching process in progress | 900 ks
        Spectrum-to-sequence matching p sequences modelled = 931 ks
Model refinement:
        partial cleavage  done.
        unanticipated cleavage  done.
        finishing refinement ... done.
Merging results:
        from 234567891011121314

Creating report:
        initial calculations  ..... done.
        sorting  ..... done.
        finding repeats ..... done.
        evaluating results ..... done.
        calculating expectations ..... done.
        writing results ..... done.

Valid models = 0

Tue Dec 01 11:58:29 UTC 2020 X!Tandem finished for /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp0.01a_3.mgf (24.6 seconds).

myrimatch command:
/home/jsequeira/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/resources/MyriMatch/linux/linux_64bit/myrimatch -cpus 14 -ProteinDatabase /mnt/HDDStorage/jsequeira/metaproteomics/database_concatenated_target_decoy.fasta /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp0.01a_3.mgf -OutputFormat mzIdentML -workdir /mnt/HDDStorage/jsequeira/metaproteomics/.SearchGUI_temp -OutputSuffix .myrimatch -DecoyPrefix "" -MinPeptideLength 8 -MaxPeptideLength 30 -MaxResultRank 10 -SpectrumListFilters "" -FragmentMzTolerance "0.02 daltons" -MonoPrecursorMzTolerance "10.0 ppm" -PrecursorMzToleranceRule "mono" -StaticMods "C 57.021464" -DynamicMods "M 0 15.994915 ( 1 42.010565" -MaxDynamicMods 2 -StatusUpdateFrequency 10 -NumChargeStates 4+ -TicCutoffPercentage 0.98 -MinPeptideMass 600.0 -MaxPeptideMass 5000.0 -UseSmartPlusThreeModel true -ComputeXCorr false -NumIntensityClasses 3 -ClassSizeMultiplier 2 -NumBatches 50 -MaxPeakCount 300 -MonoisotopeAdjustmentSet [0,1] -FragmentationAutoRule false -FragmentationRule "cid" -CleavageRules "Trypsin" -MinTerminiCleavages 2 -MaxMissedCleavages 2

Tue Dec 01 11:58:29 UTC 2020 Processing pp0.01a_3.mgf with MyriMatch.

Process #0 (bridgeserver) is starting.
MyriMatch 2.2.10165 (2016-11-7)
FreiCore 1.6.11103 (2017-7-14)
ProteoWizard MSData 3.0.11841 (2018-3-8)
ProteoWizard Proteome 3.0.11579 (2017-11-14)
Vanderbilt University (c) 2012, D.Tabb/M.Chambers/S.Dasari
Licensed under the Apache License, Version 2.0

Could not find the default configuration file (hard-coded defaults in use).
Reading "/mnt/HDDStorage/jsequeira/metaproteomics/database_concatenated_target_decoy.fasta"
Process #0 (bridgeserver) had an error: [ProteinList_FASTA::createIndex] duplicate protein id "WP_100909616.1"

Tue Dec 01 11:58:29 UTC 2020 MyriMatch finished for /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp0.01a_3.mgf (125.0 milliseconds).

Tue Dec 01 11:58:29 UTC 2020 Could not find MyriMatch result file for pp0.01a_3.mgf.

ms-gf+ command:
/home/jsequeira/anaconda3/envs/proteomics/bin/java -Xms512m -Xmx1g -jar /home/jsequeira/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/resources/MS-GF+/MSGFPlus.jar -s /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp0.01a_3.mgf -d /mnt/HDDStorage/jsequeira/metaproteomics/database_concatenated_target_decoy.fasta -o /mnt/HDDStorage/jsequeira/metaproteomics/.SearchGUI_temp/pp0.01a_3.msgf.mzid -t 10.0ppm -tda 0 -mod /home/jsequeira/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/resources/MS-GF+/params/Mods.txt -minCharge 2 -maxCharge 4 -inst 3 -thread 14 -m 3 -e 1 -ntt 2 -protocol 0 -minLength 8 -maxLength 30 -n 10 -addFeatures 0 -ti 0,1

Tue Dec 01 11:58:29 UTC 2020 Processing pp0.01a_3.mgf with MS-GF+.

MS-GF+ Release (v2018.04.09) (9 April 2018)
Loading database files...
Warning: Sequence database contains 12 counts of letter 'B', which does not correspond to an amino acid.
Warning: Sequence database contains 2 counts of letter 'J', which does not correspond to an amino acid.
Warning: Sequence database contains 734 counts of letter 'U', which does not correspond to an amino acid.
Warning: Sequence database contains 2962 counts of letter 'X', which does not correspond to an amino acid.
Creating the suffix array indexed file... Size: 290522531
AlphabetSize: 28
Suffix creation: 0.00% complete.
Suffix creation: 3.44% complete.
Suffix creation: 6.88% complete.
Suffix creation: 10.33% complete.
Suffix creation: 13.77% complete.
Suffix creation: 17.21% complete.
Suffix creation: 20.65% complete.
Suffix creation: 24.09% complete.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOf(Arrays.java:3793)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray$1Bucket.add(CompactSuffixArray.java:275)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.createSuffixArrayFiles(CompactSuffixArray.java:337)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:90)
        at edu.ucsd.msjava.msdbsearch.CompactSuffixArray.<init>(CompactSuffixArray.java:110)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:207)
        at edu.ucsd.msjava.ui.MSGFPlus.runMSGFPlus(MSGFPlus.java:105)
        at edu.ucsd.msjava.ui.MSGFPlus.main(MSGFPlus.java:56)

Tue Dec 01 11:59:12 UTC 2020 MS-GF+ finished for /mnt/HDDStorage/jsequeira/metaproteomics/Metaproteomics/Sample/spectra/pp0.01a_3.mgf (43.0 seconds).

Tue Dec 01 11:59:12 UTC 2020 Could not find MS-GF+ result file for pp0.01a_3.mgf.

iquasere commented 3 years ago

Trying with the Windows GUI, the results are the same. X!Tandem finishes and can build several models (in the example above it built 0 models, but it was fine with other files), while MyriMatch and MS-GF+ have the same problems. I was using these versions because they are the most recent available through Bioconda, and I had problems using SearchGUI 4 and PeptideShaker 2 through conda.

hbarsnes commented 3 years ago

MyriMatch claims there is a duplicated protein id in the database, which doesn't make sense

I'm afraid MyriMatch has its own internal FASTA header parsing, which we cannot control. This is usually not an issue, but in your example it seems to assume that "WP_100909616.1" is the accession number for both entries. I think the only way around this would be to reformat your FASTA headers to make MyriMatch happy, ideally using our non-standard FASTA format.
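One possible way to reshape the headers is sketched below in Python. It assumes that MyriMatch takes the first whitespace-delimited token of each header as the protein id (which is what the error message suggests, but is not confirmed here) and that decoy entries are marked by a trailing `_REVERSED`, as in the two headers quoted above. The function names are made up for illustration, the sketch does not convert to the SearchGUI-specific FASTA format, and whether the rewritten decoy ids are still recognized as decoys downstream would need to be checked.

```python
DECOY_SUFFIX = "_REVERSED"  # trailing decoy marker seen in the headers quoted in this thread


def first_token(header):
    """Return the first whitespace-delimited token of a FASTA header line."""
    tokens = header[1:].split(None, 1)
    return tokens[0] if tokens else ""


def duplicate_ids(lines):
    """Return, sorted, the first tokens that occur on more than one header line."""
    seen, dupes = set(), set()
    for line in lines:
        if line.startswith(">"):
            token = first_token(line)
            if token in seen:
                dupes.add(token)
            else:
                seen.add(token)
    return sorted(dupes)


def make_decoy_ids_unique(lines):
    """Yield FASTA lines, moving a trailing decoy suffix onto the first token.

    '>ID description_REVERSED' becomes '>ID_REVERSED description'.
    Sequence lines and target headers are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and line.endswith(DECOY_SUFFIX):
            parts = line[1:-len(DECOY_SUFFIX)].split(None, 1)
            if parts:  # leave a header that is nothing but the suffix untouched
                rest = " " + parts[1] if len(parts) > 1 else ""
                line = ">" + parts[0] + DECOY_SUFFIX + rest
        yield line
```

Running `duplicate_ids` on the original file and again on the rewritten one gives a quick check that no first token is shared between a target and its decoy any more.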

MS-GF+ claims to run out of memory when creating suffixes

MS-GF+ uses the same memory settings as the ones given to SearchGUI. So if you are using the command line you have to add the -Xmx option to give Java, and consequently MS-GF+, more memory. In the GUI version of SearchGUI you can increase the memory provided via Edit > Java Settings.

iquasere commented 3 years ago

I guess 1 GB is insufficient for MS-GF+ because of the size of the database (930648 sequences). There is nothing in the parameters file that allows setting more memory; where could I tweak this through SearchCLI?

hbarsnes commented 3 years ago

There is nothing in the parameters file that allows setting more memory; where could I tweak this through SearchCLI?

As mentioned above, you simply have to add the standard -Xmx Java option to your SearchCLI command line, e.g.:

java -Xmx2048M -cp SearchGUI-X.Y.Z.jar eu.isas.searchgui.cmd.SearchCLI [parameters]

iquasere commented 3 years ago

Ok, I see it now. The symlink for running searchgui from Bioconda uses 4 GB of memory, but when running SearchCLI it likely defaults to 1 GB. As far as I can tell, this value can only be changed by calling the script directly instead of using the symlink:

~/anaconda3/envs/proteomics/bin/java -splash:resources/conf/searchgui-splash.png -Xms128M -Xmx4096M -cp ~/anaconda3/envs/proteomics/share/searchgui-3.3.9-1/SearchGUI-3.3.9.jar eu.isas.searchgui.cmd.SearchCLI -spectrum_files metaproteomics/test -output_folder metaproteomics -id_params metaproteomics/params.par -threads 14 -xtandem 1 -myrimatch 1 -msgf 1

uses 4 GB, but

searchgui eu.isas.searchgui.cmd.SearchCLI -spectrum_files metaproteomics/test -output_folder metaproteomics -id_params metaproteomics/params.par -threads 14 -xtandem 1 -myrimatch 1 -msgf 1

will always use 1 GB. If I'm not mistaken, I cannot use the symlink with a different memory setting. If so, this could be a parameter for a future version, if it isn't already!

hbarsnes commented 3 years ago

I'm not familiar with the conda setup myself, but you can verify how much memory is given to MS-GF+ by checking the SearchGUI log file where you will see the exact MS-GF+ command line used.

I will check with the developer in charge of the conda setup and get back to you.
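The log check suggested here could be scripted roughly as follows. This is only a sketch: it assumes the MS-GF+ command line appears in the log on a single line containing `MSGFPlus.jar` and a `-Xmx` option with an optional `k`, `m` or `g` suffix, as in the log posted at the top of this thread. The function name `msgf_heap_limits` is made up for illustration.

```python
import re

# -Xmx followed by a number and an optional unit (k, m or g, either case)
_XMX = re.compile(r"-Xmx(\d+)([kKmMgG]?)\b")
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def msgf_heap_limits(log_text):
    """Return the -Xmx value, in bytes, of every MS-GF+ command line in log_text."""
    limits = []
    for line in log_text.splitlines():
        if "MSGFPlus.jar" not in line:
            continue  # not an MS-GF+ command line
        match = _XMX.search(line)
        if match:
            number, unit = match.groups()
            limits.append(int(number) * _UNITS[unit.lower()])
    return limits
```

For the command line in the log above, which contains `-Xmx1g`, this returns a single entry of 1073741824 bytes, i.e. the 1 GB limit that MS-GF+ ran into.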

hbarsnes commented 3 years ago

It seems like all you have to do is add the -Xmx option there as well, i.e.:

searchgui eu.isas.searchgui.cmd.SearchCLI -Xmx4096M -spectrum_files [...]

iquasere commented 3 years ago

You are right, sorry for the hassle xD MS-GF+ works perfectly that way. As for MyriMatch, I'm going to have to reshape those IDs. Thank you very much for the assistance!

compomics/searchgui, issue #268