compomics / peptide-shaker

Interpretation of proteomics identification results
http://compomics.github.io/projects/peptide-shaker.html

identXML export problem #447

Closed bernt-matthias closed 3 years ago

bernt-matthias commented 3 years ago

I have the following in my log file (PS 2.0.15 via Galaxy)

java.lang.StringIndexOutOfBoundsException: begin 1633, end 584, length 584
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
    at java.base/java.lang.String.substring(String.java:1874)
    at com.compomics.util.experiment.identification.protein_inference.fm_index.FMIndex.getSubsequence(FMIndex.java:5750)
    at com.compomics.util.experiment.identification.utils.PeptideUtils.getAaBefore(PeptideUtils.java:77)
    at eu.isas.peptideshaker.export.MzIdentMLExport.writeSequenceCollection(MzIdentMLExport.java:870)
    at eu.isas.peptideshaker.export.MzIdentMLExport.createMzIdentMLFile(MzIdentMLExport.java:298)
    at eu.isas.peptideshaker.cmd.CLIExportMethods.exportMzId(CLIExportMethods.java:523)
    at eu.isas.peptideshaker.cmd.PeptideShakerCLI.call(PeptideShakerCLI.java:586)
    at eu.isas.peptideshaker.cmd.PeptideShakerCLI.main(PeptideShakerCLI.java:1389)

hbarsnes commented 3 years ago

@dominik-kopczynski Can you please take a look at this one? I think we've seen something similar before?

hbarsnes commented 3 years ago

@bernt-matthias Would it be possible for you to share the data so that we can try to reproduce the issue on our end?

bernt-matthias commented 3 years ago

Here you go: http://139.18.2.180/~maze/searchgui_input.zip. Please ping me once you have it so that I can remove it from there.

The executed command line is

mkdir output_reports && 
cwd=`pwd` && 
export HOME=$cwd &&  
ln -s '/gpfs1/data/galaxy_server/galaxy/database/files/000/338/dataset_338492.dat' searchgui_input.zip &&  
jar xvf searchgui_input.zip SEARCHGUI_IdentificationParameters.par &&   
peptide-shaker -Djava.awt.headless=true eu.isas.peptideshaker.cmd.PeptideShakerCLI -gui 0 -temp_folder $cwd/PeptideShakerCLI -log $cwd/resources -reference 'Galaxy_Experiment_2021032415271616596042' -identification_files $cwd/searchgui_input.zip -id_params $cwd/SEARCHGUI_IdentificationParameters.par   -threads "${GALAXY_SLOTS:-12}"   -output_file $cwd/output.mzid -include_sequences 0 -contact_first_name "Proteomics" -contact_last_name "Galaxy" -contact_email "galaxyp@umn.edu" -contact_address "galaxyp@umn.edu" -organization_name "University of Minnesota" -organization_email "galaxyp@umn.edu" -organization_address "Minneapolis, MN 55455, Vereinigte Staaten"  -out_reports $cwd/output_reports -reports 3,9,6

This also produces the error/warning from https://github.com/compomics/peptide-shaker/issues/448

hbarsnes commented 3 years ago

Thanks for sharing the data. You can now remove it. I can also confirm that I've been able to reproduce the issue. I will look into it some more and get back to you.

BTW, are you aware that your FASTA file does not contain any decoys?

bernt-matthias commented 3 years ago

> BTW, are you aware that your FASTA file does not contain any decoys?

Actually, no. I have forwarded the info to the user.

hbarsnes commented 3 years ago

Ok, so the problem is the non-standard FASTA headers. For example, for a header such as ">A0A1Q3NBR6|unreviewed|Pyruvate", the accession number is assumed to be "unreviewed". Thus, all headers of this type end up with the same accession number. This later causes the above issue: we end up referring to the wrong protein sequence and get the StringIndexOutOfBoundsException.
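
To make the collision concrete, here is a minimal Python sketch. It assumes the accession is taken from the second `|`-delimited field, as in UniProt-style headers such as `>sp|P12345|NAME`. The helper `assumed_accession` and the second header are hypothetical, and this is not the actual parsing code in compomics-utilities.

```python
def assumed_accession(header: str) -> str:
    """Return the second '|'-delimited field of a FASTA header.

    Hypothetical helper that mimics the assumption described above;
    it is not the parser used by PeptideShaker or SearchGUI.
    """
    fields = header.lstrip(">").split("|")
    # UniProt-style headers look like ">sp|P12345|NAME", so the second
    # field is normally the accession.
    return fields[1] if len(fields) > 1 else fields[0]


headers = [
    ">A0A1Q3NBR6|unreviewed|Pyruvate",  # header quoted in this issue
    ">B0B2C3D4E5|unreviewed|Example",   # made-up header of the same shape
]

accessions = [assumed_accession(h) for h in headers]
print(accessions)  # both headers collapse to the same key
```

With both proteins filed under the same key, a residue position computed for one protein can be applied to the shorter sequence of another, which is consistent with the out-of-range `substring` call in the stack trace.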

The solution is to reformat the headers: https://github.com/compomics/searchgui/wiki/DatabaseHelp#non-standard-fasta.
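
As a rough illustration of such a reformatting step, the Python sketch below rewrites headers of the form `>ACCESSION|status|description` so that the accession lands in the second field. The target layout `>generic|ACCESSION|description` is an assumption here, so please check the linked wiki page for the format that is actually expected. The function names are hypothetical.

```python
def to_generic_header(header: str) -> str:
    """Rewrite '>ACC|status|description' as '>generic|ACC|status description'.

    Assumes the first '|'-delimited field is the real accession.
    """
    fields = header.lstrip(">").rstrip("\n").split("|")
    accession = fields[0]
    # Keep everything after the accession as a free-text description.
    description = " ".join(fields[1:])
    return f">generic|{accession}|{description}"


def reformat_fasta(lines):
    """Yield FASTA lines with headers rewritten and sequence lines untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield to_generic_header(line) if line.startswith(">") else line


example = [">A0A1Q3NBR6|unreviewed|Pyruvate\n", "MKVLAA\n"]
print(list(reformat_fasta(example)))
```

Passing an open file handle to `reformat_fasta` and writing the yielded lines to a new file would convert a whole database. The decoys would then need to be generated from the reformatted file.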

bernt-matthias commented 3 years ago

Ah. Thanks for the info. Might be a good idea to check for this somehow.

Also we should improve the help section of the Galaxy wrapper :)

hbarsnes commented 3 years ago

> Might be a good idea to check for this somehow.

Yes, I agree. This used to be checked, but the check seems to have been removed in the refactoring. I will see if I can re-add it in the next release.

hbarsnes commented 3 years ago

The check for duplicate accession numbers in FASTA files has now been re-added in SearchGUI v4.0.25.
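
For anyone who wants to pre-check a FASTA file before running a search, a check of this kind can be sketched in a few lines of Python. This is not the SearchGUI implementation. The function names and the parsing rule (second `|`-delimited field) are assumptions based on the description in this thread.

```python
from collections import Counter


def second_field(header: str) -> str:
    # Assumed parsing rule: the accession is the second '|'-delimited field.
    fields = header.lstrip(">").split("|")
    return fields[1] if len(fields) > 1 else fields[0]


def duplicate_accessions(lines, accession_of=second_field):
    """Return the sorted accessions that occur in more than one header line."""
    counts = Counter(
        accession_of(line.strip()) for line in lines if line.startswith(">")
    )
    return sorted(acc for acc, n in counts.items() if n > 1)


fasta = [
    ">A0A1Q3NBR6|unreviewed|Pyruvate",  # header quoted in this issue
    "MKVLAA",
    ">B0B2C3D4E5|unreviewed|Example",   # made-up header of the same shape
    "MSTNPK",
]
print(duplicate_accessions(fasta))
```

A non-empty result means that at least two entries would be stored under the same accession, so the headers should be reformatted before the search.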