ConcurrentModificationException when loading Mascot DAT files

elrubio commented 9 months ago

When trying to import Mascot results (without automatic decoy search) from Mascot DAT files, a java.util.ConcurrentModificationException is thrown consistently for all files.

Setting the number cores to 1 doesn't help.

Mascot version: 2.8.2

I can provide a sample file upon request.

Wed Sep 13 13:16:28 CEST 2023: PeptideShaker version 2.2.25.
Memory given to the Java virtual machine: 20971520000.
Total amount of memory in the Java virtual machine: 134217728.
Free memory: 89919864.
Java version: 20.0.2.
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Amazon.com Inc.
java.version: 20.0.2
os.name: Windows 11
memory: 53.5/167.8
processors available: 12
useCommandThread: false
java.util.ConcurrentModificationException
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1631)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
    at java.base/java.util.TreeMap$ValueSpliterator.forEachRemaining(TreeMap.java:3215)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
    at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1787)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
    at com.compomics.util.experiment.io.identification.idfilereaders.MascotIdfileReader.getAllSpectrumMatches(MascotIdfileReader.java:206)
    at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:474)
    at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
    at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:224)
    at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
    at java.base/java.lang.Thread.run(Thread.java:1623)

hbarsnes commented 9 months ago

Can you have a go at this beta version of PeptideShaker to see it it solves the problem?

https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/3.0.0-beta/PeptideShaker-3.0.0-beta.zip

elrubio commented 9 months ago

Thanks for your reply. Unfortunately, the error is also present in the beta version.

hbarsnes commented 9 months ago

If you can share one of the files (plus the corresponding FASTA and spectrum files) I can try to reproduce the issue on my end and see if I can find the source of the issue.

elrubio commented 9 months ago

I am no Java expert, but I am quite confident that I identified the responsible code. As visible from the stack trace, it's actually located in the compomics.util library, so you could move the issue over there.

In MascotIdfileReader.getAllSpectrumMatches, while iterating over PeptideAssumption objects, the underlying collection is modified, resulting in a concurrency error (immediately and reliably, at least in my case). The only condition is that a peptide sequence contains non-standard amino acids, like X.

I uploaded my files anyways, just in case: Download

hbarsnes commented 9 months ago

I've fixed the issue and updated the beta version: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/3.0.0-beta/PeptideShaker-3.0.0-beta.zip

However, I came across a second issue which is the use of the "+" in the name of the mgf file. This causes problems later on when parsing the spectrum titles (as they include the file name). Would it be possible for you to rename the mgf files to remove the "+" (and redo the Mascot search)?

elrubio commented 9 months ago

Thanks for the quick fix for the concurrency issue. After pushing my .dat and .mgf files through sed to remove any plus characters in the spectrum titles, they are now correctly parsed. All looks good now. 👍

But since "+" is a perfectly valid character for both, file names and to (my knowledge) spectrum titles in mgf files, I actually consider this as a bug.

hbarsnes commented 9 months ago

But since "+" is a perfectly valid character for both, file names and to (my knowledge) spectrum titles in mgf files, I actually consider this as a bug.

I agree that it is not optimal, but I think we added this decoding of the spectrum titles (from utf-8) as there were some cases where other Unicode characters were encoded in the titles which caused issues for some of the search engines. However, the "+" is sort of a special case that ideally ought to be ignored, i.e. not decoded to space, but not sure how easy that would be or whether this would be a solution that potentially results in other problems.

Anyway, great that you were able to work around this final issue.

elrubio commented 9 months ago

Thank you for your feedback.

Unfortunately we have many existing MGF files. All of them potentially contain "+" characters in the TITLE line, many of them actually do. The current behaviour requires me to copy and pre-process all data files before using them in PeptideShaker. A permanent change would break linkage to existing ident file.

Considering the fact that url-endoded titles should be rather an exception than the rule, I opened this as an issue in the underlying library: https://github.com/compomics/compomics-utilities/issues/61. Maybe decoding could be made optional?

compomics / peptide-shaker

ConcurrentModificationException when loading Mascot DAT files #524