compomics / peptide-shaker

Interpretation of proteomics identification results
http://compomics.github.io/projects/peptide-shaker.html
48 stars 18 forks source link

Attempting to create duplicate peptide key #56

Closed GoogleCodeExporter closed 7 years ago

GoogleCodeExporter commented 9 years ago
> What steps will reproduce the problem?

We loaded 12 result files produced by Searchgui (MS-GF+) into PeptideShaker.
After about 2 hours of analysis, with PeptideShaker at more than 80% of the 
total progress, in the log window it says 'importing data cancelled' and starts 
a cancellation process. The process seems to hang or stop.

The last part of the report file is:
...
...
Wed Nov 12 14:08:54 KST 2014        26085 conflicts resolved. 12743 protein 
groups remaining (140 suspicious).
Wed Nov 12 14:08:54 KST 2014        Correcting protein probabilities.
Wed Nov 12 14:08:54 KST 2014        Saving protein probabilities.
Wed Nov 12 14:08:54 KST 2014        Validating identifications at 1% FDR, 
quality control of matches.
Wed Nov 12 14:09:59 KST 2014        Scoring PTMs in peptides.

Wed Nov 12 14:10:00 KST 2014        Importing Data Canceled!
Wed Nov 12 14:10:00 KST 2014        An error occured while loading the 
identification files:
Wed Nov 12 14:10:00 KST 2014        Attempting to create duplicate peptide key: 
RIHMEKEMEMKK_15.994915-ATAA-4.

> What version of the product are you using? On what operating system?

PeptideShaker 0.35.1
SearchGUI-1.22.2
Windows 7 Professional 64bit (8 GB RAM)
Java 1.8.0_25

> Please provide any additional information below.

> If the reported issue resulted in the tool crashing, please
> also upload the file called PeptideShaker.log (found in the
> PeptideShaker-X.Y.Z\resources folder).

Wed Nov 12 11:46:27 KST 2014: PeptideShaker version 0.35.1.
Memory given to the Java virtual machine: 7456948224.
Total amount of memory in the Java virtual machine: 128974848.
Free memory: 110682088.
Java version: 1.8.0_25.
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Oracle Corporation
java.version: 1.8.0_25
os.name: Windows 7
memory: 30.5/129.0
processors available: 4
useCommandThread: false
<CompomicsError>PeptideShaker processing failed. See the PeptideShaker log for 
details.</CompomicsError>
java.lang.IllegalArgumentException: Attempting to create duplicate peptide key: 
RIHMEKEMEMKK_15.994915-ATAA-4.
    at eu.isas.peptideshaker.ptm.PtmScorer.scorePTMs(PtmScorer.java:823)
    at eu.isas.peptideshaker.ptm.PtmScorer.scorePeptidePtms(PtmScorer.java:1306)
    at eu.isas.peptideshaker.PeptideShaker.processIdentifications(PeptideShaker.java:453)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.importFiles(FileImporter.java:554)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.doInBackground(FileImporter.java:454)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Free memory: 2340545688

Wed Nov 12 14:25:20 KST 2014: PeptideShaker version 0.35.1.
Memory given to the Java virtual machine: 7456948224.
Total amount of memory in the Java virtual machine: 128974848.
Free memory: 110680088.
Java version: 1.8.0_25.
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Oracle Corporation
java.version: 1.8.0_25
os.name: Windows 7
memory: 30.2/129.0
processors available: 4
useCommandThread: false

Original issue reported on code.google.com by Jihye.Sh...@gmail.com on 12 Nov 2014 at 5:29

GoogleCodeExporter commented 9 years ago
Thank you for reporting this issue. We have already made a fix that should 
solve the problem. A new version of PeptideShaker should be available by the 
end of this week.

Original comment by harald.b...@gmail.com on 12 Nov 2014 at 10:30

GoogleCodeExporter commented 9 years ago
A new version which should fix this issue has been released today, can you 
verify that the data loads now?
Feel free to contact us again should you encounter issues again.

Original comment by mvau...@gmail.com on 14 Nov 2014 at 7:24

GoogleCodeExporter commented 9 years ago
Thanks for your response. But, same error occuered while loading the 
identification files

...
...
Mon Nov 17 15:04:42 KST 2014        Validating identifications at 1% FDR, 
quality control of matches.
Mon Nov 17 15:05:16 KST 2014        Scoring PTMs in peptides.

Mon Nov 17 15:05:17 KST 2014        Importing Data Canceled!
Mon Nov 17 15:05:17 KST 2014        An error occurred while loading the 
identification files:
Mon Nov 17 15:05:17 KST 2014        Attempting to create duplicate peptide key: 
RIHMEKEMEMKK_15.994915-ATAA-4.
...

Mon Nov 17 14:14:53 KST 2014: PeptideShaker version 0.35.2.
Memory given to the Java virtual machine: 7456948224.
Total amount of memory in the Java virtual machine: 128974848.
Free memory: 110674712.
Java version: 1.8.0_25.
jshortcut.dll loaded via tmp generated pathname: 
C:\Users\JJUN\AppData\Local\Temp\jshortcut-3361852055548497046.dll
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Oracle Corporation
java.version: 1.8.0_25
os.name: Windows 7
memory: 37.1/129.0
processors available: 4
useCommandThread: false
<CompomicsError>PeptideShaker processing failed. See the PeptideShaker log for 
details.</CompomicsError>
java.lang.IllegalArgumentException: Attempting to create duplicate peptide key: 
RIHMEKEMEMKK_15.994915-ATAA-4.
    at eu.isas.peptideshaker.ptm.PtmScorer.scorePTMs(PtmScorer.java:842)
    at eu.isas.peptideshaker.ptm.PtmScorer.scorePeptidePtms(PtmScorer.java:1327)
    at eu.isas.peptideshaker.PeptideShaker.processIdentifications(PeptideShaker.java:453)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.importFiles(FileImporter.java:540)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.doInBackground(FileImporter.java:440)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Free memory: 2527906024

.....

please one more check..
Thank you, sir.

Original comment by Jihye.Sh...@gmail.com on 17 Nov 2014 at 7:38

GoogleCodeExporter commented 9 years ago
Hi and sorry that the problem is not solved as it should. Can you make the data 
available to us so that we look into what went wrong?

Original comment by mvau...@gmail.com on 17 Nov 2014 at 10:48

GoogleCodeExporter commented 9 years ago

Original comment by mvau...@gmail.com on 17 Nov 2014 at 10:49

GoogleCodeExporter commented 9 years ago
Hi again,

I don't think that we received your data to test? Can you verify that the 
problem still occurs on the latest version and show us how to reproduce the bug 
if it happens again?

Thank you,

Marc

Original comment by mvau...@gmail.com on 23 Jan 2015 at 2:57

GoogleCodeExporter commented 9 years ago
Hi,

The latest version should fix this issue. Please contact us again if it is not 
the case.

Best regards,

Marc

Original comment by mvau...@gmail.com on 5 Jun 2015 at 5:30

jflucier commented 8 years ago

Hello,

we obtain this exact same error with the latest release of peptideshaker (PeptideShaker-1.13.4) when analysing some mzXML files. Here is the full peptideshaker trace:

Path configuration completed. Wed Aug 31 13:16:42 EDT 2016 Unzipping searchgui_out.zip. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Wed Aug 31 13:17:03 EDT 2016 Import process for TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09 (Sample: sample, Replicate: 1)

Wed Aug 31 13:17:03 EDT 2016 Importing sequences from searchgui_db_concatenated_target_decoy.fasta. 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:19:50 EDT 2016 FASTA file import completed. Wed Aug 31 13:19:50 EDT 2016 Importing gene mappings. Wed Aug 31 13:20:12 EDT 2016 Establishing local database connection. Wed Aug 31 13:20:18 EDT 2016 Reading identification files. Wed Aug 31 13:20:18 EDT 2016 Parsing TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.comet.pep.xml. Wed Aug 31 13:20:26 EDT 2016 Loading spectra for TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.comet.pep.xml. Wed Aug 31 13:20:26 EDT 2016 Importing TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.mgf Wed Aug 31 13:20:27 EDT 2016 TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.mgf imported. 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:20:27 EDT 2016 Collecting peptides to map. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:28 EDT 2016 Mapping peptides to proteins. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:28 EDT 2016 Importing PSMs from TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.comet.pep.xml 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:20:33 EDT 2016 1545 identified spectra (6.9%) did not present a valid peptide. Wed Aug 31 13:20:33 EDT 2016 63701 of the best scoring peptides were excluded by the import filters: Wed Aug 31 13:20:33 EDT 2016 - 49.9% peptide length less than 8 or greater than 30. Wed Aug 31 13:20:33 EDT 2016 - 41.1% peptide presenting high mass or isotopic deviation. Wed Aug 31 13:20:33 EDT 2016 - 8.9% unrecognized modifications. Wed Aug 31 13:20:33 EDT 2016 Parsing TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.msgf.mzid. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:48 EDT 2016 Loading spectra for TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.msgf.mzid. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:48 EDT 2016 Collecting peptides to map. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:49 EDT 2016 Mapping peptides to proteins. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:20:49 EDT 2016 Importing PSMs from TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.msgf.mzid 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:20:50 EDT 2016 126 identified spectra (0.6%) did not present a valid peptide. Wed Aug 31 13:20:50 EDT 2016 126 of the best scoring peptides were excluded by the import filters: Wed Aug 31 13:20:50 EDT 2016 - 68.3% peptide mapping to both target and decoy. Wed Aug 31 13:20:50 EDT 2016 - 30.2% peptide presenting high mass or isotopic deviation. Wed Aug 31 13:20:50 EDT 2016 - 1.6% unrecognized modifications. Wed Aug 31 13:20:50 EDT 2016 Parsing TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.omx. Wed Aug 31 13:21:34 EDT 2016 Loading spectra for TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.omx. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:21:34 EDT 2016 Collecting peptides to map. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:21:35 EDT 2016 Mapping peptides to proteins. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:21:35 EDT 2016 Importing PSMs from TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.omx 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:21:44 EDT 2016 470 identified spectra (2.6%) did not present a valid peptide. Wed Aug 31 13:21:44 EDT 2016 38566 of the best scoring peptides were excluded by the import filters: Wed Aug 31 13:21:44 EDT 2016 - 89.7% peptide length less than 8 or greater than 30. Wed Aug 31 13:21:44 EDT 2016 - 10.3% peptide presenting high mass or isotopic deviation. Wed Aug 31 13:21:44 EDT 2016 Parsing TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.t.xml. Wed Aug 31 13:22:05 EDT 2016 Loading spectra for TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.t.xml. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:22:05 EDT 2016 Collecting peptides to map. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:22:05 EDT 2016 Mapping peptides to proteins. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:22:05 EDT 2016 Importing PSMs from TCGA_AR-A1AW_AR-A1AV_C8-A135_117C_P_BI_20130702_H-PM_f09.t.xml 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:22:05 EDT 2016 3175 identified spectra (19.1%) did not present a valid peptide. Wed Aug 31 13:22:05 EDT 2016 5069 of the best scoring peptides were excluded by the import filters: Wed Aug 31 13:22:05 EDT 2016 - 35.2% peptide length less than 8 or greater than 30. Wed Aug 31 13:22:05 EDT 2016 - 58.6% peptide presenting high mass or isotopic deviation. Wed Aug 31 13:22:05 EDT 2016 - 5.8% unrecognized modifications. Wed Aug 31 13:22:05 EDT 2016 File import completed. 78559 first hits imported (439040 secondary) from 22926 spectra. Wed Aug 31 13:22:05 EDT 2016 [69029 first hits passed the initial filtering] Wed Aug 31 13:22:05 EDT 2016 Computing assumptions probabilities. 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 110% 120% 130% 140% 150% 160% 170% 180% 190% 200% Wed Aug 31 13:22:06 EDT 2016 Saving assumptions probabilities. 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:22:10 EDT 2016 Selecting best peptide per spectrum. 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:22:38 EDT 2016 Computing PSM probabilities. 10% 20% 30% 40% 50% Wed Aug 31 13:22:38 EDT 2016 Scoring PTMs in PSMs (D-score and PhosphoRS) Wed Aug 31 13:22:38 EDT 2016 Scoring PSM PTMs. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:23:43 EDT 2016 Resolving peptide inference issues. Wed Aug 31 13:23:43 EDT 2016 Peptide Inference. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:24:05 EDT 2016 Saving probabilities, building peptides and proteins. Wed Aug 31 13:24:05 EDT 2016 Attaching Spectrum Probabilities - Building Peptides and Proteins. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:24:07 EDT 2016 Simplifying protein groups. Wed Aug 31 13:24:07 EDT 2016 Symplifying Protein Groups. Please Wait... 10% 20% 30% 40% Wed Aug 31 13:24:08 EDT 2016 Removing Mapping Artifacts. Please Wait... Wed Aug 31 13:24:08 EDT 2016 76 unlikely protein mappings found: Wed Aug 31 13:24:08 EDT 2016 - 102 protein groups supported by non-enzymatic shared peptides. Wed Aug 31 13:24:08 EDT 2016 - 43 groups explained by a simpler group. Wed Aug 31 13:24:08 EDT 2016 Note: a group can present combinations of these criteria. Wed Aug 31 13:24:08 EDT 2016 Generating peptide map. Wed Aug 31 13:24:08 EDT 2016 Filling Peptide Maps. Please Wait... 10% 20% 30% 40% Wed Aug 31 13:24:08 EDT 2016 Computing peptide probabilities. Wed Aug 31 13:24:08 EDT 2016 Estimating Probabilities. Please Wait... 10% 20% 30% Wed Aug 31 13:24:08 EDT 2016 Saving peptide probabilities. Wed Aug 31 13:24:08 EDT 2016 Attaching Peptide Probabilities. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:24:08 EDT 2016 Generating protein map. Wed Aug 31 13:24:08 EDT 2016 Filling Protein Map. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:24:09 EDT 2016 Resolving protein inference issues, inferring peptide and protein PI status. Wed Aug 31 13:24:09 EDT 2016 Simplifying Redundant Protein Groups. Please Wait... 10% Wed Aug 31 13:24:36 EDT 2016 Inferring PI status, sorting proteins. Please Wait... 20% 30% 40% 50% 60% Wed Aug 31 13:24:40 EDT 2016 6602 conflicts resolved. 5743 protein groups remaining (0 suspicious). Wed Aug 31 13:24:40 EDT 2016 Correcting protein probabilities. Wed Aug 31 13:24:40 EDT 2016 Estimating Probabilities. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:24:40 EDT 2016 Saving protein probabilities. Wed Aug 31 13:24:40 EDT 2016 Attaching Protein Probabilities. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% Wed Aug 31 13:24:41 EDT 2016 Validating identifications, quality control of matches. Wed Aug 31 13:24:41 EDT 2016 Finding FDR Thresholds. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 100% 110% 120% 130% 140% Wed Aug 31 13:24:41 EDT 2016 Match Validation and Quality Control. Please Wait... 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Wed Aug 31 13:25:12 EDT 2016 Scoring PTMs in peptides. Wed Aug 31 13:25:12 EDT 2016 Scoring Peptide PTMs. Please Wait...

PeptideShaker processing failed. See the PeptideShaker log for details.

Wed Aug 31 13:25:14 EDT 2016 An error occurred while loading the identification files: Wed Aug 31 13:25:14 EDT 2016 Attempting to create duplicate peptide key: YHGHSMSDPGVSYR_15.99491461956-ATAA-6_79.96633052074999_79.96633052074999-ATAA-5 from peptide YHGHSMSDPGVSYR_15.99491461956-ATAA-6_79.96633052074999-ATAA-12_79.96633052074999-ATAA-5. java.lang.IllegalArgumentException: Attempting to create duplicate peptide key: YHGHSMSDPGVSYR_15.99491461956-ATAA-6_79.96633052074999_79.96633052074999-ATAA-5 from peptide YHGHSMSDPGVSYR_15.99491461956-ATAA-6_79.96633052074999-ATAA-12_79.96633052074999-ATAA-5. at eu.isas.peptideshaker.ptm.PtmScorer.scorePTMs(PtmScorer.java:974) at eu.isas.peptideshaker.ptm.PtmScorer.scorePeptidePtms(PtmScorer.java:1482) at eu.isas.peptideshaker.PeptideShaker.processIdentifications(PeptideShaker.java:496) at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.importFiles(FileImporter.java:559) at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:158) at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:234) at eu.isas.peptideshaker.cmd.PeptideShakerCLI.createProject(PeptideShakerCLI.java:732) at eu.isas.peptideshaker.cmd.PeptideShakerCLI.call(PeptideShakerCLI.java:220) at eu.isas.peptideshaker.cmd.PeptideShakerCLI.main(PeptideShakerCLI.java:945) Free memory: 8933135512 Wed Aug 31 13:25:14 EDT 2016 PeptideShaker Processing Canceled.

PeptideShaker processing canceled. See the PeptideShaker log for details.

If you require additionnal information, do not hesitate to contact me. We have developped a debug mode that enables us to keep all temporary files generated by our compomics tools pipeline if it fails on a compute node. We can also send you those files if required. Thanks again for you help,

JF

chambm commented 7 years ago

+1, I see this too with MS-GF+ searches of a RefSeq database. Default parameters. SearchGUI 3.1.4, PeptideShaker 1.13.6.

mvaudel commented 7 years ago

We are working on it, it is a non trivial threading issue. Would it be possible to share the SearchGUI output, mgf, and fasta files with us?

Thanks!

Marc

chambm commented 7 years ago

Unfortunately this data isn't published, but I can try to recreate it with some published data. This occurred on a big Jetstream box with 44 cores, so I can see how that could dredge up some hidden race condition. If I force it to run PeptideShaker with 1 thread, will it always succeed?

mvaudel commented 7 years ago

Somehow it seems to happen with one thread as well for some users, which really puzzles me, but I have issues finding a dataset where I can reproduce it easily... I can test with 44 cores as well if you have a dataset where this happens reproducibly. Note that if you share data with us they will be handled confidentially and I will not look at the results :)

hbarsnes commented 7 years ago

We've just released new versions of both SearchGUI (v3.2.6) and PeptideShaker (v1.15.0). Would be great if you could see if the issue has been resolved in the new versions?

chambm commented 7 years ago

I'm still getting an error running with 1.15. Since 1.15 isn't on Galaxy ToolShed yet, I ran it on my local Windows box. I didn't try changing SearchGUI version, but I've tried with 4 different search engines (MS-GF+, MyriMatch, X!Tandem, Comet). Actually I get a different error with MyriMatch on Galaxy:

Sat Feb 04 13:04:20 CST 2017 An error occurred while loading the identification files: Sat Feb 04 13:04:20 CST 2017 Index: 0, Size: 0 Sat Feb 04 13:04:21 CST 2017 PeptideShaker Processing Canceled.

PeptideShaker processing canceled. See the PeptideShaker log for details.

And also a different error when I processed this zip locally:

Mon Feb 06 11:21:17 CST 2017 An error occurred: Log Record has been sent to the stream, but it cannot be applied to the store (Object null). This may cause recovery problems also. Mon Feb 06 11:21:17 CST 2017 Please contact the developers (https://github.com/compomics/peptide-shaker/issues).

Do you have a place I can dump the searchgui file?

mvaudel commented 7 years ago

Hi Matt,

I have been able to reproduce the error when running with low memory and many processors on the dataset of Gerben (thanks @gerbenm !). It seems that our back-end database does not cope well with the flow of queries. I had a hard time debugging this one but think I am close to a fix. If you can wait until tomorrow we will release new versions :)

Apologies for the inconvenience,

Marc

hbarsnes commented 7 years ago

Hi Matt,

We've just released PeptideShaker v1.15.1 which should hopefully fix the issues. Please let is know if this is not the case on your datasets?

Best regards, Harald

chambm commented 7 years ago

I downloaded it to my desktop to try it (on the data I mentioned earlier). It seemed to get past the "Object null" error and, after a long time spent on probability calculation, map generation, and inference issues, it canceled the import mysteriously with no error: https://gist.github.com/chambm/f9803bffe171bf6d93c9f8e6f8d620f6

Why are those last phases so slow? When I import these pepXML files with IDPicker, it takes about 5 minutes to do filtering and assembly. :/

mvaudel commented 7 years ago

Hi Matt,

Can you send us the log of the tool? You can find it via the Help > Bug Report menu.

It seems that the tool is taking a lot of time saving things to the back-end database (all map generation and saving steps). This can be explained by low RAM and IO limitations. Can you go to the Java Settings menu (under Edit in the main interface or Settings in the start-up dialog) and make sure that the total amount of RAM available on your computer is available to the tool? Also, make sure that you have your PeptideShaker folder as well as all your files locally (no remote folder and no synced folder) and preferentially on SSD discs. Beside this the only things that seem to take time on your file is the scoring of PTM localization and the alignment of non confidently localized PTMs (Peptide Inference).

I am sure that if PeptideShaker would do only filtering and assembly it would also take five minutes, but we need to keep track of much more information, and it seems to saturate your RAM. The good news is that we are refactoring our back-end to be more efficient - it will take a bit of time before we can release this though :)

Hope this helps,

Marc

chambm commented 7 years ago

I upgraded to a new workstation yesterday; old one had 16gb RAM, new one has 32gb. I set PS' Java limit to 24gb (when I set it to 32gb, it seemed to hang indefinitely with a marquee progress bar after importing my SearchGUI zip file). This 16 fraction sample may still be pushing that to the limit: image

What is up with the GPU usage, btw? For pretty much the whole import, Java has been pegging my Quadro 310 to 100%. Are you doing some GPU accelerated analysis here? :)

On the old workstation I was reading off an old SSD, the new one it's reading off a HDD. New one has a different processor, but they shouldn't be too different in performance; bit of a downgrade actually: http://cpuboss.com/cpus/Intel-Xeon-E5-2630-vs-Intel-Core-i7-2700K :(

So taking these differences into account, many of the steps improved a lot but it slowed down at "Resolving protein inference issues" (which I consider part of IDPicker's 'assembly and filtering'):

Thu Feb 09 11:11:10 CST 2017 File import completed. 305305 first hits imported (2978749 secondary) from 315687 spectra. Thu Feb 09 11:11:10 CST 2017 [291016 first hits passed the initial filtering] Thu Feb 09 11:11:10 CST 2017 Computing assumptions probabilities. Thu Feb 09 11:11:11 CST 2017 Saving assumptions probabilities. Thu Feb 09 11:11:23 CST 2017 Selecting best peptide per spectrum. Thu Feb 09 11:11:36 CST 2017 Computing PSM probabilities. Thu Feb 09 11:11:36 CST 2017 Scoring PTMs in PSMs (D-score and PhosphoRS) Thu Feb 09 11:12:25 CST 2017 Resolving peptide inference issues. Thu Feb 09 11:15:07 CST 2017 Saving probabilities, building peptides and proteins. Thu Feb 09 11:15:11 CST 2017 Simplifying protein groups. Thu Feb 09 11:15:32 CST 2017 941 unlikely protein mappings found: Thu Feb 09 11:15:32 CST 2017 - 1647 protein groups supported by non-enzymatic shared peptides. Thu Feb 09 11:15:32 CST 2017 - 398 groups explained by a simpler group. Thu Feb 09 11:15:32 CST 2017 Note: a group can present combinations of these criteria. Thu Feb 09 11:15:32 CST 2017 Generating peptide map. Thu Feb 09 11:15:34 CST 2017 Computing peptide probabilities. Thu Feb 09 11:15:34 CST 2017 Saving peptide probabilities. Thu Feb 09 11:15:35 CST 2017 Generating protein map. Thu Feb 09 11:15:36 CST 2017 Resolving protein inference issues, inferring peptide and protein PI status. Thu Feb 09 11:45:29 CST 2017 40653 conflicts resolved. 28756 protein groups remaining (36 suspicious). Thu Feb 09 11:45:29 CST 2017 Correcting protein probabilities. Thu Feb 09 11:45:29 CST 2017 Saving protein probabilities. Thu Feb 09 11:45:31 CST 2017 Validating identifications at 1% FDR, quality control of matches. Thu Feb 09 11:46:30 CST 2017 Scoring PTMs in peptides. (30% through this step after 35 minutes)

If this run fails again I'll come back with the log. I may have deleted the previous log after posting that gist transcript.

mvaudel commented 7 years ago

Hi again,

Great to read that it is still running! I am not sure why it takes so much memory on this dataset. I seems to be only ~3 million peptides and I use to run such analyses on my laptop with much less memory. It would be really great if you could share the data with us at a later stage so that I can profile the application :)

The Resolving protein inference issues part inspects whether some protein ambiguity groups can be explained by simpler ones. It also checks the similarity of the protein description of the proteins in each group, and whether they seem to come from related genes. From there it infers whether the group is formed of related proteins or not in order to make it easier for the user to find groups of isoforms. Based on this, a consensus leading protein is selected for the group. It also annotates which peptides come from which group and whether the peptides belong to groups of related proteins or not. While we are iterating protein groups, we also gather information on the number of peptides, PSMs, molecular weight, and number of proteins per group. We also use these to create a list of sorted proteins used later on. This is why it is taking a bit of time :)

Hope this helps,

Marc

chambm commented 7 years ago

Yay, it finished! Scoring PTMs took 2 hours, but no duplicate error! Here's the entire log from Bug Reports.

Thanks for the explanation about inference issues. It sounds similar to what IDPicker's doing with the exception of comparing descriptions to group isoforms (it instead relies on protein ambiguity grouping to do that, and maps to the gene to allow grouping proteins by gene if the user wants).

Could the performance of inference be degraded because I'm using a custom "generic|" database which includes the RNA-Seq RPKM in the description? generic|NP_000963|919.6986|NM_000972|RPL7A|60S ribosomal protein L7a. PS seems to parse the protein accession out just fine though.

I was surprised when you said 3 million peptides, so I looked again at the numbers. It's 300k first hits and 3 million 'secondary' hits. Because SearchGUI defaults to the top 10 hits (!) for Comet. Does PeptideShaker process (and somehow use) all those secondary hits? If so, that accounts for most of the difference in memory and time between PS and IDPicker. IDPicker deals only with the top hits for pretty much everything; it only shows secondary hits when looking at detailed results for an individual spectrum.

I'll email you a link to the SearchGUI file.

mvaudel commented 7 years ago

Hi again,

Great news! Maybe the time spent in protein inference is due to the complexity of the groups, the description should not make too much of a difference :)

Yes, secondary hits are used to merge the results from different search engines and to score PTMs. 10 hits is a good start then, ideally you need hits until reaching a PEP of 1.

Good to hear that it worked out - hopefully it will be easier to improve the performance than fixing the back-end problems :)

Best regards,

Marc

mvaudel commented 7 years ago

Hi Matt,

I have been running the dataset on my side as well and it worked fine. The entire processing took 34 minutes using 20 GB of memory on my university laptop which does not compete with the specs of your server. I am not sure what could explain that things are slower on your side. We will release a new version later today, it would be great if you could test it. The tool needs to read and write a lot in the folder where the jar file is located (unless you changed the location of its resource folder). Can it be that there is some limitation there?

On these 34 min ~26 were spent in the protein inference step confirming what you observed. It is the first time that I see such a strange behaviour. I profiled the process and ~90% of the time was spent reading/writing to the back-end database. I will see if I can find a way to make this more efficient.

Thanks again for sharing the files,

Marc

chambm commented 7 years ago

With 1.16 I got about what you did, ~35 minutes and most of it in protein inference. But for most of that time, my process wasn't reading/writing from/to disk, it was just churning CPU. Thanks for investigating!

BTW, I suggest including a "processing finished at XXX" line in the PS log.

mvaudel commented 7 years ago

Hi Matt,

Yes, I saw the same thing on my side, a lot of time spent in the database not doing anything. I will see if I can get this to be more efficient.

Thanks for the tip, will add this line to the log :)

Will keep you posted,

Marc

xthua commented 7 years ago

hi, I met the same problem "Attempting to create duplicate peptide key". Is it solved via update the server?

jflucier commented 7 years ago

Hello,

we still have this problem. The method we used is to simply rerun sample analysis since this bug seems to be random and not so frequent. Out of the 25357 mzxml files analysed on our cluster, only 386 failed (not only du to this error but all errors). So 1.5% failure rate. Best stats i can popup at the moment.

On a side note, our main problem with failure tracking is that some errors are "silent". SearchGUI & PeptideShaker doesn't necesseraly return -1 when the program fails so its execute to the end and generates an empty report so its hard to control error processing. This bug was corrected previously and it reappeared. Our installation uses SearchGUI-3.1.0 and PeptideShaker-1.13.4

JF

On Wed, Sep 6, 2017 at 10:13 AM, xthua notifications@github.com wrote:

hi, I met the same problem "Attempting to create duplicate peptide key". Is it solved via update the server?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/compomics/peptide-shaker/issues/56#issuecomment-327496383, or mute the thread https://github.com/notifications/unsubscribe-auth/ADV9oDpGAmlM92-hbdWXAQJJ038j_mFTks5sfqibgaJpZM4KN43h .

xthua commented 7 years ago

Thanks for the reply. I would try the two softwares with the version you used.

emassign commented 6 years ago

Hi everyone, I have recently started using SearchGUI and Peptide Shaker and, even with the newest version of peptide Shaker (1.16.27) and the latest java release (1.8), I still have this problem with duplicated peptide keys.

Thu Aug 16 11:19:38 CEST 2018 Import process for Histones (Sample: Tryp_Urea, Replicate: 1)

Thu Aug 16 11:19:38 CEST 2018 Importing sequences from HUMAN_HISTONES_1502_All_concatenated_target_decoy.fasta. Thu Aug 16 11:19:40 CEST 2018 FASTA file import completed. Thu Aug 16 11:19:40 CEST 2018 Importing gene mappings. Thu Aug 16 11:19:40 CEST 2018 Establishing local database connection. Thu Aug 16 11:19:41 CEST 2018 Reading identification files. Thu Aug 16 11:19:41 CEST 2018 Parsing DM_hmSILAC_Test_Tryp_Ambic.msgf.mzid. Thu Aug 16 11:19:49 CEST 2018 Loading spectra for DM_hmSILAC_Test_Tryp_Ambic.msgf.mzid. Thu Aug 16 11:19:49 CEST 2018 Importing DM_hmSILAC_Test_Tryp_Ambic.mgf Thu Aug 16 11:19:49 CEST 2018 DM_hmSILAC_Test_Tryp_Ambic.mgf imported. Thu Aug 16 11:19:49 CEST 2018 Collecting peptides to map. Thu Aug 16 11:19:49 CEST 2018 Mapping peptides to proteins. Thu Aug 16 11:19:50 CEST 2018 Importing PSMs from DM_hmSILAC_Test_Tryp_Ambic.msgf.mzid Thu Aug 16 11:19:55 CEST 2018 35 identified spectra (0.1%) did not present a valid peptide. Thu Aug 16 11:19:55 CEST 2018 215 of the best scoring peptides were excluded by the import filters: Thu Aug 16 11:19:55 CEST 2018 - 21.9% peptide mapping to both target and decoy. Thu Aug 16 11:19:55 CEST 2018 - 78.1% peptide presenting high mass or isotopic deviation. Thu Aug 16 11:19:55 CEST 2018 File import completed. 26558 first hits imported (214938 secondary) from 43760 spectra. Thu Aug 16 11:19:55 CEST 2018 [26519 first hits passed the initial filtering] Thu Aug 16 11:19:55 CEST 2018 Computing assumptions probabilities. Thu Aug 16 11:19:55 CEST 2018 Saving assumptions probabilities. Thu Aug 16 11:19:56 CEST 2018 Selecting best peptide per spectrum. Thu Aug 16 11:19:58 CEST 2018 Computing PSM probabilities. Thu Aug 16 11:19:58 CEST 2018 Scoring PTMs in PSMs (D-score and PhosphoRS) Thu Aug 16 11:20:36 CEST 2018 Resolving peptide inference issues. Thu Aug 16 11:20:38 CEST 2018 Saving probabilities, building peptides and proteins. Thu Aug 16 11:20:39 CEST 2018 Simplifying protein groups. Thu Aug 16 11:20:39 CEST 2018 30 unlikely protein mappings found: Thu Aug 16 11:20:39 CEST 2018 - 13 protein groups supported by non-enzymatic shared peptides. Thu Aug 16 11:20:39 CEST 2018 - 6 protein groups explained by peptides shared to less confident mappings. Thu Aug 16 11:20:39 CEST 2018 - 13 groups explained by a simpler group. Thu Aug 16 11:20:39 CEST 2018 Note: a group can present combinations of these criteria. Thu Aug 16 11:20:39 CEST 2018 Generating peptide map. Thu Aug 16 11:20:39 CEST 2018 Computing peptide probabilities. Thu Aug 16 11:20:39 CEST 2018 Saving peptide probabilities. Thu Aug 16 11:20:40 CEST 2018 Generating protein map. Thu Aug 16 11:20:40 CEST 2018 Resolving protein inference issues, inferring peptide and protein PI status. Thu Aug 16 11:20:43 CEST 2018 219 conflicts resolved. 44 protein groups remaining (0 suspicious). Thu Aug 16 11:20:43 CEST 2018 Correcting protein probabilities. Thu Aug 16 11:20:43 CEST 2018 Saving protein probabilities. Thu Aug 16 11:20:43 CEST 2018 Validating identifications at 1% FDR, quality control of matches. Thu Aug 16 11:20:44 CEST 2018 Scoring PTMs in peptides.

Thu Aug 16 11:20:44 CEST 2018 Importing Data Canceled! Thu Aug 16 11:20:44 CEST 2018 An error occurred while loading the identification files: Thu Aug 16 11:20:44 CEST 2018 Attempting to create duplicate peptide key: KTEESSKNEEK_18.037835139130003-ATAA-9 from peptide KTEESSKNEEK_14.01565006414-ATAA-9_18.037835139130003-ATAA-9.

Is there a way to fix this?

Thank you in advance. Enrico

hbarsnes commented 6 years ago

Hi Enrico,

I'm afraid this is a known error that has also been reported by other users. The good news is that we are in the process of refactoring our code to remove this and other related issues (plus speeding up the processing).

The bad news is that I cannot at this time tell you when the new version will become available. We are getting close though. But we will let you know as soon as the new version is ready for beta testing.

Best regards, Harald