OpenMS / OpenMS

The codebase of the OpenMS project
https://www.openms.de

Duplicated transition IDs in TargetedExperiment object from PQP source, caused by decoy peptides with same sequence but different gene #5653

Open gureann opened 2 years ago

gureann commented 2 years ago

Hi @hroest ,

I'm using OpenSwath to analyze DIA data acquired on a QEHF, and an error occurred when I ran OSW with a generated pqp file. The error was caused by decoy peptides that ended up with the same sequence although they were derived from different target peptide sequences (which also belonged to different genes). This led to duplicated transition IDs at the pqp reading step, where the gene table is aggregated.

If the input mzML was converted from a Thermo raw file by msconvert without peak picking, no exception is raised and the run just stops during the search, like this:

Thread 3_0 will analyze 12911 compounds and 72038 transitions from SWATH 5 (batch 2 out of 12)
Thread 6_0 will analyze 14914 compounds and 82015 transitions from SWATH 4 (batch 2 out of 14)
Thread 0_0 will analyze 10545 compounds and 55144 transitions from SWATH 2 (batch 3 out of 10)
Thread 2_0 will analyze 14011 compounds and 74796 transitions from SWATH 3 (batch 3 out of 14)
(base) PS F:\>

When the input mzML was converted with peak picking, the error is an invalid ID:

...
Thread 0_0 will analyze 10545 compounds and 55144 transitions from SWATH 2 (batch 3 out of 10)
Thread 4_0 will analyze 7360 compounds and 37058 transitions from SWATH 1 (batch 4 out of 7)

---------------------------------------------------
FATAL: uncaught exception!
---------------------------------------------------
last entry in the exception handler:
exception of type InvalidValue occurred in line 165, function void __cdecl OpenMS::MRMTransitionGroup<class OpenMS::MSChromatogram,struct OpenSwath::LightTransition>::addTransition(const struct OpenSwath::LightTransition &,const class OpenMS::String &) of C:\jenkins\ws\openms\RC\openms_release_packaging\9447518b\source\src\openms\include\OpenMS/KERNEL/MRMTransitionGroup.h
error message: the value '1747343' was used but is not valid; Internal error: Transition with nativeID was already present!
---------------------------------------------------

If I run TargetedFileConverter again, from pqp to tsv, the error is raised correctly, in the checking step after reading the database and generating the TargetedExperiment:

Progress of 'reading PQP file (SQL warmup)':
-- done [took 36.64 s (CPU), 36.82 s (Wall)] --
Progress of 'reading PQP file':
-- done [took 7.72 s (CPU), 7.71 s (Wall)] --
Progress of 'conversion to internal data representation':
-- done [took 5.11 s (CPU), 5.10 s (Wall)] --
Found duplicate transition id (must be unique): 1292421
Error: Unexpected internal error (Invalid input, contains duplicate or invalid references)

The file attached below (extracted from the pqp file) contains all transitions with duplicated IDs after running DecoyGenerator: example_of_same_seq_in_diff_genes.txt

There are two kinds of decoy peptides with the same sequence:

Peptide FVQDLSK belongs to Q91ZJ5;DECOY_P52196, in which DECOY_P52196 has the original protein ID P52196 with the peptide FQLVDSR. The gene names of these two are Ugp2 and Tst.

Peptide YLDLLQK belongs to the protein group DECOY_Q0KK55;DECOY_Q6PHN7. The original sequences are YLLDLLR and YLLQLLR, which differ by one amino acid and belong to Tmem164 and Kndc1 respectively (after shuffling, the proteins are combined but the genes are kept individually).

Currently I directly drop decoy peptides that have the same sequence as a target, as well as decoy peptides whose shared sequence belongs to different genes, while the assay file is still in tsv format before converting to pqp. It works fine now.

Maybe this case is rare, since it needs both genes assigned and identical sequences among randomly generated decoy peptides. I'd suggest an optional parameter to control whether decoys are allowed to be identical to targets, or to just filter them out. A checking step for the pqp file in OSW would also be great, like the one in TargetedFileConverter, to find invalid items before the next step.
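The workaround described above could be sketched roughly as follows in Python. This is only an illustration, not the script that was actually used: the column names `PeptideSequence`, `GeneName` and `Decoy` are assumptions about the tsv header, and the toy rows (including the made-up `AAAGLLK` / `GeneX` entry and the gene assignments) are illustrative only.

```python
from collections import defaultdict

# Column names are assumptions for this sketch; adjust them to your own tsv header.
SEQ, GENE, DECOY = "PeptideSequence", "GeneName", "Decoy"


def drop_ambiguous_decoys(rows):
    """Return rows without decoys that collide with targets or span several genes."""
    target_seqs = {r[SEQ] for r in rows if r[DECOY] == "0"}

    # Collect every gene annotation seen for each decoy sequence.
    decoy_genes = defaultdict(set)
    for r in rows:
        if r[DECOY] == "1":
            decoy_genes[r[SEQ]].add(r[GENE])

    kept = []
    for r in rows:
        if r[DECOY] == "1":
            same_as_target = r[SEQ] in target_seqs
            several_genes = len(decoy_genes[r[SEQ]]) > 1
            if same_as_target or several_genes:
                continue  # drop the ambiguous decoy row
        kept.append(r)
    return kept


# Toy rows loosely modelled on the sequences in this report (illustrative only).
rows = [
    {SEQ: "FVQDLSK", GENE: "Ugp2", DECOY: "0"},
    {SEQ: "FVQDLSK", GENE: "Tst", DECOY: "1"},      # decoy identical to a target
    {SEQ: "YLDLLQK", GENE: "Tmem164", DECOY: "1"},  # same decoy sequence, two genes
    {SEQ: "YLDLLQK", GENE: "Kndc1", DECOY: "1"},
    {SEQ: "AAAGLLK", GENE: "GeneX", DECOY: "1"},    # hypothetical unambiguous decoy
]
kept = drop_ambiguous_decoys(rows)
print([(r[SEQ], r[GENE]) for r in kept])
```

For a real assay file the rows could come from `csv.DictReader(handle, delimiter="\t")`, and the filtered rows could be written back with `csv.DictWriter` before converting to pqp.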

Best regards, Ronghui

shubham1637 commented 2 years ago

Hi, I am seeing similar issues. I do not have any decoy and target peptides that share a sequence, yet I still see this happening. Do you know how to fix it?

gureann commented 2 years ago

Hi @shubham1637 , looking back at this issue again, I think the main problem was that different genes were assigned to one and the same peptide sequence, and the decoys in my first description were only one way to reach it.

This means that if peptideA has GeneI in some rows and GeneII in other rows, this leads to duplicated transition IDs, since the join across the tables in the PQP file also uses the gene column. The different genes are kept, and all other values, which are identical, are repeated.

If you get the same error as in the second code block when running OSW, and the error in the third code block when running TargetedFileConverter, I think you can have a look at the genes (or protein groups?) in your tsv or pqp file.
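The effect of a 1:n peptide-to-gene mapping on a join can be reproduced with a small in-memory SQLite database. The schema below is a toy one invented for this sketch (it is not the real PQP layout, and the query is not the one OpenMS runs); the transition ID is borrowed from the log above and the second ID is made up.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
# Toy schema for illustration only; table and column names are hypothetical.
con.executescript(
    """
    CREATE TABLE peptide (id INTEGER PRIMARY KEY, sequence TEXT);
    CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE peptide_gene (peptide_id INTEGER, gene_id INTEGER);
    CREATE TABLE toy_transition (id INTEGER PRIMARY KEY, peptide_id INTEGER);

    INSERT INTO peptide VALUES (1, 'YLDLLQK');
    INSERT INTO gene VALUES (1, 'Tmem164'), (2, 'Kndc1');
    INSERT INTO peptide_gene VALUES (1, 1), (1, 2);
    INSERT INTO toy_transition VALUES (1747343, 1), (1747344, 1);
    """
)

JOINS = """
    FROM toy_transition AS t
    JOIN peptide AS p ON p.id = t.peptide_id
    JOIN peptide_gene AS pg ON pg.peptide_id = p.id
    JOIN gene AS g ON g.id = pg.gene_id
"""

# Joining through the gene mapping returns each transition once per gene.
rows = con.execute(
    "SELECT t.id, g.name" + JOINS + "ORDER BY t.id, g.id"
).fetchall()
counts = Counter(tid for tid, _ in rows)
duplicated = sorted(tid for tid, n in counts.items() if n > 1)
print(rows)
print("duplicated transition ids:", duplicated)

# One possible way to collapse the rows again: aggregate genes per transition.
collapsed = con.execute(
    "SELECT t.id, GROUP_CONCAT(g.name, ';')" + JOINS + "GROUP BY t.id ORDER BY t.id"
).fetchall()
print(collapsed)
```

The first query yields two rows per transition ID, which is the kind of duplication described above. The `GROUP_CONCAT` variant is just one way to get back to one row per transition in this toy setup, not necessarily how OpenMS handles or should handle it.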

Hope this would be helpful.

Best, Ronghui

shubham1637 commented 2 years ago

You are right. I removed the gene column altogether and it doesn't throw the error anymore. Thanks!

Best, Shubham

gureann commented 2 years ago

For anyone who reaches here,

The main problem in this issue is caused by the assay library file itself, and I think it should be fixed by us users rather than treated as an issue for the developers. So I would like to close this issue.

If you hit the duplicated transition ID error, please check that only one unique gene and one unique protein is assigned to each peptide, rather than two or more different ones appearing in different rows.
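A check along these lines could look like the following Python sketch. It assumes the assay rows have already been read into dictionaries and that the columns are called `PeptideSequence`, `GeneName` and `ProteinId`; these names, and the made-up `AAAGLLK` / `GeneX` / `ProtX` row, are assumptions for illustration.

```python
from collections import defaultdict


def find_multi_mapped(rows, seq_col="PeptideSequence", cols=("GeneName", "ProteinId")):
    """For each annotation column, list peptides that carry more than one value."""
    seen = {c: defaultdict(set) for c in cols}
    for r in rows:
        for c in cols:
            seen[c][r[seq_col]].add(r[c])
    # Keep only peptides with two or more distinct annotations per column.
    return {
        c: {seq: sorted(vals) for seq, vals in seen[c].items() if len(vals) > 1}
        for c in cols
    }


# Toy rows: proteins are combined into one group, genes are kept individually.
rows = [
    {"PeptideSequence": "YLDLLQK", "GeneName": "Tmem164",
     "ProteinId": "DECOY_Q0KK55;DECOY_Q6PHN7"},
    {"PeptideSequence": "YLDLLQK", "GeneName": "Kndc1",
     "ProteinId": "DECOY_Q0KK55;DECOY_Q6PHN7"},
    {"PeptideSequence": "AAAGLLK", "GeneName": "GeneX", "ProteinId": "ProtX"},
]
report = find_multi_mapped(rows)
print(report)
```

Any peptide listed in the report has conflicting annotations across rows and is a candidate for the duplicated transition ID error.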

hroest commented 2 years ago

It seems this issue still persists

hroest commented 2 years ago

1) Part of the issue could come from the SQL select query here: https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/OPENSWATH/TransitionPQPFile.cpp which could lead to duplicated entries when you have 1:n mappings of peptides to proteins / genes. We should address this.

2) We should also address the issue of decoy peptides with the same sequence.