Unique id generation ( Key (uniquename) ) for InterProScan import is not sufficient to capture all InterProScan entries

000generic commented 7 years ago

Hi! My import of InterProScan tsv files is failing due to duplicate identifiers generated by TBro. It is very rare, but it happens for at least one transcript in each of two transcriptomes, and it causes the import to fail completely (I think). If I understand the problem correctly, a possible solution might be to include an additional field in the generation of the unique identifier ( Key(uniquename) ).

To provide more details:

When I run:

interproscan# tbro-import annotation_interpro --organism_id 13 --release squid-T1 -i interproscan-5.22-61 tbro-interpro-fasta-split_tbro-transcriptome-4_ALLassemblers-sra-ONLY.okay.aa-0.tsv

thousands of lines from the tsv file are imported but then eventually I get the following before completion:

Error: SQLSTATE[23505]: Unique violation: 7 ERROR: duplicate key value violates unique constraint "feature_c1"

DETAIL: Key (organism_id, uniquename, type_id)=(13, squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY, 45828) already exists.

Type "/bin/tbro-import --help" to get help. Type "/bin/tbro-import --help" to get help on specific command.

When I check, there are two lines in my tsv file that will generate identical unique identifiers ( Key(uniquename) ) when built from the fields that it looks like TBro is using. For example:

squid-T1_squid-T4-transabyss-100bp-kmer44-12760-aa_SSF48371_1729_1752_SUPERFAMILY

is the identifier generated for both:

squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 5.83E-8 T 28-04-2017 IPR016024 Armadillo-type fold GO:000548

vs

squid-T4-transabyss-100bp-kmer44-12760-aa a53f8a0be749118e8a6ef69a1fc2b206 3778 SUPERFAMILY SSF48371 1729 1752 3.54E-5 T 28-04-2017 IPR016024 Armadillo-type fold GO:0005488

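In general, I think building the unique identifier ( Key (uniquename) ) to also include the e-value field (5.83E-8 vs 3.54E-5 in the two lines above, the only field that differs between them apart from the final GO column as pasted) will solve the problem. For now it is easy for me to remove the duplicates, but this is less than ideal, as I am then losing part of the annotation.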
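For anyone who wants to check a file before import, here is a minimal Python sketch that rebuilds the identifier for every line and reports the ones that collide. The key layout (release, protein accession, signature accession, start, stop, analysis, joined by underscores) is only inferred from the error message above, not taken from the TBro source, and the column positions assume the standard InterProScan 5 tsv layout. The function names `uniquename` and `find_collisions` are made up for this example.

```python
import csv
from collections import defaultdict


def uniquename(release, row):
    """Rebuild the identifier as it appears in the error message.

    Assumption: release_protein_signature_start_stop_analysis,
    inferred from the 'feature_c1' violation, not from TBro's code.
    """
    protein, analysis, signature = row[0], row[3], row[4]
    start, stop = row[6], row[7]
    return "_".join([release, protein, signature, start, stop, analysis])


def find_collisions(lines, release):
    """Return {identifier: [rows]} for identifiers built by more than one row."""
    groups = defaultdict(list)
    # QUOTE_NONE: descriptions may contain quote characters that are not CSV quoting.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 9:  # skip blank or truncated lines
            continue
        groups[uniquename(release, row)].append(row)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}
```

Calling `find_collisions(open(path), "squid-T1")` on the tsv file should list the identifier from the error above together with both of its rows, so the conflicting e-values can be inspected before deciding what to drop.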

The failure is for 1 line out of over 500,000, so a very small problem, but if it's not too much trouble, it might be worth solving, if only to make TBro robust to diverse situations. Or, if it makes more sense to have only one of the two lines included for import, it might be helpful to mention in the documentation that users need to identify and remove duplicates from InterProScan tsv files prior to import.

Thank you!

iimog commented 7 years ago

Thanks for reporting this problem. I will have a look at this. I fully agree that this is something that should be handled by TBro and not require manual action by the user. The unique constraint is there to avoid adding the same annotation twice. In this case the annotation is the same, just with different e-values. If multiple entries with different e-values should be treated as distinct annotations, I can extend the unique constraint to include the e-value. Otherwise I can catch this exception and skip lines that are identical except for the e-value. Which solution would you prefer @000generic ?

000generic commented 7 years ago

I think it would make sense to drop the annotations that have larger e-values. Maybe there are specific/limited reasons to keep the larger e-value annotations at times - but the region of sequence under annotation is identical - and the annotation itself is identical - so in general I think keeping only the annotation with the best (smallest) e-value makes sense - and identical annotations of larger e-values can be dropped. Or that's my sense of things for myself and I would guess for many other typical uses.
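Until this is handled inside TBro, the rule described above (same region, same annotation, keep the smallest e-value) could be applied to the tsv file before import. The following is a rough Python sketch under a few assumptions: the standard InterProScan 5 column order, column 9 holding an e-value where smaller is better (for some analyses that column is a score or `-`, which this sketch does not treat specially beyond ranking unparsable values last), and a hypothetical function name `keep_best_evalue`. It is not how TBro itself resolves the issue.

```python
import csv


def keep_best_evalue(lines):
    """Keep one row per (protein, analysis, signature, start, stop),
    choosing the row with the smallest value in column 9."""
    best = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 9:  # skip blank or truncated lines
            continue
        key = (row[0], row[3], row[4], row[6], row[7])
        try:
            evalue = float(row[8])
        except ValueError:
            evalue = float("inf")  # e.g. '-' when no value is reported
        # Strict '<' keeps the first row seen when values tie.
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, row)
    # dicts preserve insertion order, so output follows first appearance of each key
    return [row for _, row in best.values()]
```

The returned rows can be written back out with `csv.writer(handle, delimiter="\t", quoting=csv.QUOTE_NONE, quotechar=None)` (hypothetical usage) and the resulting file passed to `tbro-import` in place of the original.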

TBroTeam / TBro

Unique id generation ( Key (uniquename) ) for InterProScan import is not sufficient to capture all InterProScan entries #51