Open LuisFF opened 2 years ago
Hi @LuisFF, thanks for reporting this. I'll need to think for a bit since it's been a while since I wrote the code :)
I would be careful about changing the logic of `get_proteins_by_id`: it's used in multiple places, and I'm not sure it's safe to only be able to access the feature by its first protein ID, as proposed in your PR. Currently, it allows mapping from any of the potential protein IDs.
If I understand it correctly, this could also be solved by removing the `protein_id` qualifier when entering the `unique_protein_id`, correct? Here:
Hi @prihoda, thanks for looking into this.
Yes, removing the `protein_id` qualifier from subsequent `CDS` features with that same qualifier value is another way of solving this issue. That avoids any side effects I may have overlooked in #79.
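For readers following along, the approach discussed here could look roughly like the sketch below. It is not the DeepBGC code: the function name `deduplicate_protein_ids`, the plain-dict qualifiers, and the `_2`-style suffix are all assumptions made for illustration.

```python
def deduplicate_protein_ids(cds_qualifiers):
    """Give repeated protein_id values a unique_protein_id and drop the duplicate.

    `cds_qualifiers` is a list of qualifier dicts, one per CDS feature
    (a hypothetical stand-in for Biopython's SeqFeature.qualifiers).
    """
    seen = {}
    for qualifiers in cds_qualifiers:
        ids = qualifiers.get("protein_id")
        if not ids:
            continue
        protein_id = ids[0]
        seen[protein_id] = seen.get(protein_id, 0) + 1
        if seen[protein_id] > 1:
            # Suffix format is an assumption, not necessarily what DeepBGC uses.
            qualifiers["unique_protein_id"] = [f"{protein_id}_{seen[protein_id]}"]
            # Removing the shared protein_id means this feature can no longer
            # shadow the first feature in an ID-to-feature lookup.
            del qualifiers["protein_id"]
    return cds_qualifiers


sample = [{"protein_id": ["ABC123"]}, {"protein_id": ["ABC123"]}]
deduplicated = deduplicate_protein_ids(sample)
```

With this in place, the first feature keeps `protein_id`, while the second is reachable only through its `unique_protein_id`.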
Great, would you be up to implementing that behavior in your PR? @LuisFF
Hi @prihoda , I pushed the suggested changes.
Please let me know if there's anything else needed.
First of all thanks for developing DeepBGC and making it available to the community.
I came across a bug in `HmmscanPfamRecordAnnotator` when generating the `proteins_by_id` dictionary. The `util` function `get_proteins_by_id` currently loops through all the potential protein IDs of a feature (e.g. `unique_protein_id`, `protein_id` and `locus_tag`), and this can cause a feature whose ID is based on the `protein_id` qualifier to be overwritten by another feature that shares the same `protein_id` but was deduplicated using `unique_protein_id`. This causes `PFAM_domain` features to be placed incorrectly in the genomic sequence, because the `protein_id` used in the `hmmscan` output file matches a different feature and the wrong feature location is picked.
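A minimal sketch of the overwrite, assuming a simplified lookup builder. The `Feature` class, `candidate_ids` and `build_proteins_by_id` are hypothetical stand-ins and not the actual DeepBGC or Biopython code; the coordinates and IDs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a CDS feature with a location and qualifiers."""
    start: int
    end: int
    qualifiers: dict = field(default_factory=dict)


def candidate_ids(feature):
    """Collect every ID the feature could be looked up by."""
    ids = []
    for key in ("unique_protein_id", "protein_id", "locus_tag"):
        ids.extend(feature.qualifiers.get(key, []))
    return ids


def build_proteins_by_id(features):
    """Map each candidate ID to a feature; later features overwrite earlier ones."""
    mapping = {}
    for feature in features:
        for protein_id in candidate_ids(feature):
            mapping[protein_id] = feature
    return mapping


# Two CDS features share a protein_id; the second was deduplicated.
first = Feature(100, 400, {"protein_id": ["ABC123"]})
second = Feature(900, 1200, {"protein_id": ["ABC123"],
                             "unique_protein_id": ["ABC123_2"]})

lookup = build_proteins_by_id([first, second])
hit = lookup["ABC123"]  # an hmmscan hit for the first protein resolves here
print(hit.start, hit.end)
```

In this sketch the lookup for `ABC123` returns the second feature, so a domain found in the first protein would be given the second protein's coordinates.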