Merck / deepbgc

BGC Detection and Classification Using Deep Learning
https://doi.org/10.1093/nar/gkz654
MIT License
123 stars 27 forks source link

bug in get_proteins_by_id affecting pfam annotator #78

Open LuisFF opened 2 years ago

LuisFF commented 2 years ago

First of all thanks for developing DeepBGC and making it available to the community.

I came across a bug in HmmscanPfamRecordAnnotator when generating the proteins_by_id dictionary. The util function get_proteins_by_id is currently looping through all the potential protein ids of a feature (e.g. unique_protein_id, protein_id and locus_tag) and this can cause features with id based on protein_id qualifier to be overwritten by another feature that shares the same protein_id but it was deduplicated using the unique_protein_id. This is causing PFAM_domain features to be incorrectly placed in the genomic sequence because protein_id used in hmmscan output file will match a different feature and pick the incorrect feature location.

prihoda commented 2 years ago

Hi @LuisFF, thanks for reporting this. I'll need to think for a bit since it's been a while since I wrote the code :)

I would be careful around changing the logic of get_proteins_by_id, it's used in multiple places and not sure if it's safe to be able to only access the feature using its first protein ID, as proposed in your PR. Currently, it allows mapping from any of the potential protein IDs.

If I understand it correctly, this could also be solved by removing the protein_id qualifier when entering the unique_protein_id, correct? Here:

https://github.com/Merck/deepbgc/blob/c836f081d029041aa1754ec4f21cba0ce826a034/deepbgc/util.py#L503-L505

LuisFF commented 2 years ago

Hi @prihoda, thanks for looking into this.

Yes, removing the protein_id qualifier from subsequent CDS features with that same qualifier value is another way of solving this issue. That avoids any side effects I may overlooked in #79.

prihoda commented 2 years ago

Great, would you be up to implementing that behavior in your PR? @LuisFF

LuisFF commented 1 year ago

Hi @prihoda , I pushed the suggested changes.

Please let me know if there's anything else needed.