NaegleLab / CoDIAC

Seeing overlapping domains in InterPro annotations #15

Closed knaegle closed 1 year ago

knaegle commented 1 year ago

Description

Despite boundary harmonization in InterPro annotation of the human reference file, we are seeing some domains that overlap.

Examples

Examples include:
VAV1 (P15498): we get a VAV1_SH3 domain that overlaps with an SH3 domain.
TNS4 (Q8IZW8): we see a PTB/PI_dom that overlaps with a PTB domain.

Expected behavior

In these cases, only the universal domain annotation (i.e., SH3 and PTB) should appear in the domain list and the architecture list.

knaegle commented 1 year ago

Update: These are cases where InterPro has two unlinked domains that overlap. The domains are close in size and are not connected in the hierarchy. Hence, they are not collapsible by either of our two current methods.

I propose that the desired behavior is to select the more general of the two domain families. This could be identified if we knew how many domains of each type occur in a proteome of interest.

If this is possible, then we update the InterPro collapse code to keep track of any domains that overlap by at least an overlap_threshold. We continue to remove/collapse domains as needed, first by the length requirement and then by the hierarchy requirement. We then check back on the pre-computed overlaps, ask whether any of those overlapping sets still exist in the post-processed domains, and, if so, select the domain based on generality, measured by its number of occurrences in a proteome.
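A minimal sketch of this proposal, not the CoDIAC implementation. The function names, the overlap definition (shared residues divided by the shorter domain's length), the default threshold, the coordinates, and the proteome counts are all assumptions made up for illustration.

```python
from itertools import combinations

# A domain is (name, start, end) with inclusive residue coordinates.

def overlap_fraction(d1, d2):
    """Shared residues divided by the length of the shorter domain (assumed metric)."""
    _, s1, e1 = d1
    _, s2, e2 = d2
    shared = min(e1, e2) - max(s1, s2) + 1
    if shared <= 0:
        return 0.0
    return shared / min(e1 - s1 + 1, e2 - s2 + 1)

def find_overlaps(domains, overlap_threshold=0.5):
    """Record every pair of domains overlapping by at least overlap_threshold."""
    return [(a, b) for a, b in combinations(domains, 2)
            if overlap_fraction(a, b) >= overlap_threshold]

def resolve_by_generality(processed, overlaps, proteome_counts):
    """For each recorded pair still present after collapsing, drop the rarer domain."""
    kept = list(processed)
    for a, b in overlaps:
        if a in kept and b in kept:
            # The domain type seen less often in the proteome is treated as the
            # less general one; on a tie the first domain of the pair is kept.
            loser = a if proteome_counts.get(a[0], 0) < proteome_counts.get(b[0], 0) else b
            kept.remove(loser)
    return kept

# Hypothetical coordinates and counts, for illustration only.
raw = [("VAV1_SH3", 600, 660), ("SH3", 605, 662), ("SH2", 400, 480)]
counts = {"VAV1_SH3": 3, "SH3": 200, "SH2": 100}

overlaps = find_overlaps(raw)   # computed before any collapsing
processed = raw                 # stand-in for the length and hierarchy collapse steps
result = resolve_by_generality(processed, overlaps, counts)
print(result)
```

With these made-up numbers the VAV1_SH3 entry is dropped in favor of SH3, which matches the desired behavior described in the issue.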

@saqibrizvi11 - are you aware of how we might assess the number of InterPro domains in a species by InterPro ID?

knaegle commented 1 year ago

Fixed InterPro boundaries by using placement in the list, having noted that the most generalized form of a domain is always returned first. The final bugs were related to multiple boundaries listed under a parent domain. The code now removes all domains in a lower-placed list if one or more of them overlap with a domain above it in the list.
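A rough sketch of the list-placement rule described above, assuming entries arrive as (name, list of boundaries) pairs with the most general entry first. The function names, the data layout, and the coordinates are illustrative assumptions, not the actual CoDIAC code. Comparing only against entries that were kept is also an assumption.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]

def filter_by_list_order(entries):
    """Keep entries in order; drop a whole entry if any of its boundaries
    overlaps a boundary of an entry already kept above it."""
    kept = []
    for name, bounds in entries:
        clash = any(overlaps(b, kb)
                    for b in bounds
                    for _, kept_bounds in kept
                    for kb in kept_bounds)
        if not clash:
            kept.append((name, bounds))
    return kept

# Hypothetical coordinates; the general SH3 entry is listed before VAV1_SH3.
entries = [
    ("SH3", [(605, 662)]),
    ("VAV1_SH3", [(600, 660)]),
    ("SH2", [(400, 480)]),
]
filtered = filter_by_list_order(entries)
print(filtered)
```

Because the more general entry comes first, no proteome-wide counts are needed in this version: list position stands in for generality.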

knaegle commented 1 year ago

Found an issue where, in rare circumstances, domains with valid boundaries are removed because of invalid domain boundaries later in the list.

For example, see https://www.ebi.ac.uk/interpro/protein/UniProt/P19174/, where the first PH domain is lost because the SH2 and SH3 domains overtake the PH domains.

Will add removal of boundaries within the domain list specifically.
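One way this could look, as a sketch under assumptions rather than the code that was pushed: instead of dropping a whole entry, only the individual boundaries that clash with an entry higher in the list are removed. The function names, the data layout, and the coordinates below are hypothetical and are not the real P19174 annotations.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]

def filter_boundaries(entries):
    """Drop only the boundaries that overlap a kept boundary higher in the list;
    an entry is removed only if none of its boundaries survive."""
    kept = []
    for name, bounds in entries:
        higher = [kb for _, kept_bounds in kept for kb in kept_bounds]
        surviving = [b for b in bounds
                     if not any(overlaps(b, h) for h in higher)]
        if surviving:
            kept.append((name, surviving))
    return kept

# Hypothetical layout: the second PH boundary spans the SH2 and SH3 domains.
entries = [
    ("SH2", [(550, 650)]),
    ("SH3", [(790, 850)]),
    ("PH", [(30, 140), (480, 900)]),
]
filtered = filter_boundaries(entries)
print(filtered)
```

Here the first PH boundary is retained and only the boundary that spans the SH2 and SH3 domains is discarded.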

knaegle commented 1 year ago

Issue complete; new domains and annotations pushed to the relevant repo.