NaegleLab / CoDIAC

GNU General Public License v3.0

Replacing Inter #46

Closed campbellla closed 3 months ago

campbellla commented 3 months ago

Changed the InterPro.py module to pull only from the InterPro database and to retrieve the hierarchy determined by InterPro's extra fields. These changes should fix the prior boundary issues in InterPro metadata fetching, at least when tested locally on UBR, SH2, and SH3 domains. The module currently does not allow any domain overlap.
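For readers unfamiliar with what "no domain overlap" means in practice, here is a minimal sketch of one way to enforce it. It is not the InterPro.py implementation: the `drop_overlaps` name, the `(name, start, end)` tuple layout, and the keep-the-earliest-start rule are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (domain name, start, end), 1-based inclusive.
Domain = Tuple[str, int, int]


def drop_overlaps(domains: List[Domain]) -> List[Domain]:
    """Return domains sorted by start, skipping any that overlap one already kept.

    Assumption: when two domains overlap, the one that starts first wins.
    The real module may choose differently (e.g. by InterPro hierarchy).
    """
    kept: List[Domain] = []
    for name, start, end in sorted(domains, key=lambda d: (d[1], d[2])):
        # Kept domains are disjoint and sorted, so only the last one can overlap.
        if kept and start <= kept[-1][2]:
            continue
        kept.append((name, start, end))
    return kept
```

With this rule, a domain whose start falls inside a previously kept domain is dropped entirely rather than trimmed.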

knaegle commented 3 months ago

@campbellla

Found some data loss when applying the new InterPro data to the SH2 domain (see SH2_domain_contact_mapping for the full UniProt ID set).

These proteins no longer report an SH2 domain: O14508|SOCS2, O14512|SOCS7, O14544|SOCS6, O75159|SOCS5, Q8WXH5|SOCS4, Q7KZ85|SUPT6H.

Also, new behavior: domain boundaries are now reported differently. This is a good fix; the original boundaries were incorrect.
->Q7Z4S9|SH2D6|1|63|173
+>Q7Z4S9|SH2D6|1|223|333
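A comparison of this kind (lost domains plus changed boundaries) can be sketched as follows. This is an illustrative helper, not code from CoDIAC: the function names are hypothetical, the field order `accession|gene|index|start|end` is inferred from the record shown above, and the accessions in the usage example are placeholders.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]   # (accession, gene, domain index)
Span = Tuple[int, int]       # (start, end)


def parse_records(records: Iterable[str]) -> Dict[Key, Span]:
    """Parse 'accession|gene|index|start|end' strings (assumed field order)."""
    parsed: Dict[Key, Span] = {}
    for rec in records:
        acc, gene, idx, start, end = rec.strip().split("|")
        parsed[(acc, gene, idx)] = (int(start), int(end))
    return parsed


def compare_fetches(
    old: Iterable[str], new: Iterable[str]
) -> Tuple[List[Key], List[Tuple[Key, Span, Span]]]:
    """Return (domains missing from the new fetch, domains whose boundaries changed)."""
    old_map, new_map = parse_records(old), parse_records(new)
    lost = sorted(k for k in old_map if k not in new_map)
    moved = sorted(
        (k, old_map[k], new_map[k])
        for k in old_map
        if k in new_map and old_map[k] != new_map[k]
    )
    return lost, moved


# Placeholder data, not real annotations.
old_fetch = ["P00001|GENEA|1|10|100", "P00002|GENEB|1|20|120"]
new_fetch = ["P00002|GENEB|1|30|130"]
lost, moved = compare_fetches(old_fetch, new_fetch)
```

Here `lost` lists entries that disappeared and `moved` pairs each surviving entry with its old and new boundaries, which separates genuine data loss from boundary corrections.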

knaegle commented 3 months ago

Noting behavior changes across the SH2 domain family, based on the new InterPro fetch. FES/FER: now returning the F_BAR domain instead of FCH. JAK family: now returning the Band41 domain and a JAK-specific FERM C-lobe, instead of a larger FERM domain.

These changes are consistent with InterPro's representative domain architecture and appear to be what should be returned.

INPPL (SHIP) no longer returns the IPPc domain, consistent with UniProt; the IPPc annotation likely came from querying all sources in the past.

Additionally, a couple of new proteins now appear in the search that don't have SH2 domains (e.g., one PIK3 catalytic protein).