NaegleLab / CoDIAC

GNU General Public License v3.0

Found "ghost" architectures missing domain annotations in PDB annotated file #42

Closed knaegle closed 5 months ago

knaegle commented 5 months ago

Description

We found we were missing some structures (e.g., 7T1L, the Fes superbinder). On inspection, the UniProt record accurately captures this structure and the PDB search occurs, but the domain columns contain a blank entry even though the architecture column suggests that domains were correctly extracted. The filtering step, which looks for the InterPro ID in the domain list, therefore drops these structures.

Files

In PDB_annotated, filter the domain column on blanks: there are on the order of 50 structures that were clearly domain annotated at some point but are now missing annotations. There isn't a clear pattern to the structures where this happened (i.e., some are mutant, some are not; many of them contain ligands, but it is unclear whether all of them do).
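
A minimal sketch of the check described above, using only the standard library. The column names (`PDB_ID`, `architecture`, `domains`) and the toy rows are hypothetical and are not the actual PDB_annotated headers or data; adjust them to match the real file.

```python
import csv
import io

# Hypothetical column names; the real PDB_annotated headers may differ.
ARCH_COL = "architecture"
DOMAIN_COL = "domains"


def find_ghost_rows(rows, arch_col=ARCH_COL, domain_col=DOMAIN_COL):
    """Return rows that have an architecture string but a blank domain entry."""
    ghosts = []
    for row in rows:
        arch = (row.get(arch_col) or "").strip()
        domains = (row.get(domain_col) or "").strip()
        if arch and not domains:
            ghosts.append(row)
    return ghosts


def filter_by_interpro(rows, interpro_id, domain_col=DOMAIN_COL):
    """Keep rows whose domain entry mentions the given InterPro ID."""
    return [row for row in rows if interpro_id in (row.get(domain_col) or "")]


# Toy stand-in for PDB_annotated (made-up IDs and values, for illustration only).
toy_csv = io.StringIO(
    "PDB_ID,architecture,domains\n"
    "TOY1,A|SH2|B,SH2:IPR000980:10:100\n"
    "TOY2,A|SH2|B,\n"
)
rows = list(csv.DictReader(toy_csv))

ghost_ids = [row["PDB_ID"] for row in find_ghost_rows(rows)]
kept_ids = [row["PDB_ID"] for row in filter_by_interpro(rows, "IPR000980")]
print("blank domain entries:", ghost_ids)
print("kept by InterPro filter:", kept_ids)
```

With the toy data, the row with a blank domain entry is reported as a "ghost" and is also the one dropped by the InterPro ID filter, which mirrors the behavior reported in this issue.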

Examples

7T1L 3U23 5NWM 1WQ1

Expected behavior

Domain annotations exist

Tasks

Include specific tasks in the order they need to be done in. Include links to specific lines of code where the task should happen at, if known

knaegle commented 5 months ago

Update: This all appears to be working as expected. The architecture column was for the full protein, indicating the protein was found. The blank entries for the structure indicate that no domain structure was found within the experimental structure.

I added a feature to print the full UniProt domain structure, so it is easier to compare, in place, the boundaries of the structure with those of the full protein.

In digging into a few cases: 7T1L (FES superbinder) - the issue here is that there are so many alterations between the experimental sequence and the reference that the alignment catches only a segment of the domain. Our alignment captures only 'KPLHEQLWYHGAIPRAEVAELLVHSGDFLVRESETV' of the SH2 domain of Fes within the structure. Honestly, this feels like the right behavior. There are enough gaps and mutations relative to wild-type Fes that this is a somewhat unrecognizable sequence.
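
To illustrate why a heavily altered sequence yields only a partial domain match, here is a rough sketch that finds the longest exactly shared segment between a reference domain and a structure sequence and reports the fraction of the domain it covers. This is a simplification that uses `difflib` rather than the alignment CoDIAC actually performs, and the sequences below are toy strings, not the Fes SH2 domain or the 7T1L sequence.

```python
from difflib import SequenceMatcher


def longest_shared_segment(reference_domain, structure_seq):
    """Return the longest contiguous segment present in both sequences."""
    matcher = SequenceMatcher(None, reference_domain, structure_seq, autojunk=False)
    match = matcher.find_longest_match(
        0, len(reference_domain), 0, len(structure_seq)
    )
    return reference_domain[match.a:match.a + match.size]


def domain_coverage(reference_domain, structure_seq):
    """Fraction of the reference domain covered by the longest shared segment."""
    segment = longest_shared_segment(reference_domain, structure_seq)
    return len(segment) / len(reference_domain)


# Toy sequences (hypothetical): the structure keeps only the middle of the domain.
toy_reference = "MKTAYIAKQRQISFVKSHFSRQ"
toy_structure = "GGIAKQRQISFVWW"

segment = longest_shared_segment(toy_reference, toy_structure)
coverage = domain_coverage(toy_reference, toy_structure)
print(segment, round(coverage, 2))
```

A coverage threshold (a hypothetical parameter, not something taken from CoDIAC) could then decide whether the domain counts as present in the structure; a low value like the one here would leave the domain column blank.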


3U23 (RIN3): The issue here is that the SH2 domain-containing portion of RIN3 is not what was crystallized. The structure contains a segment of RIN3 with a pTyr, acting as a ligand to CD2AP. Hence, the SH2 domain was not covered.

3MTT (PIK3R2): This covers residues 433-612, which spans neither SH2 domain, but is instead likely focused on that alpha-helical region.
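
The boundary comparison in cases like 3MTT can be sketched as a simple interval overlap check. The structure range 433-612 is taken from the comment above; the domain names and boundaries are placeholders chosen for illustration, not the actual UniProt annotations for PIK3R2.

```python
def overlap_fraction(struct_start, struct_end, dom_start, dom_end):
    """Fraction of a domain (inclusive residue range) covered by a structure."""
    overlap = min(struct_end, dom_end) - max(struct_start, dom_start) + 1
    domain_length = dom_end - dom_start + 1
    return max(overlap, 0) / domain_length


# Residue range covered by the experimental structure (from the issue text).
structure_range = (433, 612)

# Placeholder domain boundaries (hypothetical, NOT the real UniProt values).
placeholder_domains = {
    "SH2_first": (330, 425),
    "SH2_second": (622, 716),
}

fractions = {
    name: overlap_fraction(*structure_range, *bounds)
    for name, bounds in placeholder_domains.items()
}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2f} of domain covered")
```

When every fraction is zero, as with these placeholder boundaries, a blank domain entry for the structure is the expected result even though the full-protein architecture lists both domains.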

knaegle commented 5 months ago

Closing this, @alekhyaa2 no need to do final 2 tasks. Making a pull request to hook the branches of CoDIAC and SH2 data repos to main.