a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.02k stars, 131 forks

add PDB manager #270 #271

Closed a-r-j closed 1 year ago

a-r-j commented 1 year ago

Reference Issues/PRs

#270 @amorehead

What does this implement/fix? Explain your changes

Adds a utility for creating selections of experimental PDB structures

What testing did you do to verify the changes in this PR?

WIP

Draws the following metadata:

id | pdb | chain | length | molecule_type | name | sequence | ligands | source | resolution | experiment_type
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
100d_A | 100d | A | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
100d_B | 100d | B | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
101d_A | 101d | A | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101d_B | 101d | B | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101m_A | 101m | A | 154 | protein | MYOGLOBIN | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | [HEM, NBN, SO4] | Physeter catodon | 2.07 | diffraction
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
9xia_A | 9xia | A | 388 | protein | XYLOSE ISOMERASE | MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL... | [DFR, MN] | Streptomyces rubiginosus | 1.90 | diffraction
9xim_A | 9xim | A | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_B | 9xim | B | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_C | 9xim | C | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_D | 9xim | D | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
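
For illustration, a selection over this kind of per-chain metadata can be sketched in plain Python. The records are copied from rows of the table above, while `ChainRecord` and `select` (including its parameter names) are hypothetical and are not the API added in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChainRecord:
    """One row of the per-chain metadata (a subset of the columns above)."""

    id: str
    pdb: str
    chain: str
    length: int
    molecule_type: str
    resolution: float
    experiment_type: str


# Rows copied from the table above; name, sequence, ligands and source omitted.
RECORDS = [
    ChainRecord("100d_A", "100d", "A", 10, "na", 1.90, "diffraction"),
    ChainRecord("101d_A", "101d", "A", 12, "na", 2.25, "diffraction"),
    ChainRecord("101m_A", "101m", "A", 154, "protein", 2.07, "diffraction"),
    ChainRecord("9xia_A", "9xia", "A", 388, "protein", 1.90, "diffraction"),
    ChainRecord("9xim_A", "9xim", "A", 393, "protein", 2.40, "diffraction"),
]


def select(
    records: List[ChainRecord],
    molecule_type: Optional[str] = None,
    max_resolution: Optional[float] = None,
    min_length: Optional[int] = None,
) -> List[ChainRecord]:
    """Hypothetical helper: keep records that pass every filter that is set."""
    kept = []
    for rec in records:
        if molecule_type is not None and rec.molecule_type != molecule_type:
            continue
        if max_resolution is not None and rec.resolution > max_resolution:
            continue
        if min_length is not None and rec.length < min_length:
            continue
        kept.append(rec)
    return kept


# Protein chains resolved at 2.1 or better.
proteins = select(RECORDS, molecule_type="protein", max_resolution=2.1)
print([rec.id for rec in proteins])  # ['101m_A', '9xia_A']
```

Treating an unset filter as "no constraint" keeps the helper composable: each additional criterion only narrows the selection.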

Currently missing:

Pull Request Checklist

sonarcloud[bot] commented 1 year ago

SonarCloud Quality Gate failed.

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 1 (rating E)
Code Smells: 4 (rating A)

No coverage information
0.0% duplication

a-r-j commented 1 year ago

@amorehead I've added the clustering utils - does this cover what you were hoping for?

You also mentioned structural clustering, which I agree would be good to have. Do you have a preferred method?

amorehead commented 1 year ago

Hi, @a-r-j.

All these changes look great! I've gone ahead and created another pull request from a personal fork of Graphein's latest master branch. In particular, I've revised some of the documentation for each class method and, more importantly, added initial support for splitting a dataset (e.g., the sequence-clustered one) into an arbitrary number of splits (e.g., train, val, and test). Feel free to push changes directly to this forked branch of mine if you would like to make additions or edits. My hope is that we can use this pull request to finish developing the remaining functionality listed. Let me know if you have any questions, comments, or concerns.

Looking forward to the final result!
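
As a rough sketch of the splitting idea described above, the snippet below assigns whole clusters to an arbitrary number of named splits, so that chains from one cluster never land in different splits. The function name `split_by_cluster`, its parameters, and the cluster assignments are assumptions made for illustration; this is not the implementation in the forked branch.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_cluster(
    cluster_of: Dict[str, int],
    names: Sequence[str],
    fractions: Sequence[float],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Hypothetical helper: assign whole clusters to named splits."""
    if len(names) != len(fractions):
        raise ValueError("names and fractions must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Group chain ids by their cluster id.
    clusters: Dict[int, List[str]] = defaultdict(list)
    for chain_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(chain_id)

    # Shuffle clusters (not chains) reproducibly.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    # Cut the shuffled cluster list at cumulative fraction boundaries.
    splits: Dict[str, List[str]] = {name: [] for name in names}
    n = len(cluster_ids)
    start, cumulative = 0, 0.0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        cumulative += frac
        # The last split takes whatever remains, which absorbs rounding error.
        end = n if i == len(names) - 1 else round(cumulative * n)
        for cid in cluster_ids[start:end]:
            splits[name].extend(clusters[cid])
        start = end
    return splits


# Illustrative cluster assignment: chains with identical sequences share a cluster.
cluster_of = {
    "100d_A": 0, "100d_B": 0,
    "101d_A": 1, "101d_B": 1,
    "101m_A": 2,
    "9xia_A": 3,
    "9xim_A": 4, "9xim_B": 4, "9xim_C": 4, "9xim_D": 4,
}
splits = split_by_cluster(cluster_of, ["train", "val", "test"], [0.6, 0.2, 0.2])
for name, chain_ids in splits.items():
    print(name, sorted(chain_ids))
```

Note that the fractions apply to the number of clusters, not the number of chains, so split sizes in chains can be uneven when cluster sizes differ.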