a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Development of `PDBManager` Class (WIP) #272

Closed · amorehead closed 1 year ago

amorehead commented 1 year ago

Reference Issues/PRs

https://github.com/a-r-j/graphein/discussions/270 @a-r-j

What does this implement/fix? Explain your changes

Adds a utility for creating selections of experimental PDB structures

What testing did you do to verify the changes in this PR?

WIP

Draws the following metadata:

id | pdb | chain | length | molecule_type | name | sequence | ligands | source | resolution | experiment_type
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
100d_A | 100d | A | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
100d_B | 100d | B | 10 | na | DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP... | CCGGCGCCGG | [SPM] |   | 1.90 | diffraction
101d_A | 101d | A | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101d_B | 101d | B | 12 | na | DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C... | CGCGAATTCGCG | [CBR, MG, NT] |   | 2.25 | diffraction
101m_A | 101m | A | 154 | protein | MYOGLOBIN | MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR... | [HEM, NBN, SO4] | Physeter catodon | 2.07 | diffraction
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
9xia_A | 9xia | A | 388 | protein | XYLOSE ISOMERASE | MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL... | [DFR, MN] | Streptomyces rubiginosus | 1.90 | diffraction
9xim_A | 9xim | A | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_B | 9xim | B | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_C | 9xim | C | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
9xim_D | 9xim | D | 393 | protein | D-XYLOSE ISOMERASE | SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG... | [MN, XLS] | Actinoplanes missouriensis | 2.40 | diffraction
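
For context, a minimal usage sketch of the API this PR works towards (constructor and attribute names reflect the WIP and may change):

```python
from graphein.ml.datasets import PDBManager  # final module path per the discussion below

# Instantiating the manager builds the per-chain metadata dataframe shown
# above (id, pdb, chain, length, molecule_type, name, sequence, ...).
manager = PDBManager(root_dir=".")

# Selections are then just filters over that dataframe, e.g. protein chains
# solved at better than 2.0 A resolution:
selection = manager.df[
    (manager.df["molecule_type"] == "protein") & (manager.df["resolution"] <= 2.0)
]
print(f"{len(selection)} chains selected")
```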

Currently missing:

Pull Request Checklist

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 39.87%; project coverage change: +3.60% :tada:

Comparison is base (8123f42) 40.27% compared to head (cd4db9e) 43.87%.


Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #272      +/-   ##
==========================================
+ Coverage   40.27%   43.87%   +3.60%
==========================================
  Files          48      113      +65
  Lines        2811     7718    +4907
==========================================
+ Hits         1132     3386    +2254
- Misses       1679     4332    +2653
```

The report lists 105 impacted files, mostly new modules currently at `0.00%` patch coverage (e.g. `graphein/ml/datasets/foldcomp_dataset.py`, `graphein/ml/metrics/tm_score.py`, `graphein/ppi/visualisation.py`, `graphein/protein/folding_utils.py`); see the full Codecov report for the per-file breakdown.

:umbrella: View full report in Codecov by Sentry.

a-r-j commented 1 year ago

Wow, thanks for the refactor & additions! I added deposition dates for temporal splits.

I also had to rename the file, as `pdb` was causing namespace clashes with the Python debugger.
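
Re the deposition dates: a temporal split is then just a dataframe filter. A sketch, assuming a `deposition_date` column on the metadata dataframe (column name illustrative):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, train_cutoff: str, val_cutoff: str):
    """Split chains into train/val/test by deposition date: everything
    deposited before `train_cutoff` trains, before `val_cutoff` validates,
    and the remainder tests."""
    dates = pd.to_datetime(df["deposition_date"])
    train = df[dates < train_cutoff]
    val = df[(dates >= train_cutoff) & (dates < val_cutoff)]
    test = df[dates >= val_cutoff]
    return train, val, test

# e.g. train on pre-2020 depositions, validate on 2020, test on 2021 onwards
# (`df` being the manager's metadata dataframe)
train_df, val_df, test_df = temporal_split(df, "2020-01-01", "2021-01-01")
```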

amorehead commented 1 year ago

@a-r-j, no problem! Happy to do so. Funnily enough, earlier today I accidentally also implemented parsing of deposition dates (I should have checked my GitHub notifications beforehand). However, to minimize development-style friction, I've conformed my implementation to reuse most parts of yours. I've also added support for time-based (i.e., deposition-date) splits, as well as new PDB filters, in this commit.

amorehead commented 1 year ago

One practical question about SonarCloud: any ideas for how we can get around the security flag it raises for an HTTP URL listed in our code? The HTTPS version of this URL does not appear to exist, so I am unsure how to work around this flag.

a-r-j commented 1 year ago

Eep! Sorry, should have dropped a note first.

Re SonarCloud, it's not a problem; I can manually override the flag since there's no HTTPS alternative (AFAIK).

a-r-j commented 1 year ago

> Organization of output files (e.g., PDB files) into corresponding directories for each requested dataset split (e.g., `train/`, `val/`, and `test/`).

On this point, I think keeping all structures in one directory and maintaining split indexing via dataframes/CSV files may be a better way to go. That way, you can create an arbitrary number of splits without lots of duplicate files or a complicated directory structure.
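
To make this concrete, a sketch of the single-directory layout with CSV split indices (paths and column names illustrative):

```python
import pandas as pd

# All structures live in one flat directory (e.g. structures/100d.pdb);
# each split is just a CSV of ids, so nothing is copied or duplicated.
splits = {"train": train_df, "val": val_df, "test": test_df}  # from any split above
for name, split_df in splits.items():
    split_df[["id", "pdb", "chain"]].to_csv(f"{name}.csv", index=False)

# Downstream, a dataset resolves the ids back to the shared directory:
train_pdbs = pd.read_csv("train.csv")["pdb"].unique()
train_paths = [f"structures/{pdb}.pdb" for pdb in train_pdbs]
```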

a-r-j commented 1 year ago

To avoid duplicating work, I'll get cracking on some tests.

a-r-j commented 1 year ago

> Structure-based dataset splits (e.g., using TMalign to compare structures using a 50% similarity threshold).

I think I implemented TM-score a while ago, without the alignment procedure and with an inbuilt assumption of equal lengths: https://github.com/a-r-j/graphein/blob/master/graphein/ml/metrics/tm_score.py

There's also a Kabsch implementation in Graphein:

https://github.com/a-r-j/graphein/blob/acaef3afccdcdba3708bc1006068e7bcf07dfd22/graphein/protein/tensor/geometry.py#L126

But I suppose this is not suitable, as it requires correspondence between the points. I'm not super familiar with TM-score and what it uses to align structures.
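
For concreteness, the equal-length version linked above amounts to the following (a NumPy sketch assuming pre-superposed structures, not a copy of the Graphein implementation):

```python
import numpy as np

def tm_score(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """TM-score for two (L, 3) C-alpha coordinate arrays with a 1:1 residue
    correspondence and no alignment search (structures assumed already
    superposed, e.g. via the Kabsch routine linked above)."""
    assert coords_a.shape == coords_b.shape
    length = coords_a.shape[0]
    # Length-dependent normalisation of Zhang & Skolnick; valid for L > 15.
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    dists = np.linalg.norm(coords_a - coords_b, axis=1)
    return float(np.mean(1.0 / (1.0 + (dists / d0) ** 2)))
```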

amorehead commented 1 year ago

@a-r-j, in general, the use cases for this suite of structure-comparison tools are as follows:

(1) TM-score: compares the Cα traces of two "sequence-identical" structures directly, using the residue correspondence given by the two input PDB files. This is a "one-to-one" comparison that assumes the sequences are not misaligned in any way (e.g., due to inconsistent residue numbering).

(2) TM-align or US-align: computes the TM-score of two structures repeatedly under different alignments of their input sequences, to find an "optimal" sequence alignment before reporting the final TM-score. This is referred to as structural alignment (as opposed to plain TM-score). In practice, it is only negligibly slower than TM-score, since the program is written in C++.

In short, if you have any suspicion that your two input PDBs may index their residue numbers inconsistently, or that the experimental structures may contain missing residues, it is wise to use the "-align"-based programs for your respective molecule type.

For us, I think a simple Python wrapper around US-align for running structural alignments and similarity calculations would be ideal, as it would let us compare not only protein pairs but also, e.g., RNA pairs.
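
A minimal sketch of such a wrapper, assuming a `USalign` binary on the PATH (the regex over its stdout is an assumption about its report format):

```python
import re
import subprocess

def us_align_tm_score(path_a: str, path_b: str) -> float:
    """Run US-align on two structure files (protein or RNA) and return the
    first reported TM-score. Assumes a `USalign` executable on PATH."""
    result = subprocess.run(
        ["USalign", path_a, path_b], capture_output=True, text=True, check=True
    )
    # US-align reports lines like "TM-score= 0.9126 (normalized by ...)".
    scores = re.findall(r"TM-score=\s*([0-9.]+)", result.stdout)
    if not scores:
        raise RuntimeError(f"No TM-score in US-align output:\n{result.stdout}")
    return float(scores[0])
```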


amorehead commented 1 year ago

@a-r-j, apologies for my unexpected delays on the pull request. For the time being, are you comfortable merging these changes, even though we still have the structure-based filtering left to implement?

On this bit, I'm thinking we might be able to skip using TM-score and instead rely on CATH fold classifications to determine how structurally similar two proteins are. This could be a much quicker way of comparing two structures, since all we would need to know in advance is the CATH classification of each input structure to make a clustering/splitting decision. What do you think of this idea? Do you see any limitations of this approach compared to TM-score 0.5-thresholding?
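
To make the idea concrete, a sketch of a fold-level split, assuming a hypothetical `cath_code` column (e.g. "1.10.8.10") had been added to the metadata:

```python
import pandas as pd

def cath_topology_split(df: pd.DataFrame, test_frac: float = 0.1):
    """Split chains so that no CATH topology (the C.A.T fold level) is shared
    between train and test. Assumes a hypothetical `cath_code` column."""
    folds = df["cath_code"].str.split(".").str[:3].str.join(".")
    unique_folds = folds.drop_duplicates().sample(frac=1.0, random_state=42)
    n_test = int(len(unique_folds) * test_frac)
    test_folds = set(unique_folds.iloc[:n_test])
    test = df[folds.isin(test_folds)]
    train = df[~folds.isin(test_folds)]
    return train, test
```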

a-r-j commented 1 year ago

In principle, yes, I'm happy to merge, but I think a minimal test suite would be good first. I did make a very small start on one (will push ASAP), but I'm a little occupied these days so don't have much capacity to build it out.

Re structural clustering, I was actually thinking we could wrap foldseek, since it already appears to wrap TM-align. Given we're already using MMseqs, it probably makes the most sense to stay within Martin & co.'s dependency universe.
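
A sketch of what that wrapper might look like, assuming a `foldseek` binary on PATH and its `easy-cluster` workflow:

```python
import subprocess

import pandas as pd

def foldseek_cluster(structure_dir: str, out_prefix: str = "fs") -> pd.DataFrame:
    """Structurally cluster a directory of structures with foldseek's
    easy-cluster workflow. Assumes `foldseek` is installed and on PATH."""
    subprocess.run(
        ["foldseek", "easy-cluster", structure_dir, out_prefix, "tmp"],
        check=True,
    )
    # easy-cluster writes `<out_prefix>_cluster.tsv` as representative/member pairs.
    return pd.read_csv(
        f"{out_prefix}_cluster.tsv", sep="\t", names=["representative", "member"]
    )
```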

amorehead commented 1 year ago

@a-r-j, this all sounds good to me. What kind of format or template do you use for your unit tests, at least for this particular use case? I may be able to help here (depending on my schedule bandwidth in the coming weeks).

Regarding structure-based clustering, I really like the idea of using foldseek as well. This seems like great timing to incorporate it, since we now have access to a fast, streamlined way of clustering structures. I think we should pursue a foldseek wrapper as our first approach here, rather than the other ideas I floated previously.

review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

a-r-j commented 1 year ago

@amorehead an edge case that has popped up in the tutorial: we need a way of handling proteins that are not available as PDB files (e.g. 6T9M).

We could add this as another column in `df` if we can figure out the underlying rules. I know that very large structures (~2,700 residues, iirc) are not distributed as PDB files. I also think there is a cutoff based on the number of chains. However, neither seems to explain 6T9M. Do you know of any others?

amorehead commented 1 year ago

@a-r-j, off the top of my head, for these large structures I believe the PDB's convention is to use alternative file formats such as mmCIF. If so, that would mean adding full mmCIF support to the `PDBManager` as a secondary file type. I am not a huge fan of this idea, since I would like to keep the manager as `.pdb`-focused as possible. Is it possible to "convert" mmCIF files into PDB counterparts, for example using a Python library? If so, maybe we can approach the problem that way?

Of course, this whole idea assumes we have access to the mmCIF files once we determine that a given PDB structure is only available in mmCIF format (which we currently have no rule for determining in advance).

a-r-j commented 1 year ago

AFAIK, conversion isn't the most reliable process. I added a rudimentary conversion capability to BioPandas:

https://github.com/BioPandas/biopandas/blame/79e6599e52c4b242164a9b1bbbda068bacbbee1c/biopandas/mmcif/pandas_mmcif.py#L495

But it doesn't appear to be available on PyPI yet.
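
For reference, a sketch of how that conversion would look once released (method names taken from the linked BioPandas source; treat as tentative until it lands on PyPI):

```python
from biopandas.mmcif import PandasMmcif

# Fetch an mmCIF-only entry and convert it to a PandasPdb object, then
# serialise it as .pdb. Conversion is rudimentary and may legitimately fail
# for entries that exceed PDB-format limits (e.g. >62 chains).
pmmcif = PandasMmcif().fetch_mmcif("6t9m")
ppdb = pmmcif.convert_to_pandas_pdb()
ppdb.to_pdb("6t9m.pdb")
```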

Part of why the PDB does this is that a large number of chains will break the column spacing in PDB files, so we'd also need to handle chain renaming (the limit is 62 chains, presumably for `a-zA-Z0-9`). I think the most straightforward thing to do is to catch these errors and log a warning. The PDB maintains a list of these cases:

> An index of all structure entries that do not have a standard PDB format file is updated regularly on the PDB FTP site at https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt. This list will continue to grow as new large structures are deposited and released.

so we could simply add a `pdb_available` column to the dataframe to filter them out.
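
A sketch of deriving that column from the index above (assuming the index lists one lowercase PDB id per line):

```python
import pandas as pd
import requests

BUNDLE_INDEX_URL = (
    "https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt"
)

def annotate_pdb_availability(df: pd.DataFrame) -> pd.DataFrame:
    """Add a `pdb_available` column that is False for entries that the wwPDB
    only distributes as bundles (i.e. no standard .pdb file)."""
    index = requests.get(BUNDLE_INDEX_URL, timeout=30).text
    # The index is assumed to list one lowercase PDB id per line.
    unavailable = {line.strip().lower() for line in index.splitlines() if line.strip()}
    df = df.copy()
    df["pdb_available"] = ~df["pdb"].str.lower().isin(unavailable)
    return df
```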

amorehead commented 1 year ago

I agree with your approach. I think adding a `pdb_available` column based on the corresponding entry in https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt should work for a first version of our logic for this edge case. If we go this route, do we want to filter out rows where `pdb_available` is `False` by default, or should we notify users about this edge case directly and ask them to filter out these structures themselves?

amorehead commented 1 year ago

In addition to this, I just noticed another possible edge case. We might also need to consider that certain PDB files contain multiple models for a given protein target (e.g., Model 1 or Model 3). If so, should we default to always selecting the first available model for each protein target (i.e., PDB code)?

Or do you think we should still include all models for a given PDB code in the corresponding "collated" PDB files (i.e., those exported by `pdb_manager.export_pdbs()`) and have users pick the models they want to use for their downstream tasks?

a-r-j commented 1 year ago

> I agree with your approach. I think adding a `pdb_available` column based on the corresponding entry in https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt should work for a first version of our logic for this edge case. If we go this route, do we want to filter out rows where `pdb_available` is `False` by default, or should we notify users about this edge case directly and ask them to filter out these structures themselves?

Hmm. I think we should raise a (descriptive!) warning whenever an export function is called (e.g., to FASTA or CSV) and there are unavailable PDBs in the dataframe. For download functions, this could be an error instead. That would ensure the state of the `PDBManager` is always consistent with the downloaded structures, so there are no hidden surprises from missing structures downstream.
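
Sketched behaviour (helper and column names are illustrative, not the final API):

```python
import warnings

import pandas as pd

def _check_pdb_availability(df: pd.DataFrame, action: str = "export") -> None:
    """Warn for exports (fasta/csv) and raise for downloads when the current
    selection contains entries without a standard PDB-format file."""
    unavailable = df.loc[~df["pdb_available"], "pdb"].unique().tolist()
    if not unavailable:
        return
    msg = (
        f"{len(unavailable)} selected entries have no standard PDB file "
        f"(see the wwPDB bundle index), e.g. {unavailable[:5]}"
    )
    if action == "download":
        raise ValueError(msg)
    warnings.warn(msg)
```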

> In addition to this, I just noticed another possible edge case. We might also need to consider that certain PDB files contain multiple models for a given protein target (e.g., Model 1 or Model 3). If so, should we default to always selecting the first available model for each protein target (i.e., PDB code)?
>
> Or do you think we should still include all models for a given PDB code in the corresponding "collated" PDB files (i.e., those exported by `pdb_manager.export_pdbs()`) and have users pick the models they want to use for their downstream tasks?

That's a good point. It should actually be quite a straightforward modification to `extract_chains_to_file` in `graphein.protein.utils`, as I added this functionality to BioPandas. How users should choose between models is less clear, though, as none of the metadata we have collected so far references models. Selecting the first model is probably the safest option (and, iirc, the default behaviour in Graphein). At the risk of growing the scope too much, we could also support downloading biological assemblies.

amorehead commented 1 year ago

@a-r-j, thanks for your insights on these matters. In this commit, I've made the `extract_chains_to_file` function default to writing out only the chains of a PDB's first model (i.e., `model=1` with BioPandas - thanks for adding that functionality to BioPandas beforehand!). I've also tested your `PDBManager` tutorial notebook and can confirm that, as intended, an exception is now raised informing users when their current PDB selection contains a protein for which the PDB file cannot be found/downloaded.
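
For anyone following along, first-model selection with the new BioPandas multi-model support looks roughly like this (method name per the functionality referenced above; treat the exact signature as a sketch):

```python
from biopandas.pdb import PandasPdb

# Select only the first model of a multi-model (e.g. NMR) entry before
# writing chains out.
ppdb = PandasPdb().fetch_pdb("1d3z")  # ubiquitin NMR ensemble with 10 models
first_model = ppdb.get_model(1)
first_model.to_pdb("1d3z_model1.pdb")
```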

a-r-j commented 1 year ago

Amazing, thanks for that! I'll give this a once-over in the morning, and then I think we're ready for an initial release (unless there's anything outstanding that you want to add?).

amorehead commented 1 year ago

This plan sounds good to me. Besides adding support for structure-based splits using foldseek (in future work ;)), I think we are good to go for a full release. We can add foldseek structural clustering/splitting later on, based on requests or project needs.

amorehead commented 1 year ago

One last thought before we merge this in: maybe we should add a section to the tutorial notebook on how to specify splits for an ML dataset with the `PDBManager`, using either time-based splitting or sequence clustering?

a-r-j commented 1 year ago

Yep, very good point. I've not rigorously explored that part of the API so I'll take care of it! :)

a-r-j commented 1 year ago

Got sidetracked with some code spring cleaning & additional utils. Apologies for turning this PR into a bit of a mess :L

PDBManager things

I worked on the notebook some more, but there are some cells on the temporal splitting where Jupyter hangs and dies on me. I've commented them out; any chance you could take a look?

  1. Moved the `PDBManager` to `graphein.ml.datasets` so it can live with its friends.
  2. Added MMTF support to the download utils. This should make PDB splits much smaller and hopefully faster to download. The only limitation is writing MMTFs (for extracting chains), so I worked on adding that to BioPandas this morning: https://github.com/BioPandas/biopandas/pull/119.

Misc

  1. Added ESMFold and ESM embedding utils.
  2. Added some utils for compressing predicted structure files into smaller files & FoldComp databases.
  3. Refactored some utils we use for checking dependencies.

amorehead commented 1 year ago

@a-r-j, apologies for my delayed response here. I just ran the latest version of the `PDBManager` tutorial notebook and was successfully able to perform all of the time-based splitting and sequence-based clustering in succession. I did not encounter any freezing issues, and (very nicely) I was able to reproduce all outputs from your most recent execution of the notebook. This suggests that our entire data fetching and splitting procedure is truly deterministic and reproducible! Great work putting this notebook together. My only small comment is that users may also like to know that they can compose multiple types of splitting operations: for example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits. Other than that, everything looks good to me!
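
For example, composing the two operations might look like this at the dataframe level (a sketch reusing the split dataframes from the earlier sketches, not a specific `PDBManager` API):

```python
import pandas as pd

def temporal_subsplit(split_df: pd.DataFrame, cutoff: str):
    """Split one existing (e.g. cluster-based) split by deposition date,
    which cannot move chains across the original split boundary."""
    dates = pd.to_datetime(split_df["deposition_date"])
    return split_df[dates < cutoff], split_df[dates >= cutoff]

# e.g. carve a "recent structures" subset out of the sequence-clustered train split
train_older, train_recent = temporal_subsplit(train_df, "2020-01-01")
```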

a-r-j commented 1 year ago

Great! It seems it was the clustering step with mmseqs that Jupyter wasn't happy with.

> My only small comment is that users may also like to know that they can compose multiple types of splitting operations: for example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits.

Good point! Will add this.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 1 (rating A)

- Coverage: no coverage information
- Duplication: 0.0%