Closed by amorehead 1 year ago
Patch coverage: 39.87% and project coverage change: +3.60% :tada:
Comparison is base (8123f42) 40.27% compared to head (cd4db9e) 43.87%.
Wow, thanks for the refactor & additions! I added deposition dates for temporal splits.
I also had to rename the file, as `pdb` was causing namespace clashes with the Python debugger.
@a-r-j, no problem! Happy to do so. Funny enough, earlier today, I accidentally also implemented parsing of deposition dates (should have checked my GitHub notifications beforehand). However, to minimize development style friction, I've conformed my implementation to reuse most parts of your implementation. In addition, I've also added support for time-based (i.e., deposition date) splits, as well as new PDB filters in this commit.
One practical question about SonarCloud: Any ideas for how we can get around this security issue with this HTTP URL we have listed in our code? It looks like the HTTPS version of this URL does not exist, so I am unsure of how to proceed to work around this flag.
Eep! Sorry, should have dropped a note first.
Re SonarCloud, it's not a problem; I can manually override the flag since there's no HTTPS alternative (AFAIK).
Organization of output files (e.g., PDB files) into corresponding directories for each requested dataset split (e.g., train/, val/, and test/).
On this point, I think keeping all structures in one directory and maintaining split indexing via dataframes/csv files may be a better way to go. This way you can create an arbitrary number of splits without having lots of duplicate files and a complicated directory structure.
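The single-directory-plus-index approach can be sketched with a plain dataframe; the PDB IDs, column names, and file name below are placeholders, not the actual `PDBManager` schema:

```python
import pandas as pd

# Hypothetical index: one row per structure. All PDB files live in a single
# flat directory, and split membership is tracked here rather than via
# train/, val/, and test/ subdirectories.
splits_df = pd.DataFrame(
    {
        "pdb": ["4hhb", "1ubq", "6t9m"],
        "split": ["train", "val", "test"],
    }
)
splits_df.to_csv("splits.csv", index=False)

# Any number of alternative splits can be stored as extra columns or extra
# CSV files without duplicating a single structure file on disk.
train_ids = pd.read_csv("splits.csv").query("split == 'train'")["pdb"].tolist()
```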
To avoid duplicating work, I'll get cracking on some tests
Structure-based dataset splits (e.g., using TMalign to compare structures using a 50% similarity threshold).
I think I implemented TM-score a while ago (with an inbuilt assumption of equal lengths, and without the alignment procedure): https://github.com/a-r-j/graphein/blob/master/graphein/ml/metrics/tm_score.py
There's also a Kabsch implementation in Graphein:
But I suppose this is not suitable as it requires correspondence between the points. I'm not super familiar with TMScore and what they use to align structures.
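For reference, the alignment-free, equal-length variant being discussed reduces to the TM-score formula of Zhang & Skolnick applied to two already-superposed Cα traces. A minimal sketch (not the Graphein implementation; the real TM-score program additionally searches over superpositions):

```python
import numpy as np


def tm_score(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """Score two superposed, equal-length (L, 3) Ca coordinate arrays.

    Implements only the scoring formula: mean of 1 / (1 + (d_i / d0)^2),
    with the length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.
    """
    assert coords_a.shape == coords_b.shape
    L = coords_a.shape[0]
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8
    d = np.linalg.norm(coords_a - coords_b, axis=1)  # per-residue distances
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

Identical inputs score exactly 1.0, and the score decays toward 0 as the distances grow, which is what makes a 0.5 threshold usable for similarity decisions.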
@a-r-j, in general, the use cases of the suite of structure comparison tools are as follows:
(1) TM-score: For comparing the Ca atom structures of two "sequence-identical" structures perfectly using the Ca-atom amino acid sequences found in each of the two input PDB files. This performs a "one-to-one" comparison of two structures, assuming the sequences are not misaligned in any way (e.g., due to inconsistent indices for residue numbers).
(2) TM-align or US-align: Computing the TM-score of two structures multiple times using different alignments of their input sequences to find an "optimal" sequence alignment before reporting the final TM-score. This is referred to as structural alignment (compared to TM-score). In general, this is negligibly slower than TM-score's computations since the program is written in C++.
In short, if you have any suspicion that your two input PDBs may inconsistently index their residue numbers or may contain missing residues in the experimental structures, it is wise to use the "-align" based programs instead for your respective molecule type.
For us, I think having a simple Python wrapper around US-align to run structural alignments and similarity calculations would be ideal, as that would allow us to use this wrapper to compare structures of not only protein pair inputs but also e.g., RNA pair inputs.
Ref:
@a-r-j, apologies for my unexpected delays on the pull request. For the time being, are you comfortable merging these changes, even though we still have the structure-based filtering left to implement?
On this bit, I'm thinking we might be able to skip using TM-score and instead rely on CATH fold classifications to determine how similar two proteins structurally are. This might be a much quicker way of comparing two structures given that all we would need to know in advance then is the CATH classification of the two input structures to make a decision for clustering/splitting. What do you think of this idea? Do you see any limitations of this approach compared to TM-score 0.5-thresholding?
In principle, yes, happy to merge, but I think a minimal test suite would be good. I did make a very small start on them (will push ASAP) but am a little occupied these days, so I don't have much capacity to build them out.
Re structural clustering, I was actually thinking we could wrap `foldseek`, since it already appears to wrap TM-align. Given we're already using MMseqs2, it probably makes the most sense to stay within Martin & Co's dependency universe.
@a-r-j, this all sounds good to me. What kind of format or template do you use for your unit tests, at least in this particular use case? I may be able to help in this regard (possibly, depending on schedule bandwidth in the coming weeks).
Regarding structure-based clustering, I really like the idea of using `foldseek` as well. I think this would be great timing to incorporate it here, since we now have access to a fast, streamlined way of clustering structures together. I think we should pursue a wrapper around `foldseek` as our first approach here, rather than going for the other ideas I put out there previously.
@amorehead an edge case that has popped up in the tutorial: we need a way of handling proteins that are not available as PDB files (e.g. 6T9M).
We could add this as another column in `df` if we can figure out the underlying rules. I know that large structures (2,700 residues, IIRC) are not available as PDBs. I also think there is a cutoff based on the number of chains. However, this doesn't seem to be the case for 6T9M. Do you know of any others?
@a-r-j, off the top of my head, for these large PDB structures, I believe the PDB's convention is to use alternative file formats such as mmCIF. However, if so, that would mean we would have to add full mmCIF functionality to the `PDBManager` as a secondary file type supported by the code. I am not a huge fan of this idea, since I would like to keep the manager as `.pdb`-file-friendly and focused as possible. Is it at all possible to "convert" mmCIF files into their PDB file counterparts, for example, using a Python library of some kind? If so, maybe we can approach the problem in this way?
Of course, this whole idea assumes that we have access to these mmCIF files when we find out that a certain PDB structure is represented by an mmCIF file (which we currently don't have a rule for determining in advance yet).
AFAIK conversion isn't the most reliable process. I added rudimentary capability to biopandas:
But it doesn't appear to be available on PyPI yet.
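As an aside, while the BioPandas route is pending on PyPI, a conversion of this kind can also be sketched with Biopython (assuming Biopython is installed; this is a best-effort route, not the BioPandas implementation referenced above):

```python
from Bio.PDB import MMCIFParser, PDBIO


def mmcif_to_pdb(cif_path: str, pdb_path: str) -> None:
    """Best-effort mmCIF -> PDB conversion via Biopython.

    Conversion is lossy and may fail or emit malformed records for
    structures that exceed PDB-format limits (e.g., >62 chains or
    >99,999 atoms), since those cannot be represented in fixed PDB
    columns at all.
    """
    structure = MMCIFParser(QUIET=True).get_structure("structure", cif_path)
    io = PDBIO()
    io.set_structure(structure)
    io.save(pdb_path)
```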
Part of why the PDB does this is that a large number of chains will break the column spacing in PDB files, so we'd also need to handle chain renaming (the limit is 62 chains, presumably for `a-zA-Z0-9`). I think the most straightforward thing to do is to try to catch these errors and log a warning. The PDB maintains a list of these cases:

> An index of all structure entries that do not have a standard PDB format file is updated regularly on the PDB FTP site at https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt. This list will continue to grow as new large structures are deposited and released.

so we could simply add a `pdb_available` column to the dataframe to filter them out.
I agree with your approach. I think adding a `pdb_available` column value based on the corresponding entry in https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt should work for the first version of our logic to handle this edge case. If we go this route, do we want to filter out rows where `pdb_available` is `False` by default, or should we notify users about this edge case directly and ask them to filter out these structures themselves?
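A first version of that logic might look like the sketch below. The `pdb_available` column name is the one proposed above; the assumption that the index file lists one bundle-only PDB ID per line should be verified against the live file:

```python
from urllib.request import urlopen

import pandas as pd

BUNDLE_INDEX_URL = (
    "https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt"
)


def fetch_bundle_only_ids() -> set[str]:
    """Download the wwPDB index of entries with no standard PDB-format file."""
    with urlopen(BUNDLE_INDEX_URL) as fh:
        text = fh.read().decode()
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def mark_pdb_availability(df: pd.DataFrame, bundle_only: set[str]) -> pd.DataFrame:
    """Add a boolean `pdb_available` column based on the bundle-only ID set.

    Assumes `df` has a `pdb` column of (case-insensitive) PDB IDs.
    """
    out = df.copy()
    out["pdb_available"] = ~out["pdb"].str.lower().isin(bundle_only)
    return out
```

Keeping the network fetch separate from the dataframe logic makes the latter easy to test offline.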
In addition to this, I just noticed another edge case (possibly). We might also need to consider the fact that certain PDB files contain multiple models for each protein target (e.g., Model 1 or Model 3). If so, should we default to always selecting the first available model for each protein target (i.e., PDB code)? Or, do you think we should still include all models for a given PDB code in the corresponding "collated" PDB files (i.e., those exported by `pdb_manager.export_pdbs()`) and have users pick the models they want to use for their downstream tasks?
> I agree with your approach. I think adding a `pdb_available` column value based on the corresponding entry in https://files.wwpdb.org/pub/pdb/compatible/pdb_bundle/pdb_bundle_index.txt should work for the first version of our logic to handle this edge case. If we go this route, do we want to filter out rows where `pdb_available` is `False` by default, or should we notify users about this edge case directly and ask them to filter out these structures themselves?
Hmm. I think we should raise a (descriptive!) warning whenever an export function is called (e.g. to fasta or csv) and there are unavailable PDBs in the dataframe. This could be an error for download functions. This would ensure that the state of the PDBManager is always consistent with the downloaded structures so there are no hidden surprises with missing structures downstream.
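The warn-on-export, error-on-download behavior described above could be sketched as follows (the function and column names are hypothetical, not the `PDBManager` API):

```python
import warnings

import pandas as pd


def check_pdb_availability(df: pd.DataFrame, downloading: bool = False) -> None:
    """Warn on export, or raise on download, if the current selection
    contains entries with no standard PDB-format file.

    Assumes a boolean `pdb_available` column and a `pdb` ID column.
    """
    unavailable = df.loc[~df["pdb_available"], "pdb"].tolist()
    if not unavailable:
        return
    message = (
        f"{len(unavailable)} selected entries have no standard PDB-format "
        f"file (e.g., {unavailable[:5]}); see the wwPDB pdb_bundle index."
    )
    if downloading:
        # Downloads would silently lose structures, so fail loudly here.
        raise ValueError(message)
    warnings.warn(message)
```

Raising only on download keeps the manager's state consistent with what is actually on disk, while exports merely surface the discrepancy.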
> In addition to this, I just noticed another edge case (possibly). We might also need to consider the fact that certain PDB files contain multiple models for each protein target (e.g., Model 1 or Model 3). If so, should we default to always selecting the first available model for each protein target (i.e., PDB code)? Or, do you think we should still include all models for a given PDB code in the corresponding "collated" PDB files (i.e., those exported by `pdb_manager.export_pdbs()`) and have users pick the models they want to use for their downstream tasks?
That's a good point. It should actually be quite a straightforward modification to `extract_chains_to_file` in `graphein.protein.utils`, as I added this functionality to BioPandas. How exactly users are supposed to choose between models is less clear, though, as none of the metadata we have collected so far makes reference to it. Selecting the first model is probably the safest (and, IIRC, the default behaviour in Graphein). At the risk of growing the scope too much, we could also support downloading biological assemblies.
@a-r-j, thanks for your insights on these matters. In this commit, I've made the `extract_chains_to_file` function default to only writing out chains corresponding to a PDB's first model (i.e., `model=1` with BioPandas; thanks for adding this functionality to BioPandas beforehand!). I've also tested your `PDBManager` tutorial notebook. I can confirm that now, as intended, I am seeing a raised exception informing users that their current PDB selection contains a protein for which the PDB file cannot be found/downloaded.
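At the file level, the "keep only the first model" default amounts to something like the text-level sketch below (the actual implementation goes through BioPandas rather than raw PDB text):

```python
def keep_first_model(pdb_text: str) -> str:
    """Keep only the first MODEL's coordinates in a multi-model PDB file.

    Non-coordinate records (HEADER, SEQRES, END, ...) are preserved;
    MODEL/ENDMDL bookkeeping records are dropped along with all
    coordinate records belonging to models 2..N.
    """
    kept, model_count = [], 0
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # PDB record names occupy columns 1-6
        if record == "MODEL":
            model_count += 1
            continue
        if record == "ENDMDL":
            continue
        if model_count > 1 and record in {"ATOM", "HETATM", "TER", "ANISOU"}:
            continue  # coordinates from models beyond the first
        kept.append(line)
    return "\n".join(kept)
```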
Amazing, thanks for that! I'll give this a once over in the morning & then I think we're ready for an initial release (unless there's anything outstanding that you want to add?)
This plan sounds good to me. Besides adding support for structure-based splits (in future work ;)) using `foldseek`, I think we are good to go for a full release. We can add `foldseek` structural clustering/splitting later on, upon request/project needs.
One last thought before we merge this in. Maybe we should add a section to the tutorial notebook on how to specify splits for an ML dataset with the `PDBManager`, using either time-based splitting or sequence clustering?
Yep, very good point. I've not rigorously explored that part of the API so I'll take care of it! :)
Got sidetracked with some code spring cleaning & additional utils. Apologies for making this PR into a mess :L
I worked on the notebook some more but there are some cells on the temporal splitting where Jupyter hangs and dies on me. I've commented them out; any chance you could take a look?
Moved `PDBManager` to `graphein.ml.datasets` so it can live with its friends.

@a-r-j, apologies for my delayed response time here. I just ran the latest version of the `PDBManager` tutorial notebook, and I was successfully able to perform all time-based splitting and sequence-based clustering in succession. I did not encounter any freezing issues, and (very nicely) I was able to reproduce all outputs from your most recent execution of the tutorial notebook. This suggests that our entire data-fetching and splitting procedure is truly deterministic and reproducible! Great work on putting this notebook together. The only small comment I have on it is that users may also like to know that they can compose multiple types of splitting operations together. For example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits. Other than that, everything looks good to me!
Great! It seems it was the clustering step with MMseqs2 that Jupyter wasn't happy with.
> The only small comment I have on it is that users may also like to know that they can compose multiple types of splitting operations together. For example, they can perform a time-based split after clustering sequences into train/val/test splits without corrupting the sequence-based splits.
Good point! Will add this.
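For the notebook, the composability point boils down to filtering within existing split assignments rather than re-splitting. A generic dataframe sketch (the PDB IDs and dates are made up, and this is not the `PDBManager` API itself):

```python
import pandas as pd

# Hypothetical metadata after sequence clustering has already assigned splits.
df = pd.DataFrame(
    {
        "pdb": ["1abc", "2def", "3ghi", "4jkl"],
        "split": ["train", "train", "val", "test"],
        "deposition_date": pd.to_datetime(
            ["2019-01-01", "2021-06-01", "2018-03-15", "2022-02-02"]
        ),
    }
)

# A temporal filter applied *within* an existing split leaves the
# sequence-based assignment untouched, so the two operations compose.
cutoff = pd.Timestamp("2020-01-01")
train_pre_cutoff = df[(df["split"] == "train") & (df["deposition_date"] < cutoff)]
```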
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell
No Coverage information
0.0% Duplication
Reference Issues/PRs
https://github.com/a-r-j/graphein/discussions/270 @a-r-j
What does this implement/fix? Explain your changes
Adds a utility for creating selections of experimental PDB structures
What testing did you do to verify the changes in this PR?
WIP
Draws the following metadata:
Currently missing:
Organization of output files into corresponding directories for each requested dataset split (e.g., `train`, `val`, and `test`).

Pull Request Checklist
- Update the `./CHANGELOG.md` file (if applicable)
- Add tests to the `./graphein/tests/*` directories (if applicable)
- Add or update tutorials in `./notebooks/` (if applicable)
- Run `python -m py.test tests/` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `python -m py.test tests/protein/test_graphs.py`)
- Run `black .` and `isort .`