Closed a-r-j closed 1 year ago
@a-r-j, looks good to me so far! Do we have a list of features to aim for in a first release?
Yep! Off the top of my head:
Datasets (eg accepting transforms)
DataModule wrapper for complete plug and play functionality.
Of these, 2) is the trickiest. Assuming the worst case scenario (using all of the uniprot predictions, approx 200m structures iirc), mapping graph/node labels into the Data/Protein objects would have to be done when we get a structure so they're not stored in memory (200m node label tensors seems.. prohibitive). I think the way to go is to connect this to an (optional) LMDB which could also store additional pre-computed features. Thus when we get a structure we pull in these additional data and store them in the returned Data/Protein.
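A minimal sketch of the lazy lookup described above, under stated assumptions: labels live in an on-disk key-value store and are attached only when a structure is fetched, so nothing is held in memory up front. The standard-library sqlite3 module stands in for LMDB here so the example runs without extra dependencies, and LabelStore, ProteinRecord and LazyLabelDataset are hypothetical names, not part of Graphein or FoldComp.

```python
import pickle
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


class LabelStore:
    """Hypothetical stand-in for an LMDB environment: key -> pickled value."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS labels (key TEXT PRIMARY KEY, value BLOB)"
        )

    def put(self, key: str, value: Any) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO labels VALUES (?, ?)",
            (key, pickle.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[Any]:
        row = self._conn.execute(
            "SELECT value FROM labels WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else pickle.loads(row[0])


@dataclass
class ProteinRecord:
    """Hypothetical stand-in for the returned Data/Protein object."""

    name: str
    coords: List[List[float]]
    extras: Dict[str, Any] = field(default_factory=dict)


class LazyLabelDataset:
    """Attaches labels/features from the store only when an item is fetched."""

    def __init__(
        self,
        structures: Dict[str, List[List[float]]],
        store: Optional[LabelStore] = None,
    ) -> None:
        self._structures = structures
        self._names = list(structures)
        self._store = store  # optional, mirroring the "(optional) LMDB" idea

    def __len__(self) -> int:
        return len(self._names)

    def get(self, idx: int) -> ProteinRecord:
        name = self._names[idx]
        record = ProteinRecord(name=name, coords=self._structures[name])
        if self._store is not None:
            extras = self._store.get(name)
            if extras is not None:
                record.extras.update(extras)
        return record


# Toy data: two structures, only one of which has stored labels.
store = LabelStore()
store.put("P12345", {"node_labels": [0, 1, 1], "graph_label": 1})
dataset = LazyLabelDataset(
    {
        "P12345": [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]],
        "Q99999": [[0.0, 1.0, 0.0]],
    },
    store=store,
)
labelled = dataset.get(0)
unlabelled = dataset.get(1)
```

Structures without an entry in the store come back with empty extras, which keeps the store optional on a per-item basis as well as per-dataset.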
FWIW, I see this functionality as complementary to the other strand of dataset creation we've been doing in #272. Essentially, I think a model workflow looks like: make a dataset selection with a Manager -> instantiate a FoldCompDataset -> wrap it in a LightningModule (optional).
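To make the three stages of that workflow concrete, here is a rough sketch in plain Python. Every name in it (select_ids, ToyStructureDataset, ToyDataModule) is a hypothetical stand-in chosen for illustration; none of it is the actual Manager, FoldCompDataset or Lightning API.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def select_ids(metadata: Dict[str, dict], max_length: int) -> List[str]:
    """Stage 1 (stand-in for a dataset manager): select entries by metadata."""
    return sorted(k for k, v in metadata.items() if v["length"] <= max_length)


class ToyStructureDataset:
    """Stage 2 (stand-in for the dataset): items are built on access."""

    def __init__(
        self,
        ids: Sequence[str],
        transform: Optional[Callable[[dict], dict]] = None,
    ) -> None:
        self.ids = list(ids)
        self.transform = transform

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, idx: int) -> dict:
        item = {"id": self.ids[idx]}
        # Apply the optional transform at access time, not at construction.
        return self.transform(item) if self.transform else item


class ToyDataModule:
    """Stage 3 (optional): hold reproducible train/validation splits."""

    def __init__(
        self, dataset: ToyStructureDataset, val_fraction: float = 0.25, seed: int = 0
    ) -> None:
        indices = list(range(len(dataset)))
        random.Random(seed).shuffle(indices)
        n_val = int(len(indices) * val_fraction)
        self.dataset = dataset
        self.val_idx = indices[:n_val]
        self.train_idx = indices[n_val:]

    def train_items(self) -> List[dict]:
        return [self.dataset[i] for i in self.train_idx]

    def val_items(self) -> List[dict]:
        return [self.dataset[i] for i in self.val_idx]


# Made-up metadata for five entries; "C" is filtered out by length.
metadata = {
    "A": {"length": 120},
    "B": {"length": 80},
    "C": {"length": 900},
    "D": {"length": 300},
    "E": {"length": 50},
}
ids = select_ids(metadata, max_length=400)
dataset = ToyStructureDataset(ids, transform=lambda d: {**d, "tagged": True})
module = ToyDataModule(dataset, val_fraction=0.25, seed=0)
train, val = module.train_items(), module.val_items()
```

Keeping selection, item construction and splitting in separate objects is what lets the last stage stay optional: the dataset is usable on its own.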
I also saw as of 0.0.3 (today) FoldComp supports multi-chain structures. I'm not sure if this now expands support to "real" (i.e. from the PDB) PDB files, but if it does this is something to strongly consider in #272 as an export option.
I'm actually happy to merge this now, @amorehead.
I think the LMDB initiative will take a little effort so I'm happy to leave that until some explicit requests are made.
One request: would you be able to checkout this PR and run the notebook in lieu of proper testing? The downloads are too large for GitHub to be happy with unit testing :(
@a-r-j, I am currently trying to import the new FoldCompDataset class you created, and I am facing the following error:
Even after rerunning "pip3 install -e ." inside the root graphein project directory, it seems that the FoldComp notebook is not able to locate the FoldCompDataset object path. Did you have to do anything in particular for the notebook to find this object?
Thanks for checking it out!
Hmm, no, I didn't have to do anything special. Could it be a kernel issue?
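One quick way to test the kernel hypothesis is to run something like the following in a notebook cell. It is only a diagnostic sketch using the standard library; describe_import is a hypothetical helper name, and "graphein" is simply the package under discussion.

```python
import importlib.util
import sys


def describe_import(module_name: str) -> dict:
    """Report which interpreter is running and where a module would load from."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised e.g. when a parent package of a dotted name is missing.
        spec = None
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "found": spec is not None,
        "origin": getattr(spec, "origin", None),
    }


print(describe_import("graphein"))
```

If the interpreter path differs from the environment where the editable install was run, or the origin points at an older copy of the package, the notebook is likely attached to a different or stale kernel. Restarting the kernel after the install is usually worth trying first.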
I think you are right. When I tried running the notebook again today, it magically started to work - thanks Jupyter :)
I was successfully able to run the whole notebook, and all the outputs (as best as I can tell for now) look good to me. Great (and quick) work on this. I can sense the impact of having this kind of open-source infrastructure available already!
Can you think of anything else we need to check for in this new branch before merging it in?
Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 1 Code Smell, no coverage information, 0.0% duplication.
Awesome, thanks for checking that out! Happy to merge now. Will migrate the LMDB discussion over to a dedicated issue.
Reference Issues/PRs
cc @amorehead
N/A
What does this implement/fix? Explain your changes
Adds utility to create ML datasets from FoldComp databases.
What testing did you do to verify the changes in this PR?
Local testing. Will not add unit tests because the required downloads are too large.
Pull Request Checklist
./CHANGELOG.md file (if applicable)
./graphein/tests/* directories (if applicable)
./notebooks/ (if applicable)
python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
black . and isort .