FoldComp ML Datasets

a-r-j commented 1 year ago

Reference Issues/PRs

cc @amorehead

N/A

What does this implement/fix? Explain your changes

Adds utility to create ML datasets from FoldComp databases.

What testing did you do to verify the changes in this PR?

Local testing. I will not add unit tests because the required downloads are too large.

Pull Request Checklist

[x] Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
[ ] Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
[x] Modify documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
[ ] Ran python -m py.test tests/ and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
[x] Checked for style issues by running black . and isort .

review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

amorehead commented 1 year ago

@a-r-j, looks good to me so far! Do we have a list of features to aim for in a first release?

a-r-j commented 1 year ago

Yep! Off the top of my head:

[x] A couple of changes to fully align the API with PyTorch Geometric Datasets (e.g. accepting transforms)
[ ] A utility for adding graph/node labels & metadata
[x] A Lightning DataModule wrapper for complete plug-and-play functionality.

Of these, 2) is the trickiest. In the worst case (using all of the UniProt predictions, approx. 200M structures IIRC), graph/node labels would have to be mapped into the Data/Protein objects at the point where we fetch a structure, so that they are never all held in memory (200M node label tensors seems... prohibitive). I think the way to go is to connect this to an (optional) LMDB, which could also store additional pre-computed features. Then, when we fetch a structure, we pull in these additional data and store them in the returned Data/Protein object.
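To make the idea concrete, here is a rough sketch of per-item label lookup. It is not the implementation in this PR: `LabelStore` and `LazyLabelledDataset` are hypothetical names, the stdlib `sqlite3` module stands in for LMDB so the snippet runs without extra dependencies, and plain dicts stand in for Data/Protein objects. The IDs and labels are made up for illustration.

```python
import json
import sqlite3
from typing import Callable, Optional, Sequence


class LabelStore:
    """Small key-value store. sqlite3 is used here as a stand-in for LMDB."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS labels (id TEXT PRIMARY KEY, payload TEXT)"
        )

    def put(self, structure_id: str, payload: dict) -> None:
        # The connection context manager commits the write on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO labels VALUES (?, ?)",
                (structure_id, json.dumps(payload)),
            )

    def get(self, structure_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT payload FROM labels WHERE id = ?", (structure_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


class LazyLabelledDataset:
    """Hypothetical dataset that attaches labels only when an item is fetched."""

    def __init__(
        self,
        ids: Sequence[str],
        load_structure: Callable[[str], dict],
        store: Optional[LabelStore] = None,
        transform: Optional[Callable[[dict], dict]] = None,
    ) -> None:
        self.ids = list(ids)
        self.load_structure = load_structure
        self.store = store  # optional, mirroring the "(optional) LMDB" idea
        self.transform = transform

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, idx: int) -> dict:
        structure_id = self.ids[idx]
        item = dict(self.load_structure(structure_id))
        if self.store is not None:
            extra = self.store.get(structure_id)
            if extra:
                item.update(extra)  # labels and pre-computed features
        if self.transform is not None:
            item = self.transform(item)
        return item


# Made-up IDs and labels, for illustration only.
store = LabelStore()
store.put("example_a", {"graph_label": 1, "node_labels": [0, 1, 1]})
dataset = LazyLabelledDataset(
    ids=["example_a", "example_b"],
    load_structure=lambda sid: {"id": sid},
    store=store,
)
print(dataset[0])
```

Only the list of IDs lives in memory; each label payload is read from the store when its structure is requested, and structures without an entry are returned unchanged.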

FWIW, I see this functionality as complementary to the other strand of dataset creation we've been doing in #272. Essentially, I think a model workflow looks like: make a dataset selection with a Manager -> instantiate a FoldCompDataset -> wrap it in a LightningDataModule (optional).

I also saw that, as of 0.0.3 (released today), FoldComp supports multi-chain structures. I'm not sure whether this now extends support to "real" PDB files (i.e. from the PDB), but if it does, this is something to strongly consider in #272 as an export option.

a-r-j commented 1 year ago

I'm actually happy to merge this now, @amorehead.

I think the LMDB initiative will take a little effort so I'm happy to leave that until some explicit requests are made.

One request: would you be able to check out this PR and run the notebook in lieu of proper testing? The downloads are too large for GitHub to be happy with unit testing :(

amorehead commented 1 year ago

@a-r-j, I am currently trying to import the new FoldCompDataset class you created, and I am facing the following error:

Even after rerunning "pip3 install -e ." inside the root graphein project directory, it seems that the FoldComp notebook is not able to locate the FoldCompDataset object path. Did you have to do anything in particular for the notebook to find this object?

a-r-j commented 1 year ago

Thanks for checking it out!

Hmm, no, I didn't have to do anything special. Could it be a kernel issue?
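One quick way to check for a kernel mismatch is to run something like the following in a notebook cell. It is only a diagnostic sketch: `describe_import` is a hypothetical helper name, and it uses nothing beyond the standard library. If the reported interpreter is not the environment where `pip3 install -e .` was run, the kernel is probably pointing at a different Python.

```python
import importlib.util
import sys


def describe_import(module_name: str) -> dict:
    """Report the running interpreter and where a module would be loaded from."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised for some dotted names or half-initialised modules.
        spec = None
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(describe_import("graphein"))
```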

amorehead commented 1 year ago

I think you are right. When I tried running the notebook again today, it magically started to work - thanks Jupyter :)

amorehead commented 1 year ago

I was able to run the whole notebook successfully, and all the outputs (as best as I can tell for now) look good to me. Great (and quick) work on this. I can already sense the impact of having this kind of open-source infrastructure available!

amorehead commented 1 year ago

Can you think of anything else we need to check for in this new branch before merging it in?