a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

FoldComp ML Datasets #284

Closed a-r-j closed 1 year ago

a-r-j commented 1 year ago

Reference Issues/PRs

cc @amorehead

N/A

What does this implement/fix? Explain your changes

Adds utility to create ML datasets from FoldComp databases.

What testing did you do to verify the changes in this PR?

Local testing only. Will not add unit tests because the required downloads are too large.

Pull Request Checklist

review-notebook-app[bot] commented 1 year ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

amorehead commented 1 year ago

@a-r-j, looks good to me so far! Do we have a list of features to aim for in a first release?

a-r-j commented 1 year ago

Yep! Off the top of my head:

Of these, 2) is the trickiest. Assuming the worst-case scenario (using all of the UniProt predictions, approx. 200M structures iirc), graph/node labels would have to be mapped into the Data/Protein objects at the point a structure is retrieved, so that they're not all held in memory (200M node label tensors seems... prohibitive). I think the way to go is to connect this to an (optional) LMDB, which could also store additional pre-computed features. Then, when we get a structure, we pull in these additional data and store them in the returned Data/Protein.
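
To make the lookup-on-retrieval idea concrete, here is a rough sketch under stated assumptions. `ProteinRecord`, `InMemoryStore`, and `LazyLabelDataset` are hypothetical names, not graphein or FoldComp APIs, and a dict of pickled bytes stands in for LMDB so the example runs on the standard library alone. The only point it illustrates is that labels are fetched per item in `__getitem__` rather than loaded up front.

```python
import pickle
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Protocol, Sequence


class KeyValueStore(Protocol):
    """Anything mapping byte keys to byte values (an LMDB-like interface)."""

    def get(self, key: bytes) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in for an on-disk store: keeps pickled values in a dict."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key.encode()] = pickle.dumps(value)

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


@dataclass
class ProteinRecord:
    """Hypothetical minimal substitute for a Data/Protein object."""

    name: str
    coords: List[List[float]]
    extras: Dict[str, Any] = field(default_factory=dict)


class LazyLabelDataset:
    """Attaches labels when an item is requested, not at construction."""

    def __init__(
        self,
        names: Sequence[str],
        load_structure: Callable[[str], List[List[float]]],
        label_store: Optional[KeyValueStore] = None,
    ) -> None:
        self.names = list(names)
        self.load_structure = load_structure
        self.label_store = label_store  # optional, as proposed above

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int) -> ProteinRecord:
        name = self.names[idx]
        record = ProteinRecord(name=name, coords=self.load_structure(name))
        if self.label_store is not None:
            raw = self.label_store.get(name.encode())
            if raw is not None:
                # Only this one entry is deserialised and held in memory.
                record.extras.update(pickle.loads(raw))
        return record


# Illustrative usage with made-up identifiers and a dummy structure loader.
store = InMemoryStore()
store.put("example_a", {"graph_label": 1, "node_labels": [0, 1, 0]})
dataset = LazyLabelDataset(
    names=["example_a", "example_b"],
    load_structure=lambda name: [[0.0, 0.0, 0.0]],
    label_store=store,
)
first = dataset[0]   # has extras from the store
second = dataset[1]  # no entry in the store, so extras stays empty
```

Swapping `InMemoryStore` for a real LMDB-backed class with the same `get` method would leave the dataset code unchanged, which is roughly why the store is kept behind a small interface here.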

FWIW, I see this functionality as complementary to the other strand of dataset creation we've been doing in #272. Essentially, I think a model workflow looks like: make a dataset selection with a Manager -> instantiate a FoldCompDataset -> wrap it in a LightningDataModule (optional).

I also saw as of 0.0.3 (today) FoldComp supports multi-chain structures. I'm not sure if this now expands support to "real" (i.e. from the PDB) PDB files, but if it does this is something to strongly consider in #272 as an export option.

a-r-j commented 1 year ago

I'm actually happy to merge this now, @amorehead.

I think the LMDB initiative will take a little effort so I'm happy to leave that until some explicit requests are made.

One request: would you be able to check out this PR and run the notebook in lieu of proper testing? The downloads are too large for GitHub to be happy with unit testing :(

amorehead commented 1 year ago

@a-r-j, I am currently trying to import the new FoldCompDataset class you created, and I am facing the following error:

[image: screenshot of the error]

Even after rerunning "pip3 install -e ." inside the root graphein project directory, it seems that the FoldComp notebook is not able to locate the FoldCompDataset object path. Did you have to do anything in particular for the notebook to find this object?

a-r-j commented 1 year ago

Thanks for checking it out!

Hmm, no, I didn't have to do anything special. Could it be a kernel issue?
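
If it helps narrow it down, a quick check along these lines can be run in a notebook cell to see which interpreter the kernel is using and whether it can locate a given module. This is only a sketch: `describe_import` is a hypothetical helper, not part of graphein, and it relies solely on the standard library.

```python
import importlib.util
import sys


def describe_import(module_name: str) -> dict:
    """Report the running interpreter and whether it can find a module."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # A missing parent package or a broken module spec ends up here.
        spec = None
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


# In the notebook: compare "interpreter" with the Python used for pip install.
print(describe_import("graphein"))
```

If the interpreter path differs from the environment where the editable install was done, the kernel is pointing at a different environment, and restarting or reselecting the kernel is the usual remedy.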

amorehead commented 1 year ago

I think you are right. When I tried running the notebook again today, it magically started to work - thanks Jupyter :)

amorehead commented 1 year ago

I was able to run the whole notebook successfully, and all the outputs (as best as I can tell for now) look good to me. Great (and quick) work on this. I can sense the impact of having this kind of open-source infrastructure available already!

amorehead commented 1 year ago

Can you think of anything else we need to check for in this new branch before merging it in?

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 1 (rating A)

Coverage: no information
Duplication: 0.0%

a-r-j commented 1 year ago

Awesome, thanks for checking that out! Happy to merge now. Will migrate the LMDB discussion over to a dedicated issue.