AstraZeneca / chemicalx

A PyTorch and TorchDrug based deep learning library for drug pair scoring. (KDD 2022)

https://chemicalx.readthedocs.io

Apache License 2.0

715 stars 87 forks source link

Add multi-GPU support with `accelerate` #76

Closed GavEdwards closed 2 years ago

GavEdwards commented 2 years ago

Summary

Enable GPU support (+more) via the Accelerate library.

This is still work in progress - there's still some bugs to be ironed out around multi-gpu and some models.

TODO:

[ ] add documentation for accelerate
[ ] multi-gpu tests
[ ] test with all models

[ ] Unit tests provided for these changes
[ ] Documentation and docstrings added for these changes using the sphinx style

Changes

Add Accelerate as a dependency
Adjust the pipeline code to use accelerate

benedekrozemberczki commented 2 years ago

Packedgraphs do not have a .to() method, but a custom .cuda() method. Pretty messed up.

https://github.com/DeepGraphLearning/torchdrug/blob/master/torchdrug/data/graph.py

GavEdwards commented 2 years ago

Examples: https://github.com/huggingface/accelerate/tree/main/examples

cthoyt commented 2 years ago

So one main thing about this PR is that I don't think it's necessary to use Accelerate to add GPU support - we only have to more cleverly keep track of devices. I'm not against using Accelerate since it has some nice add-ons for multi-gpu etc. but I wouldn't depend on it to solve the original problem

Perhaps we can monkey patch a to() into the packed graph class

GavEdwards commented 2 years ago

So one main thing about this PR is that I don't think it's necessary to use Accelerate to add GPU support - we only have to more cleverly keep track of devices. I'm not against using Accelerate since it has some nice add-ons for multi-gpu etc. but I wouldn't depend on it to solve the original problem

Perhaps we can monkey patch a to() into the packed graph class

Hi @cthoyt 😄 My thinking was this approach avoids us having to re-invent the wheel and quickly solves the current gpu need. The original issue doesn't describe requirements to aim for #65.

Two questions:

In your opinion, what parts of the original problem doesn't Accelerate solve?
What requirements need to be met for this approach be accepted?

I'm not attached to Accelerate as a solution either, just trying to understand limits & needs

codecov-commenter commented 2 years ago

Codecov Report

Merging #76 (0542ab9) into main (20f0e85) will decrease coverage by 0.65%. The diff coverage is 54.16%.

@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
- Coverage   94.65%   94.00%   -0.66%     
==========================================
  Files          34       34              
  Lines        1478     1500      +22     
==========================================
+ Hits         1399     1410      +11     
- Misses         79       90      +11

Impacted Files	Coverage Δ
chemicalx/pipeline.py	`80.19% <54.16%> (-8.41%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 20f0e85...0542ab9. Read the comment docs.

cthoyt commented 2 years ago

I'm saying that implementing GPU usability and implementing accelerate are two independent things, and I would rather see a PR that first enables GPU usage explicitly without accelerate to make sure we don't make any accelerate-specific mistakes

cthoyt commented 2 years ago

Additionally I've opened a PR on torchdrug to solve the problem upstream, which will be much more elegant than us hacking it in: https://github.com/DeepGraphLearning/torchdrug/pull/70. In the meantime, we could provide a compat module where we subclass PackedGraph with this function built-in and refer to that class throughout chemicalx

cthoyt commented 2 years ago

@GavEdwards please note we've already merged a simple solution into the main branch and now updated your PR with it, please check it out