coleygroup / molpal

active learning for accelerated high-throughput virtual screening

Bug/ray remotes #37

Closed degraff-roivant closed 2 years ago

degraff-roivant commented 2 years ago

Description

This PR changes how MPNN prediction is chunked out over the ray cluster.

Current workflow

To distribute molecular prediction across the ray cluster, a ray.remote function is created from mpnn.predict during MPNN initialization:

```python
if ngpu > 0:
    self.use_gpu = True
    self._predict = ray.remote(num_cpus=ncpu, num_gpus=1)(mpnn.predict)
    self.num_workers = ngpu
else:
    self.use_gpu = False
    self._predict = ray.remote(num_cpus=ncpu)(mpnn.predict)
    self.num_workers = int(ray.cluster_resources()["CPU"] // self.ncpu)
```
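For context, the fan-out/gather over the cluster looks roughly like the sketch below. This is a hypothetical illustration, not the actual MolPAL code: `predict_remote`, `smis`, and `chunk_size` are stand-in names, and the real `mpnn.predict` signature differs.

```python
from itertools import islice

import ray

ray.init()

@ray.remote
def predict_remote(smis):
    """Stand-in for remote MPNN prediction on one chunk of SMILES."""
    return [0.0 for _ in smis]  # placeholder predictions

def batches(xs, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(xs)
    while chunk := list(islice(it, size)):
        yield chunk

smis = ["CCO", "c1ccccc1", "CC(=O)O"]
chunk_size = 2

# one task per chunk; ray schedules them across the available workers
refs = [predict_remote.remote(chunk) for chunk in batches(smis, chunk_size)]
# block until all chunks finish, then flatten the per-chunk predictions
preds = [y for ys in ray.get(refs) for y in ys]
```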

However, this is a suboptimal code structure: ray.remote functions are intended to be defined at the module level, and a similar remote function is created in the ray cluster every time an MPNN object is initialized. The intention behind this block hinges on the word "similar". The core logic of the remote function that's created is always the same (i.e., it's just the remote function of mpnn.predict), but the resources it requests differ, hence the definition at object initialization. There are currently some obscure runtime errors where a remote function is ejected from memory and the whole of MolPAL crashes, and I suspect this block might be the culprit. Even if it's not, it's still worthwhile to change.

New workflow

Now we rely on the fact that we can define a parent remote function that lives permanently in the ray cluster and dynamically create children that modify the resources required to run it (we can't know before runtime how many CPUs/GPUs the function should be run with). We change the above block to the following:

```python
if ngpu > 0:
    self.use_gpu = True
    self._predict = mpnn.predict_.options(num_cpus=ncpu, num_gpus=1)
    self.num_workers = ngpu
else:
    self.use_gpu = False
    self._predict = mpnn.predict_.options(num_cpus=ncpu)
    self.num_workers = int(ray.cluster_resources()["CPU"] // self.ncpu)
```
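This assumes `predict_` is declared once at module level in `mpnn.py` as the remote counterpart of `predict`. A minimal sketch of that pattern (the signature shown here is illustrative, not the actual one):

```python
# mpnn.py -- sketch of the module-level parent remote function.
import ray

def predict(model, smis, batch_size=50):
    """Run the MPNN forward pass over a list of SMILES strings."""
    ...

# defined once at import time; resource requirements are specialized
# later via predict_.options(...)
predict_ = ray.remote(predict)
```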

Creating a new remote function is relatively expensive, but creating a child remote function that just alters resource requirements should be fairly cheap. I would also argue that this code is slightly less obscure than the above and can serve as a new convention in the codebase: annotating global remote functions with a trailing _. That is, if we have a function foo() and we want to run it remotely, we know to call foo_.remote(). A small illustration of the convention (with a made-up function foo):
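```python
import ray

def foo(x):
    """Ordinary, locally callable function."""
    return x + 1

# module-level remote counterpart, following the trailing-underscore convention
foo_ = ray.remote(foo)

y = foo(1)                                   # run locally
ref = foo_.options(num_cpus=2).remote(1)     # run remotely with custom resources
y_remote = ray.get(ref)
```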

Checklist