## Description

This PR changes the manner in which MPNN prediction is chunked out over the `ray` cluster.

## Current workflow

To distribute molecular prediction across the `ray` cluster, a `ray.remote` function is created from `mpnn.predict` during `MPNN` initialization.

However, this is a suboptimal code structure, as
`ray.remote` functions are intended to be defined at the module level, yet a similar remote function is created in the `ray` cluster every time an `MPNN` object is initialized. The intention of this block hinges on the word "similar": the core logic of the remote function that's created is the same (i.e., it's just the remote function of `mpnn.predict`), but its resource requirements are different, hence the definition at object initialization. There are currently some obscure runtime errors where a remote function is ejected from memory and the whole of MolPAL crashes, and I suspect this block might be a culprit. Even if it's not, it's still worthwhile to change.

## New workflow
Now, we rely on the fact that we can create a parent remote function that permanently lives in the `ray` cluster and dynamically define children that modify the resources required to run it (we can't know before runtime how many CPUs/GPUs the function should be run over). We change the above block accordingly.

Creating a new remote function is relatively expensive, but creating a child remote function that just alters resource requirements should be fairly cheap. I would also argue that this code is slightly less obscure than before and can serve as a new convention in the codebase: annotating global remote functions with a trailing `_`. I.e., if we have a function `foo()` and we want to run it remotely, we know to call `foo_.remote()`.
## Checklist