OpenFreeEnergy / gufe

grand unified free energy by OpenFE
https://gufe.readthedocs.io
MIT License

[Question] ProtocolUnit key stability #245

Closed (LilDojd closed this issue 11 months ago)

LilDojd commented 11 months ago

ProtocolUnit overwrites _gufe_tokenize of GufeTokenizable:

https://github.com/OpenFreeEnergy/gufe/blob/f87e2363326bf62deca2b8a4a6105f3f563fdd1e/gufe/protocols/protocolunit.py#L96-L98
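For context, the effect of that override is that the token is generated fresh per instance rather than derived from the instance's contents. A minimal stand-in (these classes are invented here for illustration and are not gufe's actual implementation) that mimics the two behaviors:

```python
import hashlib
import uuid


class TokenizableSketch:
    """Stand-in for GufeTokenizable: key derived from instance contents."""

    def _to_dict(self):
        return self.__dict__

    def _gufe_tokenize(self):
        # content-based token: equal _to_dict() -> equal token
        payload = repr(sorted(self._to_dict().items())).encode()
        return hashlib.md5(payload).hexdigest()

    @property
    def key(self):
        return f"{type(self).__name__}-{self._gufe_tokenize()}"


class PlainSketch(TokenizableSketch):
    """Keeps the content-based default."""

    def __init__(self, some_value):
        self.some_value = some_value


class UnitSketch(TokenizableSketch):
    """Stand-in for ProtocolUnit: the override discards content hashing."""

    def __init__(self, some_value):
        self.some_value = some_value

    def _gufe_tokenize(self):
        # fresh UUID per instance -> a new key on every construction
        return uuid.uuid4().hex


# the default gives equal keys for equal content...
assert PlainSketch(1).key == PlainSketch(1).key
# ...while the uuid-based override never does
assert UnitSketch(1).key != UnitSketch(1).key
```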

This leads to different keys being assigned each time the "same" ProtocolUnit is created:

class DummyProtocolUnit(ProtocolUnit):

    def __init__(self, some_value):
        super().__init__()
        self.some_value = some_value

    def _execute(self, inp):
        return inp + self.some_value

du = DummyProtocolUnit(1)
same_du = DummyProtocolUnit(1)

assert du.key == same_du.key

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 13
     10 du = DummyProtocolUnit(1)
     11 same_du = DummyProtocolUnit(1)
---> 13 assert du.key == same_du.key

And TOKENIZABLE_REGISTRY has weakrefs to both instances:

len(list(TOKENIZABLE_REGISTRY.values()))
>>> 2

This affects not only the unit itself but all downstream classes that depend on ProtocolUnits: ProtocolDAG, ProtocolUnitResult, etc. Isn't this a bit counterintuitive? Shouldn't the default behavior match that of every other GufeTokenizable: if two instances of a class are created with the same _to_dict representation, they are assigned the same key, and only one instance exists for that unit in the registry?

class StableDummyProtocolUnit(ProtocolUnit):

    def __init__(self, some_value):
        super().__init__()
        self.some_value = some_value

    def _execute(self, inp):
        return inp + self.some_value

    def _gufe_tokenize(self):
        # equivalent to GufeTokenizable._gufe_tokenize(self)
        return self.__class__.__base__.__base__._gufe_tokenize(self)

assert StableDummyProtocolUnit(1).key == StableDummyProtocolUnit(1).key

One developer story in favor of stable ProtocolUnit keys is a checkpoint system: if a user's task fails, the next retry resumes from the last successful state (the last completed ProtocolUnit), which is identified by its unique key.

I cannot think of a user story that requires each new instance of a ProtocolUnit to have a different key. Am I missing something?

Would love to hear your thoughts on this!

mikemhenry commented 11 months ago

@LilDojd This is a great question! I will raise this on our internal issue tracker to make sure it gets a response.

dwhswenson commented 11 months ago

Thanks for the feedback! This is intentional. The essential point is that these keys are used to label stored results (for example, there may be a directory that contains the files generated by the unit). The user story we had in mind is that after running a set of simulations once, the user wants to do another independent repeat of the full DAG (maybe to improve error bars). This should still come from the same Transformation (for analysis purposes) but needs a distinct label for storage purposes.

Since the directory where the files are stored needs to be known before the ProtocolUnitResult is generated, it was more natural to use the ProtocolUnit's key for this. (Results really must have distinct keys, even if units do not, since if you run a unit twice you would get different results.)
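To make the unit-key vs. result-key distinction concrete, here is a hedged sketch (the helper name and key format are invented for illustration, not gufe's API) of why result labels must be unique even where unit keys could be stable:

```python
import uuid


def result_key(unit_key: str) -> str:
    # hypothetical scheme: each execution of a unit gets its own result
    # token, since running the same unit twice produces different data
    return f"{unit_key}-result-{uuid.uuid4().hex}"


# two repeats of the "same" unit still need distinct storage labels
first = result_key("DummyProtocolUnit-abc123")
second = result_key("DummyProtocolUnit-abc123")
assert first != second
```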

The way we had planned to handle the checkpointing use case you mention involves two components:

  1. We expect execution engines to capture failure results, eventually including the case of being killed by a signal (e.g., hitting wall time in a queueing system). This means you should always have a ProtocolDAGResult, even if a unit failed.
  2. The previous failed result can be passed as the extends parameter for Protocol.create. It is up to the protocol author to implement recovery from an incomplete result. The DAG that gets created can pick up from the partially complete DAG.
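The two components above can be sketched with stand-in objects (FakeDAGResult and create_dag are invented here to illustrate the flow; they are not gufe's actual signatures):

```python
class FakeDAGResult:
    """Stand-in for ProtocolDAGResult; engines always produce one (component 1)."""

    def __init__(self, completed, failed):
        self.completed = completed  # unit names that succeeded
        self.failed = failed        # unit names that did not

    def ok(self):
        return not self.failed


def create_dag(all_units, extends=None):
    """Stand-in for DAG creation: skip units the previous result finished."""
    done = set(extends.completed) if extends is not None else set()
    # component 2: the new DAG picks up from the partially complete one
    return [u for u in all_units if u not in done]


units = ["setup", "minimize", "production", "analysis"]
failed = FakeDAGResult(completed=["setup", "minimize"], failed=["production"])
# retrying with extends=failed re-runs only what remains
assert create_dag(units, extends=failed) == ["production", "analysis"]
```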

So far, the protocols we've implemented don't have this kind of checkpointing. However, we have definitely tried to ensure that a solution for it is possible!

Thanks again for the questions, and I hope this answers them.

LilDojd commented 11 months ago

Thank you! That clarifies a lot