aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
436 stars 189 forks source link

Check that values put in the workchain context can be deserialized #5124

Open giovannipizzi opened 3 years ago

giovannipizzi commented 3 years ago

There are cases (also reported today in the mailing list) in which complex python objects manage to be serialised in YAML, but cannot be deserialised. The issue I see is that deserialisation only happens when the deamon is stopped and/or restarted, while during normal operation AiiDA does not need to deserialise but just jeeps the object in memory.

As a result, workflow developers don't realise if some object they put in the context has this issue, and months can pass (and maybe they already release and use the workchain in production) before they realise the issue, the first time they need to stop the daemon (and the workchain excepts).

I suggest that, when serialising the object, we also check that it can be deserialised. This can have a little performance hit, as this is done at every step of a workchain, but I'm quite confident that in 99% of the cases the serialisation takes fractions of ms, while the benefit for users is huge (as a user, when I will see the exception I will attribute this to a bug in AiiDA - instead, checking early and returning a clear error message is something the developers will see the first time they run the workchain).

zhubonan commented 3 years ago

I think this would be helpful for work chain developers 🚀 . Would it be possible to detect non-deserializable objects at the time of serialisation instead? I am not entirely sure how pyyaml handles this. But in theory, this would save the performance impact on checking deserialization.

Alternatively, the deserialization check can be made configurable - so if someone really wants the performance (say running hundreds of small workchains), it can be switched off.

sphuber commented 3 years ago

Would it be possible to detect non-deserializable objects at the time of serialisation instead? I am not entirely sure how pyyaml handles this. But in theory, this would save the performance impact on checking deserialization.

I doubt that there would be a way of guaranteeing that something is deserializable by any other way than actually deserializing it. Moreover, even when deserialization works in the current interpreter, this no guarantee that it would work in others as well. Taking the example that prompted the creation of this issue, the actual problem seemed to be that the module of the class could not be imported. This is hardly to do with the deserialization itself, just a weird state of the environment.

Then there is the efficiency question. It would be good to do some benchmarking to make sure we know what the hit would be, but at the very least it should be possible to turn it off. As noted, this would be mostly for people developing new workchains and punishing runtime efficiency for all other cases doesn't seem like a good idea to me. There is also the question how often this would really save someone. In all these years, this is the first reported case. Not to say there haven't been others that have gone unreported, but to me it seems a potentially quite drastic measure for something that isn't all that common and therefore not really problematic.

zhubonan commented 3 years ago

In all these years, this is the first reported case.

Well I and a few others I know have encountered this in the past, but realising that one could just make sure what gets put in the ctx is serializable, no change for the core was requested.

As noted, this would be mostly for people developing new workchains and punishing runtime efficiency for all other cases doesn't seem like a good idea to me.

This is a good point. Although the problem with this deserialisation issue can be easily overlooked as normally one would not explicitly test if their work chains work after daemon restarts, and only later get surprised in the production runs. I think a large fraction of the new users would like to write their own work chain at some point...

Maybe instead of making extract checks, we just warn about this behaviour in the documentation, it does say so here but perhaps need to elaborate what "serializable" actually means, and how to test it.

sphuber commented 3 years ago

Well I and a few others I know have encountered this in the past, but realising that one could just make sure what gets put in the ctx is serializable, no change for the core was requested.

I agree that there have been multiple occasions with problems with serialization, but here we are talking problems exclusively with deserialization, i.e. the serialization was fine. Anyway, your suggestion to improve the documentation on what serializable means and how to test it is useful to address in any case.

giovannipizzi commented 3 years ago

Wasn't there a similar problem in one of @bosonie work chains with siesta, for numpy arrays? (Just to say that maybe this is a problem either people don't notice unless they run a lot, or that they discover and fix themselves - but I still think it's valuable to detect this early). @sphuber can you hint at where the serialisation is done? And in which format the data is typically serialised? (it's just the ctx dict?) So we can do a few benchmarks and see if the performance hit is really a problem or not

sphuber commented 3 years ago

The most important points when a process is serialized is when:

The entire Process instance is serialized (including its ctx attribute and everything in it) to YAML and then stored on the checkpoint attribute on the corresponding ProcessNode.

JPchico commented 3 years ago

Hello! I just thought of giving my two cents since I recently raised this issue in the mailing list. I think updating the documentation to more clearly state what can and cannot be stored in the context would be a very good idea, maybe some examples, as sometimes it is not easy to know what is and what is not serializable.

Having something to check in the workchains could be of great use, perhaps having a debug flag or something similar, of this way the serialization/deserialization could be checked with an auxiliary function which is only activated at will, nullifying the issues with performance loss.