aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Support for `numbers.Complex` #5340

Open janssenhenning opened 2 years ago

janssenhenning commented 2 years ago

Is your feature request related to a problem? Please describe

At the moment a complex number (the builtin `complex` or an equivalent numpy type) cannot be used as a value in `List`, `Dict`, etc., as it is not JSON-serializable, and it cannot be stored in a stand-alone node either.

Would it be possible to add support for these complex number types?

Describe alternatives you've considered

Splitting the values into real and imaginary parts requires a lot of converting back and forth between `float` and `complex`.

ramirezfranciscof commented 2 years ago

Hey @janssenhenning! We are currently a bit swamped trying to finish some last details for the next release of the code, but I think this is definitely doable. I will bring it up with the team in the next meeting and, if all goes well, we can see how to prioritize this after the release (or you could also try making a PR yourself, if that is something that would interest you).

janssenhenning commented 2 years ago

@ramirezfranciscof I would be happy to try to create a PR for this. I had a short look into this yesterday, and the JSON serializer probably has to be customized somewhere to encode/decode complex numbers. I'm not sure whether there are multiple places or just one place through which everything going to the database passes. One place I found so far is `aiida.backends.utils.create_sqlalchemy_engine`.

chrisjsewell commented 2 years ago

> One place I found so far is `aiida.backends.utils.create_sqlalchemy_engine`

Hey @janssenhenning, I might hold off until #4985 is merged, given that this will no longer be there lol. After that I will be happy to help. It's "easily" achieved in some respects; the key is obviously that the encode/decode should be lossless. We should also consider whether this will incur any performance cost when reading from or writing to the database.

sphuber commented 2 years ago

One candidate would be to add this to the clean_value method that is called on all attributes and extras of a node when it is stored: https://github.com/aiidateam/aiida-core/blob/dce004143043c9f2d41c1b6f82cb037dd642a46d/aiida/orm/implementation/utils.py#L40

It already deals with numpy variants of the `int` and `float` base types and some other basic cases. I think `numbers.Complex` could also potentially qualify, since it is part of the standard library's numeric tower. Still, we should consider it carefully, because that function should not become a catch-all that automatically tries to serialize everything and anything.

The main disadvantage from the user's perspective is that there would be no automatic deserialization for this behavior. Essentially, given:

node = Dict({'complex': complex(1, 2)}).store()  # numbers.Complex is an ABC and cannot be instantiated; use the builtin

the attributes of the node would be transformed into something like:

{'complex': {'real': 1, 'imaginary': 2}}

That is what a user would see if they load the node from the database and look at its contents. There is no magic hook that would automatically deserialize this again.

Maybe this would already destroy any benefit that the automatic serialization might provide.
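To make the idea concrete, a minimal sketch of what such a `clean_value`-style encoding step might look like. This is hypothetical code, not the actual aiida-core implementation; the `clean_complex` name and the `'real'`/`'imaginary'` keys simply follow the example above:

```python
import numbers

def clean_complex(value):
    """Encode a complex number as a plain JSON-serializable dict.

    Hypothetical sketch: the key names mirror the example in this thread.
    """
    # int and float also register as numbers.Complex, so exclude numbers.Real
    # to catch only genuinely complex values.
    if isinstance(value, numbers.Complex) and not isinstance(value, numbers.Real):
        return {'real': value.real, 'imaginary': value.imag}
    return value

print(clean_complex(complex(1, 2)))  # {'real': 1.0, 'imaginary': 2.0}
print(clean_complex(3))              # 3 (non-complex values pass through)
```

Note the `numbers.Real` exclusion: without it, plain ints and floats would be wrapped as well, since they are registered in the numeric tower under `numbers.Complex`.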

janssenhenning commented 2 years ago

I just checked back here and I see that the PR mentioned by @chrisjsewell is now merged.

If serialization were only supported via `clean_value`, as @sphuber suggested, I agree that the benefit would be minimal to non-existent, since it would still require a lot of manual conversion back to complex numbers. I think the only way to support complex numbers consistently is to customize the JSON encoder/decoder that is used, via the `default` function for encoding and `object_hook` for decoding.

sphuber commented 2 years ago

I don't think there is a way to reliably decode automatically. Initially one might think that whenever we encounter a subdictionary of the form `{'real': 1, 'imaginary': 2}` we could automatically convert it back to a complex instance, but it is possible that some data has this structure without having come from an automatically serialized complex number; the user may have stored it literally like this. Automatic deserialization would be undesirable in that case. So I think we should close this for now. If you can come up with an acceptable general solution, feel free to reopen.
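To illustrate the ambiguity concretely, here is a sketch of a naive decoder (hypothetical code, not from aiida-core) that misfires on a dict the user stored literally:

```python
import json

def naive_as_complex(dct):
    # Naive hook: converts ANY dict with these keys, including ones the
    # user stored literally and never meant as a complex number.
    if set(dct) == {'real', 'imaginary'}:
        return complex(dct['real'], dct['imaginary'])
    return dct

# The user stored this as an ordinary dict of two floats...
data = json.loads('{"real": 1, "imaginary": 2}', object_hook=naive_as_complex)
print(type(data))  # <class 'complex'> -- the user's literal dict was silently lost
```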

janssenhenning commented 2 years ago

@sphuber I understand your point: the complex numbers should not mess up unrelated cases. The idea I thought of, but had not written down in this issue so far, is to use the same structure as described in the docs of the stdlib `json` module.

import json

def as_complex(dct):
    if '__complex__' in dct:
        return complex(dct['real'], dct['imag'])
    return dct

json.loads('{"__complex__": true, "real": 1, "imag": 2}',
           object_hook=as_complex)

And the corresponding serialization. Since the decoder only looks for the `__complex__` key, there should be less risk of wrong deserialization. The serializer could also emit a warning or raise an error if it finds this key in a dict to be serialized.
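The serialization counterpart could follow the same stdlib pattern, using the `default` hook of `json.dumps`. A sketch (assumed names, not aiida-core code), showing that the round trip is lossless:

```python
import json

def encode_complex(obj):
    """`default` hook: tag complex values with the '__complex__' marker."""
    if isinstance(obj, complex):
        return {'__complex__': True, 'real': obj.real, 'imag': obj.imag}
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

def as_complex(dct):
    """`object_hook`: restore only dicts carrying the '__complex__' marker."""
    if '__complex__' in dct:
        return complex(dct['real'], dct['imag'])
    return dct

encoded = json.dumps({'z': complex(1, 2), 'plain': 3}, default=encode_complex)
decoded = json.loads(encoded, object_hook=as_complex)
# round trip is lossless: decoded == {'z': (1+2j), 'plain': 3}
```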

Edit: I don't know why but I can't reopen the issue 😅

sphuber commented 1 year ago

Here a summary of a discussion with @chrisjsewell and @janssenhenning

It seems that the implementation is definitely feasible; it is just a question of where to put the burden: on the abstract backend ORM interface of aiida-core or on the `StorageBackend` implementation. Currently, attributes and extras are serialized directly in the implementation, notably by the `PsqlDosBackend`.

Moving this to the abstract backend ORM interface would have the advantage that it automatically provides consistent behavior for all storage backend implementations. However, it also leaves less room for the implementation to optimize the behavior. That being said, the current optimization of the `PsqlDosBackend` is to only apply serialization for stored nodes and when storing nodes. The case where attributes are mutated many times on unstored nodes should be very rare to non-existent, so this optimization is probably not very useful.

Although it seems perfectly feasible to implement serialization/deserialization on the backend ORM interface, the question is whether this is conceptually the right thing to do. Currently there is no deserialization, and this change would impose deserialization on all attributes and extras being fetched. It is not yet clear what the performance impact of this change would be. @sphuber will attempt to make an example implementation and provide some benchmarks.

Another concern is that there is already a disconnect between how data is returned to the user through the front-end ORM and how it is stored in the backend. This wouldn't be a problem were it not for the fact that querying is done on the data as it is stored, which means a user has to query for data in a form different from the form in which it is returned. Currently the differences are minimal, since `clean_value` performs few changes to make data JSON-serializable, but adding support for complex numbers would set us on a path of increasing the discrepancy. The question is whether we should do this and start implementing our own JSON+ schema. Alternatively, the responsibility for supporting additional data types could rest with the `Data` plugin, which itself provides utility methods to serialize/deserialize data between Python and the data storage.

Still, given that AiiDA should support data-science workflows and complex numbers are a common use case, it would make sense to have integrated support in attributes and extras. So if the performance hit of automatic serialization/deserialization is acceptable, we may still decide to add support for it.

janssenhenning commented 1 year ago

I want to add one additional thing that I noticed when looking back over the code yesterday. In all places where deserialization would be inserted, the return value is deep-copied. I don't know much about the `copy` module internals, but in terms of complexity a deep copy should be comparable to iterating through all nested dictionaries. So it might be worth moving the copy into the deserialization routine, which already does this iteration, to reduce the performance impact.
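The suggestion above can be sketched as a single recursive pass that both copies the nested structure and decodes tagged values, instead of deserializing and then deep-copying separately. Hypothetical code (the function name and the `__complex__` tagging scheme follow this thread, not aiida-core):

```python
def deserialize_attributes(value):
    """Recursively copy `value`, decoding tagged complex dicts along the way.

    One traversal does the work of both `copy.deepcopy` and deserialization.
    """
    if isinstance(value, dict):
        if value.get('__complex__'):
            return complex(value['real'], value['imag'])
        return {key: deserialize_attributes(val) for key, val in value.items()}
    if isinstance(value, list):
        return [deserialize_attributes(item) for item in value]
    return value  # scalars (int, float, str, bool, None) are immutable: no copy needed

attrs = {'z': {'__complex__': True, 'real': 1, 'imag': 2}, 'nested': [{'a': 1}]}
result = deserialize_attributes(attrs)
# result['z'] == complex(1, 2); result['nested'] is a fresh copy of the input list
```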