Usability: Provide a generic mechanism to serialize and deserialize AiiDA data

Motivation

This user story is slightly more technical than it should be, but there is no real other way of phrasing it. Essentially, it boils down to the following:

One of AiiDA's main functions for a user is to store large quantities of data. While the Python API provides many tools to interact with and manipulate this data, sooner or later the data will have to leave AiiDA. Conversely, to start working with AiiDA, data will have to be ingested. In other words, data stored in AiiDA will have to be serialized into a certain format when leaving its database, and data has to be deserialized from a certain format when it is ingested.

Currently, there are already two major tools that implement such a (de)serialization:

The archiving system
The REST API (both the version that ships with aiida-core and the aiida-restapi package)

The REST API uses JSON to (de)serialize data, but it implements custom translators to do so. This is the core problem: every tool currently has to implement its own code to (de)serialize data, since the Python ORM cannot be used. Moreover, there is no single mechanism to determine the "schema" of a piece of data, so it has to be hardcoded.

Desired Outcome

Ideally, AiiDA would provide a generic mechanism to serialize and deserialize any data that can be stored within its database. This would essentially require each ORM type to define a schema of its data structure that can be requested by a client of the API. This would allow external applications to write utilities to reliably extract data from AiiDA or store data within it. The key here is that it should not be necessary to write custom serializers for plugins, but that they are automatically supported through the general formalism.
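As a rough illustration of what "each ORM type defines a schema" could mean, here is a minimal sketch using only the Python standard library. All names in it (SerializableEntity, ToyStructure, schema, serialize, deserialize) are hypothetical and are not part of the aiida-core API. It only shows that a schema derived from declared fields lets a subclass be (de)serialized without a custom serializer.

```python
import json
from dataclasses import asdict, dataclass, fields

# Hypothetical mapping from Python field types to JSON-schema-like type names.
_JSON_TYPE_NAMES = {
    int: "integer",
    float: "number",
    str: "string",
    bool: "boolean",
    list: "array",
    dict: "object",
}


@dataclass
class SerializableEntity:
    """Hypothetical base class: subclasses declare their data as dataclass fields."""

    @classmethod
    def schema(cls) -> dict:
        # The schema is derived from the declared fields, so nothing is hardcoded
        # per subclass and a client can request it at runtime.
        return {
            "title": cls.__name__,
            "type": "object",
            "properties": {f.name: {"type": _JSON_TYPE_NAMES[f.type]} for f in fields(cls)},
        }

    def serialize(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def deserialize(cls, payload: str):
        return cls(**json.loads(payload))


@dataclass
class ToyStructure(SerializableEntity):
    """Stand-in for a plugin-defined Data subclass; it defines no serializer itself."""

    symbols: list
    cell_length: float


structure = ToyStructure(symbols=["Si", "Si"], cell_length=5.43)
restored = ToyStructure.deserialize(structure.serialize())
```

The point of the sketch is that ToyStructure gets its schema and its round trip purely from the base class, which is the property a plugin author would want from the general formalism.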

Impact

A successful solution will touch many other use cases, such as the REST and web APIs mentioned above (for example, see #16).

Complexity

In principle, all data in AiiDA is stored either as JSON-serializable data (in the PostgreSQL database) or as binary blobs (in the file repository). The simplest approach then would be to have a JSON-extended format that includes support for binary blobs. But the exact serialization format is not the real problem; other solutions could be used. The main question is how to have all data in AiiDA define a schema. We could do this for the ORM entities that are shipped with aiida-core, but the tricky part is that this should also work for plugins, such as Data subclasses. A solution should be generic and work regardless of any custom plugins that are installed.
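One possible reading of a "JSON-extended format that includes support for binary blobs" is plain JSON with the file contents base64-encoded inline. The sketch below assumes exactly that. The function names and the "attributes" / "files" layout are made up for illustration and do not correspond to any existing AiiDA format.

```python
import base64
import json


def serialize_node(attributes: dict, files: dict) -> str:
    """Pack JSON-serializable attributes and binary files into one JSON document."""
    # JSON has no bytes type, so each blob is base64-encoded into an ASCII string.
    encoded = {name: base64.b64encode(content).decode("ascii") for name, content in files.items()}
    return json.dumps({"attributes": attributes, "files": encoded})


def deserialize_node(payload: str) -> tuple:
    """Inverse of serialize_node: returns (attributes, files)."""
    document = json.loads(payload)
    files = {name: base64.b64decode(text) for name, text in document["files"].items()}
    return document["attributes"], files


payload = serialize_node({"energy": -1.5, "units": "eV"}, {"output.bin": b"\x00\x01\xff"})
attributes, files = deserialize_node(payload)
```

Inline base64 keeps everything in a single document but inflates the binary content by roughly a third, which is one reason other formats could be preferable for large files.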

The real difficulty is that, as currently implemented, the interface of the Data class, and especially the way its instances are constructed, allows the use of arbitrary types. For example, StructureData can be constructed from the pymatgen.core.Structure or ase.Atoms types. These are typically not generically serializable. It might be necessary to change the interface of Data to force it to statically declare its data schema and to allow construction of an instance from serialized data without requiring Python types. Unfortunately, this would almost certainly require backwards-incompatible changes to the Data interface.
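To make the proposed interface change more concrete, the following sketch shows a class whose primary constructor accepts only serialized data that matches a statically declared schema, with construction from a rich Python object layered on top as a convenience. ToyStructureData, SCHEMA, from_serialized, from_atoms_like and FakeAtoms are all hypothetical names, and this is not how StructureData is currently implemented. FakeAtoms only mimics the get_chemical_symbols() and get_positions() methods of ase.Atoms so that the example runs without ase installed.

```python
class ToyStructureData:
    """Hypothetical stand-in for a Data plugin; not the real StructureData."""

    # Statically declared schema: field name -> expected JSON-compatible type.
    SCHEMA = {"symbols": list, "positions": list}

    def __init__(self, symbols, positions):
        self.symbols = symbols
        self.positions = positions

    @classmethod
    def from_serialized(cls, data: dict) -> "ToyStructureData":
        """Primary constructor: accepts only plain data matching SCHEMA."""
        if set(data) != set(cls.SCHEMA):
            raise ValueError(f"expected fields {sorted(cls.SCHEMA)}, got {sorted(data)}")
        for name, expected in cls.SCHEMA.items():
            if not isinstance(data[name], expected):
                raise TypeError(f"field {name!r} must be of type {expected.__name__}")
        return cls(**data)

    def serialize(self) -> dict:
        return {"symbols": self.symbols, "positions": self.positions}

    @classmethod
    def from_atoms_like(cls, atoms) -> "ToyStructureData":
        """Convenience constructor: converts a rich object to plain data first."""
        return cls.from_serialized(
            {
                "symbols": list(atoms.get_chemical_symbols()),
                "positions": [[float(x) for x in row] for row in atoms.get_positions()],
            }
        )


class FakeAtoms:
    """Duck-typed stand-in for an ase.Atoms-like object."""

    def get_chemical_symbols(self):
        return ["H", "H"]

    def get_positions(self):
        return [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)]


node = ToyStructureData.from_atoms_like(FakeAtoms())
clone = ToyStructureData.from_serialized(node.serialize())
```

With this layering, a client that knows nothing about ase or pymatgen can still create an instance, because the only required input is data that conforms to SCHEMA.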

Background

This issue has already been discussed in aiida-core, see this issue where it is being tracked. No concrete advances have been made yet.

Progress

So far, no concrete progress has been made in addressing this problem.

For example the StructureData allows its constructions through the pymatgen.core.Structure or ase.Atoms types. These are typically not generically serializable

Can you elaborate on why that is a problem for the (de)serialization of the Data class?

It's not like Data ends up directly storing pymatgen.core.Structure / ase.Atoms instances, so as long as our own Data instances are (de)serializable, I think it's fine to have ways of constructing it from non-serializable data types.

In practical terms, I think the part below is what has posed difficulty (in particular in the implementation of aiida-restapi):

In principle, all data in AiiDA is stored either as JSON-serializable data (in the PostgreSQL database) or as binary blobs (in the file repository). The simplest approach then would be to have a JSON-extended format that includes support for binary blobs.

Supporting a set of JSON fields per entity is straightforward in most tools (pydantic for REST API / graphene for GraphQL), but downloading and uploading files typically requires custom solutions.

I'm not sure dumping the blobs from the file repository into a JSON response is really the way to go here... if we really believe that is the case, does this mean we should get rid of the distinction between database and file repository altogether?

I could imagine a JSON serialization format that includes pointers to binary blobs, and just have a generic "get blob" interface.

It would be great to spend some thought on this question.
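To make the pointer idea a bit more tangible, here is a minimal sketch under the assumption that blobs are addressed by their SHA-256 digest. BlobStore, put, get_blob and the document layout are invented for illustration. An actual implementation would sit on top of the file repository rather than an in-memory dictionary.

```python
import hashlib
import json


class BlobStore:
    """Hypothetical in-memory stand-in for a generic "get blob" interface."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        # Content-addressed key: the hex SHA-256 digest of the bytes.
        key = hashlib.sha256(content).hexdigest()
        self._blobs[key] = content
        return key

    def get_blob(self, key: str) -> bytes:
        return self._blobs[key]


def serialize_node(attributes: dict, files: dict, store: BlobStore) -> str:
    """Return a JSON document that points to blobs instead of embedding them."""
    pointers = {
        name: {"sha256": store.put(content), "size": len(content)}
        for name, content in files.items()
    }
    return json.dumps({"attributes": attributes, "files": pointers})


store = BlobStore()
payload = serialize_node({"label": "relax"}, {"aiida.out": b"converged\n"}, store)
document = json.loads(payload)
blob = store.get_blob(document["files"]["aiida.out"]["sha256"])
```

The JSON document stays small and fully JSON-compatible, and a client fetches file contents only when it needs them.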

Can you elaborate on why that is a problem for the (de)serialization of the Data class?

It's not like Data ends up directly storing pymatgen.core.Structure / ase.Atoms instances, so as long as our own Data instances are (de)serializable, I think it's fine to have ways of constructing it from non-serializable data types.

It is not a problem per se, as long as it is just one of the ways of constructing an instance. The current problem is that it is the only way of doing things. On top of that, there is no way to introspect what serialized data is stored and how (since there is no schema), so third-party applications have to hard-code mappings for data plugins. What we want is a design where, for any entity, the data schema can be discovered dynamically such that data can be serialized and deserialized.

This is definitely possible in principle, just not in a backwards-compatible way, for, at the very least, the Data plugins.

I'm not sure dumping the blobs from the file repository into a JSON response is really the way to go here...

Honestly, the exact specifics of the serialization format are not even that important, as long as there is one. I am not necessarily advocating for JSON with integrated blobs, just saying that these are the principal data types, so anything that supports both could be a potential candidate.

if we really believe that is the case, does this mean we should get rid of the distinction between database and file repository altogether?

Even if we have a shared serialization format to export data out of AiiDA, that does not mean that the exact same format has to be used in the storage. Of course, if for efficiency reasons one would want that, it is already possible to implement a StorageBackend that just has a database and stores everything there. But this is not the discussion here.
