aiidateam / aiida-atomistic

AiiDA plugin which contains data and methods for atomistic simulations.
MIT License
1 stars 5 forks source link

Immutability and usability #9

Open mikibonacci opened 2 months ago

mikibonacci commented 2 months ago

As design principle, we want to have the StructureData as immutable even if it is not stored. This is done in order to avoid the user to attempt to modify the stored node afterwards, as one usually would do in standard python.

However, this blocks also flexibility on how we produce our StructureData instance as ready to be used in calculation. For example, it will be no more possible to use methods like append_atom() for the unstored node, as well as the append_hubbard_parameter in case of the Hubbard property.

To change/update the instance, the only possibility is to generate a new instance, and this is made easier by having the to_dict method, which returns a python dictionary with everything we need to initialise a new object. However, I think that then the user has to deal with a rather complex object:

properties = {'cell': {'value': [[3.5, 0.0, 0.0], [0.0, 3.5, 0.0], [0.0, 0.0, 3.5]]},
 'pbc': {'value': [True, True, True]},
 'positions': {'value': [[0.0, 0.0, 0.0], [1.5, 1.5, 1.5]]},
 'symbols': {'value': ['Li', 'Li']},
 'mass': {'value': [6.941, 6.941]},
 'charge': {'value': [1.0, 0.0]},
 'kinds': {'value': ['Li0', 'Li1']}}

(we can always dump this as an AttributeDict).

The main difficulty here is that, to change a property, we have to go into this nested structure every time. Moreover, there are not method to change multiple properties at the same time (e.g., a python dictionary has no remove_atom method) - the user should implements its own function to do it.

I am thinking that maybe returning a different object, can make the modification easier for the user.

For example, we can think to return an instance of the Properties class (property attribute of the structure node), which actually contains all the information on the StructureData node. This contains the instances of the properties as attributes, so we can call methods of the properties to modify them if we need it.

In [1]: structure.properties.charge
Out[1]: Charge(parent=<StructureData: uuid: 4ccb3961-e6ff-4b32-8dcf-a638b765981c (unstored)>, domain='intra-site', default_kind_threshold=0.1, value=[1.0, 0.0])

The modified Properties instance can be used to initialise a new StructureData (consistency checks are done in the __init__). One issue that I see is that right now the properties are connected to the StructureData instance. We can think on making the properties mutable only if they are not attached to any StructureData instance, i.e. if are only attributes of the Properties instance:

In [2]: properties = structure.to_mutable() 
In [3]: properties.charge
Out[3]: Charge(domain='intra-site', default_kind_threshold=0.1, value=[1.0, 0.0])

we can then even think to provide methods to append/remove atoms and properties:

In [4]: properties.append_atom(positions=[0,0,0], symbol="Li", mass=6.941, charge=-1)
In [5]: properties.charge.append_charge(charge=-1, atom_index=1)

It seems that in this way we are providing a new class, which is not AiiDA datatype. It seems that we have then to maintain two classes: actually we are somehow transferring all the modification methods (append, remove atoms) to the Property class.

I am really not sure this is the way to go, it is just an idea on how we can try to improve user experience.

In the end we can somehow achieve a similar thing (limited wrt the properties which are supported), by using ASE or Pymatgen combined with StructureData.from/to_ase/pymatgen.

mbercx commented 2 months ago

Thanks for the write up, @mikibonacci! I think in the end the position of not having any methods that mutate the structure or its properties will not be maintainable (even though I semi-argued for it in this discussion). Two arguments are:

  1. Moving to having all nodes immutable is so backwards incompatible it's almost a pipe dream. The Dict node is mutable, and changing that would break 95% of the code base (rough estimate ^^).
  2. How would you typically obtain a StructureData? I for one would use either ASE Atoms or pymatgen StructureData, creating one of those instances from a CIF and then converting it into a StructureData. If at that point the StructureData is completely immutable, I can't add any more properties. So the from_pymatgen, from_ase class methods will not nearly be as useful.
mikibonacci commented 1 month ago

After useful discussions with @GeigerJ2 and @khsrali, we think that a StructureData unstored node should be mutable (as currently is) before being stored. The main reason is simple: the user wants to modify a StructureData unstored node and expects to be able to do it. Also, point 2 of the previous comment by @mbercx is a strong argument against immutability of StructureData unstored nodes.

I think, the user will complain more about the immutability of an unstored node, with respect to him doing something wrong in modifying a StructureData (e.g. appending wrong atoms). Of course we can put validation checks (I mean, we extend the _validate method of the StructureData).

In conclusion, I think we should let the user change unstored StructureData nodes. We just document the rule that, after being stored, the node will be immutable. Of course this is already in the AiiDA documentation , however @GeigerJ2 was suggesting to have like a concise but complete page where we collect all these AiiDA rules, that sounds as a very good idea.

Pinning @sphuber and @giovannipizzi for useful comments on this.

sphuber commented 1 month ago

As mentioned during the group meeting, it would be great to get some concrete code snippets that show were a mutable class would be significantly better compared to an immutable variant. We have had many discussions concerning this issue over the last ten years or so as a result of confusion arising from having data classes be mutable when unstored. I think that merits having strong evidence on the other side to counteract this.

GeigerJ2 commented 1 month ago

In principle, now that I think about it, I don't see much speaking against making StructureData immutable, actually. Though, the methods now used to modify it in place (e.g. append_atom) shouldn't return the complex properties dictionary, as mentioned in @mikibonacci's original write-up, but instead a new, unstored StructureData instance (in essence, the string edit behavior mentioned by @mbercx). This should also be easy to implement, internally obtaining the dict/serializable representation of the original StructureData node, updating it with the data provided through the method, and passing it again to the constructor. So, rather than:

sd_instance.append_atom(...)
sd_instance.initialize_onsites_hubbard(...)

doing instead (similar to pandas dataframes):

sd_instance = sd_instance.append_atom(...)
sd_instance = sd_instance.initialize_onsites_hubbard(...)

seems like a very acceptable change to me. When preparing some structure, I would not care about the intermediate nodes. Just the final one that I pass as an input to the process, which becomes stored anyway, and doesn't have to be the same one as I started out with.

On the other hand, as noted by @mbercx:

  1. Moving to having all nodes immutable is so backwards incompatible it's almost a pipe dream. The Dict node is mutable, and changing that would break 95% of the code base (rough estimate ^^).

it's not feasible to be consistent for all AiiDA classes anyway. Then is it worth to apply this behavior only for the StructureData now? We remove the confusion of the different behavior of an unstored or stored StructureData node, but we add the confusion of some AiiDA data classes always being immutable, while some others remain mutable (if unstored). If we deem it acceptable, we should list clearly, which classes behave in which way (similar to the overview of the behavior of the different Python base classes that every beginner is confronted with).

I'll go ahead and implement an immutable version of the current new StructureData we have been working on. Should be quick (at least for a reduced set of methods).

sphuber commented 1 month ago

Why do we even need this append_atom API? I think this is completely unnecessary. Why can't a user just define their atoms and pass them in one go? Why even have this incremental creation to begin with. The fact that the current StructureData has it should not be an argument in any case, I would say

GeigerJ2 commented 1 month ago

Alright, fair point. I think this ties in with the general question: If StructureData is always immutable, should we provide any method that seems like it modifies it? In principle, it's true that users can always just define their ase.Atoms, pymatgen.Structure, or dict/serialized object beforehand, and then pass it to one of the various constructors of StructureData, so we don't really need it.

However, I'm thinking now of a use case that was very common in my PhD: Say, one wants to explore adsorption of hydrogen on a surface slab. Without having a full, dedicated AiiDA workflow for the whole procedure, one would first optimize the pristine slab, e.g. using one of the RestartWorkChains from aiida-quantumespresso/vasp. From this, one obtains the optimized structure as StructureData. In such calculations, it's common practice to keep the bottom layers fixed at the bulk positions, so the user initially set the property positions_fixed, of the StructureData instance, which is a list of bools (or a list of length 3 tuples of bools to specify movement restrictions for x, y, and z, individually). While it is chosen here as an example for the sake of the example, it is an actual property we want to add, but one could also think of on-site / inter-site Hubbard, or other site-based properties.

Now, from this optimized slab, one wants to create an input StructureData for exploring hydrogen adsorption. If StructureData provides an append_atom (or append_site) method that allows one to "modify" the initial StructureData instance in a simple way (even if it is immutable and a new instance is returned), it would be very simple for the user to add a hydrogen atom on top by passing the required information. If this is not possible, one would have to first convert it to ase or pymatgen, then add the atom, then convert it back. While still acceptable, a bit annoying. The more general case is that any calculation run via AiiDA returns StructureData, so for the user to be able to "just define their atoms and pass them in one go" would always require these two conversion steps for structural modifications then.

There is also the general question of how to conserve (pass through) the positions_fixed (or Hubbard) property during this procedure? ase.Atoms doesn't have this property defined (in the same manner), so what do we do with that when converting to it? And how do we set it again when converting again to StructureData? (ase actually has a FixAtoms class to take care of that)

Having an append_atom (or append_site) doesn't solve this issue magically, we would still need to add logic and validation to make sure that the properties are consistent and preserved throughout (e.g. raise an exception if one wants to append an ase.Atoms object to a StructureData if the latter has positions_fixed defined, while the former does not). As it's not a super straightforward issue, we could make our life much easier by leaving this responsibility to the user, but in the spirit of user-friendliness, I think it would be very nice to provide such convenience methods. Especially if it's moved out of aiida-core and into aiida-atomistic, we could take care of such implementation details (like ase using the FixAtoms class), and hide that complexity away from the AiiDA user to make the experience as seamless as possible.

When starting from scratch, I'd also never generate my StructureData in this incremental route, but usually make the edits outside and then use one of the alternative constructors from_ase/pymatgen/file. But for the reasons above, I'd still keep it as a convenience method for the user. Hope this was somewhat understandable, and curios to hear your thoughts!

khsrali commented 1 month ago

@sphuber Maybe I'm missing something, but honestly, I don't understand why it hurts to have StructureData mutable only before storage? That doesn't break any provenance or whatsoever; it's just in the initialization phase. It's really convenient and practical: importing a half structure from a file, adding a few hydrogen atoms here and there, and adding this ase atom that I want to add, all of which are really convenient. Once the structure is built up, store it, and make it immutable, it's alright. Why should users be limited by not having such streamlined code behavior, Is there a specific reason?

sphuber commented 1 month ago

That doesn't break any provenance or whatsoever;

The objection has nothing to do with the preservation of provenance.

Why should users be limited by not having such streamlined code behavior, Is there a specific reason?

The reason is exactly to ensure the interface is as user-friendly as possible and has as little surprising behavior as possible. Experience over the last decade has learned that having mutating methods on a data class often leads to suprising behavior for users and can easily lead to bugs. Note that this is specific to AiiDA, compared to other tools like ASE and pymatgen, exactly because AiiDA has this concept of immutability once stored.

Take the append_atom method as an example. When the node is unstored, it would look something like:

structure = StructureData()
structure.append_atom(...)
structure.append_atom(...)

Now the user loads an existing structure from the database and wants to add some atoms as per @GeigerJ2 's example:

structure = load_node(...)
structure.append_atom(...)
structure.append_atom(...)

only to find that his structure has not been updated at all. Let the debugging commence! Until they figure out that when the structure is stored then all of a sudden append_atom returns a new StructureData instance. This is because the starting node is stored and immutable. So now they would have to change the code to

structure = load_node(...)
new_structure = structure.append_atom(...)
new_structure.append_atom(...)

Of course the new_structure is probably not stored, so in the second append_atom we don't have to assign the return value but we can mutate new_structure.

This is just one example of course, but the more mutating methods you add, the more edge-cases there will be. I admit that there will not be a perfect solution and there will be trade-offs. I just wanted to make sure we don't make the same mistake again without having really thought hard about it. The relatively short comment stating that you guys had decided that a mutable variant was better, with little supporting evidence of concrete examples, made me worried that the historic counter arguments were perhaps not sufficiently taken into account. And, with all due respect, the fact that you seemed to not even be aware of the main counter argument seems to lend credence to that suspicion. This is the reason why I asked for some concrete examples so we can evaluate the pros/cons of both solutions and actually compare.

Finally, the argument

Moving to having all nodes immutable is so backwards incompatible it's almost a pipe dream. The Dict node is mutable, and changing that would break 95% of the code base (rough estimate ^^).

I don't find very convincing to be honest. The whole reason we are redesigning the StructureData is because we know that there are flaws, and moving it to aiida-atomistic is our one and only chance to get it right. If we think the immutable variant is better, we should implement that and actually aim to do the same for the other built-ins in aiida-core for v3.

khsrali commented 1 month ago

I appreciate your explanation; this is a good discussion. Of course, we, or at least I, have not been involved in this discussion since 10 years ago. Indeed, one can trace back everything on github, but that would take quite some time if one had to do that for everything. Anyway, your reasons are helping me think this through. I don't believe any decision has been made final yet; the idea was to bring this up and discuss it.

In the example you are raising, I would still find it user-friendly to do:

structure = load_node(..., copy=True) # returns unstored copy
structure.append_atom(..)

I'll wait for Miki's example, as you requested. So we can continue discussion on pros/cons based on that.

mikibonacci commented 1 month ago

First thing to say, I agree that returning a new modified instance of the node every time we invoke the append_atom method (or others) will give issues. Thanks @sphuber for having underlined this.

Second thing: trying to append_atom on a stored node should raise an exception like this:

ModificationNotAllowed: ...the node is already stored. If you want to modify it, you should copy it via the node.clone() method, which returns an unstored copy of the node. 

This means that the user will know how to proceed (i.e. cloning the node).

Apology for the long comment below. The summary is: immutability has great advantage of being error-safe, but in some cases makes the user-life more difficult. For StructureData this may require then to support a new python class, mutable, which has with StructureData a 1-to-1 correspondence (similar to Dict<->dict, with all the issues of the case), and that can contain useful modification methods. This is basically my proposal: immutable StructureData but with a python (non-AiiDA) counterpart.

(1) Simple example: selective dynamics, starting from ASE Atoms object

Let's first consider the following initial conditions:

  1. we want to run relaxation/MD, so we should consider the simple case of selective_dynamics, where each site can be then kept fixed at the starting position during the simulation. Let's suppose to have, in our new StructureData, the possibility to define, for a given site, the selective_dynamics property (or whatever name)

  2. we start from an ASE Atoms object (because I like its flexibility: we can build slabs, append molecules on top, and so on), which however does not contain any selective_dynamics property (or its analogous)

from ase import Atoms
ase_atoms = Atoms('N2', [(0, 0, 0), (1, 0, 0)])
ase_atoms.cell = [2,2,2]
ase_atoms.set_initial_charges([0,0.5])

from aiida_atomistic.data.structure import StructureData
structure = StructureData.from_ase(ase_atoms)
structure.attributes

The attributes of the structure are:

{'cell': [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],
 'pbc1': False,
 'pbc2': False,
 'pbc3': False,
 'sites': [
        {'symbol': 'N',
           'mass': 14.007,
           'kind_name': 'N',
           'position': (0.0, 0.0, 0.0),
           'charge': 0.0,
           'magnetization': 0.0},
        {'symbol': 'N',
         'mass': 14.007,
         'kind_name': 'N',
         'position': (1.0, 0.0, 0.0),
         'charge': 0.5,
         'magnetization': 0.0},
      ],
}

Which is exactly the set of inputs that we should provide in the StructureData constructor if we are not starting from ASE. (And is the format I think we should provide when we will introduce the pydantic PR in aiida. That's another story, so let's forget about this for now.)

Let's say I want to keep fixed the first N atom. In the new StructureData, this is a property that should be set for the first site.

Mutable version:

structure.sites[0].selective_dynamics = (False, False, False) # no allowed movements in (x, y, z)
structure.attributes

This gives:

{'cell': [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],
...
 'sites': [
        {'symbol': 'N',
           'mass': 14.007,
           'kind_name': 'N',
           'position': (0.0, 0.0, 0.0),
           'charge': 0.0,
           'magnetization': 0.0,
           'selective_dynamics': (False,False,False)}, # <=== Added a new property in the first site!
...
      ],
}

Let's say that I stored the structure in AiiDA, by using .store() or/and by running a WorkChain/Calcjob using it as input. Then I want to modify again: now let's see what happens by letting the first N atom to move.

structure = load_node(stored_structure_pk)
structure.sites[0].selective_dynamics = (True, True, True) # allowed movements in (x, y, z)

raises (or will raise, but this is how I see the API in the mutable version)

ModificationNotAllowed: ...the node is already stored. If you want to modify it, you should copy it via the node.clone() method, which returns an unstored copy of the node. 

the same will happen if we try to invoke append_atom(...). The users then knows what to do:

structure = load_node(stored_structure_pk).clone()
structure.sites[0].selective_dynamics = (True, True, True) # allowed movements in (x, y, z)

This is slightly involved to me. But still acceptable.

Immutable version:

suppose we already initialized the structure from_ase, as before. Of course structure.sites[0].selective_dynamics = (False, False, False) will not work, our node is already immutable before being stored. The raising error will tell the user to create a new instance of the StructureData. Here, not even the .clone() trick will work. So we should always initialize a new StructureData instance, which contains the new information on the selective dynamics:

This gives:

new_structure_dict = structure.to_dict() # or just structure.serialize() in pydantic PR way
new_structure_dict["sites"][0]["selective_dynamics"] = (False, False, False)
new_structure = StructureData(**new_structure_dict)

Now the user should deal with a dictionary, modify it, using it to initialize a new StructureData. I find this more involved from a user perspective than the mutable version of doing things. We let the user deal with a dictionary which somehow should have a 1-to-1 correspondence to the a crystal structure. However, I see the point of @sphuber here, and how it can be error-safe to not let the StructureData be mutable. We have only one way to modify a structure, i.e. modifying the inputs and initializing a new one. Once the node is initialized, the user can inspect it and really do not expect anything to change (it cannot access any mutation method).

(2) More critical example: adding the Hubbard property, then deleting one atom: how to reorganize data?

Let's assume we have our StructureData initialized to 3 atoms and including the Hubbard property, which can be seen as a list like this:

hubbard_list = [
    (0, '3d', 0, '3d', 7.2362, (0, 0, 0), 'V'), # atom_index, atom_manifold, neighbour_index, neighbour_manifold, value, translation, hubbard_type
    (0, '3d', 2, '2p', 0.2999, (-1, 0, -1), 'V'),
    (0, '3d', 1, '2p', 0.2999, (0, 0, -1), 'V'),
    (0, '3d', 1, '2p', 0.2999, (-1, 0, 0), 'V'),
    (0, '3d', 2, '2p', 0.2999, (0, -1, -1), 'V'),
    (0, '3d', 2, '2p', 0.2999, (-1, -1, 0), 'V'),
    (0, '3d', 1, '2p', 0.2999, (0, -1, 0), 'V')
]

and is stored in the structure.hubbard attribute (as there are also inter-site interactions, we should store it as global property, not under each site). Let's also suppose we want to then delete the third atom, we don't want it to be in the new StructureData. So then I should use the delete_atom method, if the node is mutable:

structure.delete_atom(index = 2)
structure.hubbard

which gives:

 [
    (0, '3d', 0, '3d', 7.2362, (0, 0, 0), 'V'),
    (0, '3d', 1, '2p', 0.2999, (0, 0, -1), 'V'),
    (0, '3d', 1, '2p', 0.2999, (-1, 0, 0), 'V'),
    (0, '3d', 1, '2p', 0.2999, (0, -1, 0), 'V')
]

we see that, in the hubbard property, all the elements involving the third site (i.e. of index=2) are removed automatically. This thanks to the delete_atom method, which internally checks for the hubbard interactions which are no more there. If the StructureData node is immutable, we should proceed in this way:

new_structure_dict = structure.to_dict() # or just structure.serialize() in pydantic PR way

new_structure_dict["sites"].pop(2) # remove the third site/atom
new_hubbard = [interaction for interaction in new_structure_dict["hubbard"] if (interaction[0]!=2 or interaction[2] != 2)]
new_structure_dict["hubbard"]=new_hubbard

new_structure = StructureData(**new_structure_dict)

To do this, the user needs to know how the hubbard property is structured. The user working with hubbard will know for sure, but the user loading a structure from a database and then trying to just remove atoms may accidentally ignore the hubbard property and so encounter issues in the run (concrete cases being removing termination from ribbons, splitting heterostructures, or removing adsorbed from slabs). Well, then there will be an exception and the user will know that he has to change also hubbard, if the exception is detailed enough. But then every time he wants to delete an atom, he needs this code snippet, instead of just very simply invoking the delete_atom after having cloned the node. The code snippet above is not really impossible thing to write. But imagine to remove not the last atom, but the second one: then also the indexes in the hubbard should be updated, making things more complicated for the user (the delete_atom should deal with this).
Moreover, we cannot go back to ASE/pymatgen, remove the atom and then go back: we will lose hubbard information, not supported for these non-AiiDA classes.

Comment:

I am not sure we are comparing the same kind of user friendliness, where one is about not letting the user make mistakes, one is on letting the user make mistakes but easily access methods to do what he wants, avoiding dumping dictionaries and looping on them. I have the feeling that a very concise documentation on the behavior of AiiDA nodes can help if we go for mutability. However, as I already wrote above, the immutability has on its side the predictability of the behavior (basically no behavior).

Why for StructureData is so difficult to accept immutability?

The thing that I see is that, at variance with other AiiDA Data type as Dict, List, we have not the strict analogous python data type for StructureData. In my opinion, if we have immutable StructureData, we should provide a python class which is able to capture all the aspect of what we store in the StructureData node, and which implements methods to modify itself (append_atom, initialise_hubbard and so on). I know that this means supporting an additional class, additional code, but I think it means also help the user to work with AiiDA. In principle, with this new class, the StructureData will have not a lot of methods, and will be really just a container for a crystal structure. We can put validation methods which raise error before the .store(). or during the initialization of the instance.

structure_python_class = StructurePython.from_ase(ase_atoms)
structure_python_class.append_atom(symbol="N", position=(1,1,1),...)
structure = StructureData.from_python_structure(structure_python_class)
structure.store()
new_structure_to_be_modified = structure.to_structure_python_class()

we can even think to let WorkChains and Calcjobs to automatically serialize StructurePython objects to StructureData ones if provided as input.

I see this concept of having the python counterpart class also for other atomistic data types, like KpointsData. But let's see first if it is a good idea or not :smile:

sphuber commented 1 month ago

Thanks a lot for the detailed writeup @mikibonacci this is exactly the thing we need.

Let me first start with your selective dynamics example. You mention that setter-methods, for example append_atom, can simply raise a ModificationNotAllowedError if the node was stored. Very well, this would pose no problem indeed.

You claim the mutable version would provide for a better user experience, for example nested properties, such as selective dynamics can be accesses through properties:

structure.sites[0].selective_dynamics = (True, True, True) # allowed movements in (x, y, z)

and then when the node was stored, this would simply raise again. This is exactly where the problem lies though. How do you plan on having this raise a ModificationNotAllowedError if the node is stored? You are not assigning directly on the structure, but on some type that was returned by .sites[0].selective_dynamics. This is essentially exactly the "bug" described in https://github.com/aiidateam/aiida-core/issues/5970 for the Dict node.

What probably would happen is that the line that changes the selective dynamics is executed without an exception, but the structure will not actually have been changed at all. And this happens without any warning or error. The user will be left scratching their head why the structure wasn't changed.

This is exactly the confusing behavior that I was warning about.

Now I do agree that if you limit the interface to provide no mutating methods, it probably does make it less straightforward to use. The example about selective dynamics, I don't find the most convincing though. The variant

structure.sites[0].selective_dynamics = (True, True, True)

is not all that much more straightforward than

new_structure_dict["sites"][0]["selective_dynamics"] = (False, False, False)

The Hubbard example is more convincing. I do agree that having a delete_atom method that would automatically correct all the internal properties, such as Hubbard, would be a lot more convenient than having to do it manually. I wonder if there are other solutions though? When it comes to Hubbard, there will already be utility functions to even build the initial mapping. I don't think users would build up the hubbard_list you provide an example of, manually, would they? Couldn't there be similar functions that update those lists?

And was there not the concept of allowing custom properties? Is that still in the plans? And if so, how do you plan that the delete_atom would automatically now to update them to keep them consistent? If the properties are custom, I imagine there would not be a very strict schema, would there be?

mikibonacci commented 1 month ago

Thanks @sphuber for the reply.

I see the point of modifying structure.sites[0].selective_dynamics, and indeed this will not raise the error as the elements of the structure.sites list don't know about the structure object, because actually they are not attached to the structure, only their list is. This also is a problem for unstored nodes as, changing the selective_dynamic of a single site directly, will not change the structure.base.attributes["sites"][0] element - really an unexpected behavior from user perspective.

This is indeed the same problem you mentioned for the Dict. However, here the situation is slightly different: structure.sites[0] is a Site object, so we can implement the possibility to attach to the Site instance also the parent node (i.e. the StructureData). Briefly, the implementation should be something like (forgetting pydantic, for now):

class Site:

    def __init__(self, parent,  **kwargs):
         ...
        self.parent = parent # StructureData
    ...
    @selective_dynamics.setter
    def selective_dynamics(self, value):
        if self.parent.is_stored:
              raise ModificationNotAllowed 
        self._selective_dynamics = value
        self.parent.base.attributes.set("sites", self.parent.sites)

I see this providing the expected behavior for the user.

Instead, for the Dict, this self.parent "trick" would be nearly impossible, because of the fact that we are directly dealing with a python built-in data type. I see you wrote a possible solution here, where the idea is to have a custom dictionary class for the nested dictionary (correct me if I misunderstood).

About the Hubbard example, I think the user will provide exactly a list as I did; I followed this tutorial as written by @bastonero. Other methods (as you can see here) can be used to append/remove Hubbard parameters, but they still require a mutable HubbardStructureData. Even if we had utility functions to update those lists, the user still needs to use that functions: he needs to know when and how to use them.

Maybe, just providing a delete_atom function to act consistently on the structure-like python dictionary can be a solution. But I don't see than many differences with respect to a non-AiiDA StructurePython (pseudo random name, sorry) class which contains the utility functions as methods to modify itself. In the end, it will be an AttributeDict with additional methods.

For what concerns the custom properties, I think we should just let the user know that we don't provide helper methods for them, as indeed there will be no strict schema. I guess this is fair and will not be a big problem, as the user that wants to store custom properties will usually also be "expert" one, and he will take care to modify itself the dictionary.

sphuber commented 1 month ago

Your Site example is indeed a way to work around the problem I described. It is indeed equivalent to the workaround I mentioned for the Dict as you mentioned. The main point of this example is exactly to highlight how easy it is to have unexpected behavior with the mutable design.

This also is a problem for unstored nodes as, changing the selective_dynamic of a single site directly, will not change the structure.base.attributes["sites"][0] element - really an unexpected behavior from user perspective.

Surely this is then more evidence that this design is perhaps not the best, whether it is for mutable or immutable?

For what concerns the custom properties, I think we should just let the user know that we don't provide helper methods for them, as indeed there will be no strict schema. I guess this is fair and will not be a big problem, as the user that wants to store custom properties will usually also be "expert" one, and he will take care to modify itself the dictionary.

This is the part I don't get. If you do want to support custom properties, surely you would have to make sure that the methods that are supposed to coherently manage the internal state (e.g. delete_atom) also function for these properties?

Again, I see the point of wanting to make a class that provides useful methods to mutate the internal state, however, how useful can it really be if there is a ton of caveats. It would be like saying:

The StructureData provides a user-friendly interface to manipulate crystal structures, with (Python) properties to retrieve structure properties and methods to change the structure. But be careful, don't use the properties to change the structure. And also, the methods don't guarantee consistency for all properties when updating. And probably there are more pitfalls...

Anyway, I think the point is clear. Maybe this is still a more user-friendly solution than the immutable variant without properties/utility methods. It is clear there is no perfect solution and it is a question of tradeoffs. I would personally err on the safe side and keep the interface minimal and treat the class just as a data container, but if others think the mutable variant is better, that is fine as well.

GeigerJ2 commented 1 month ago

Hey guys, thank you for these great comments! I agree with @sphuber, that this really seems like we have to accept some tradeoffs eventually.

To put some more of my thoughts out there for discussion:

1) How about we make the StructureData fully immutable and instead of providing the functions to "modify" it as methods of the class, we provide them as normal functions in some other file, e.g. utils.py? These functions would then take a StructureData as input argument and return a new StructureData instance. It could make the explanation in the docs quite simple, because we can really separate the two concepts, i.e. "StructureData itself is always immutable. Utility functions to "modify" a StructureData always return a new instance." At least to me, as the modification functions are not methods of the class itself, this would clearly indicate that they cannot be used to alter its internal state. One would require a bit more code through the additional import, but at least the complexity of dealing with the dict / serializable representation would be hidden away from the user. So it might be a reasonable compromise for the approaches discussed here so far?

2) It still doesn't prevent the user from direct assignment like:

structure.sites[0].selective_dynamics = (True, True, True)

though. I'm wondering now how one would do that without the workaround? Possibly by returning a custom class (e.g. PropertyCollection) when the user accesses one of the properties via dot-notation, which does not support indexing? This would prevent anyone from running the above snippet already in the first place.

3) For the utility functions that return modified StructureData instances, the internal operations on the dict / serializable representation could be abstracted, depending on the nature of the property, which is one of the following: i) Just one "value" per Site (e.g. symbol, mass, charge, magnetization) ii) Array per Site (e.g. positions, selective_dynamics (for x,y,z)) iii) Inter-site properties with arrays that go over all other atoms (e.g. hubbard - is there anything else?)

Adding or removing a Site should work in the same way for all properties of the same group. Using these general functions in the functions that modify the dict / serializable representation (e.g. adding a Site, removing a Site, or changing one inside the StructureData, accessible via index, symbol, etc.), could then work in the same way, also for additional custom properties? Though, I might be naive here and missing something, I just started thinking about this very superficially...

4) Lastly, I'm somewhat opposed to having two classes, which the user has to deal with, and wonder why one is mutable and the other immutable. Also, they might end up just always using the mutable one (at least that's what I did when working with pymatgen, which provides the Structure/Molecule and Istructure/IMolecule where the latter two are immutable versions of the former two).

mikibonacci commented 3 weeks ago

Hi @GeigerJ2 and @sphuber , I finally got the time to go back again on StructureData.

On the utility functions

About the utility functions proposed by @GeigerJ2 , I think this will be sub-optimal as then the user has to know which utility functions we provide (i.e. no tab completion), for example looking at a documentation. The fact of returning a new AiiDA node each time is something I don't like as then the user may have to write some code like this:

structure_old = StructureData(...)
for new_atom in atoms:
      structure_old = append_atom(structure_old, new_atom)

at least this what I would write to append more than one atom in my structure. I see however that this guarantees immutability and no ambiguities on how to deal with StructureData nodes. Another great advantage is that then we can add utilities without breaking/modifying the StructureData class, as it's just a container. Indeed, we can implement methods to avoid the loop done in the above code snippet:

structure_old = StructureData(...)
structure_old = append_several_atoms(structure_old, atom_list=...)
structure_old = append_hubbard_property(structure_old, hubbard_list=...)

Still on mutable non-AiiDA classes

On the other hand, having a python object containing such utilities as methods can provide the beloved tab completion and mutability of itself (it's not AiiDA). The user indeed will end up always using this mutable class, and it make sense in some way: the StructureData will be just the container and only used in the Calcjobs/WorkChains. Do you see some issue in this @sphuber , @GeigerJ2 ? The problem of having this python class can be the following: the user may start from ASE/Pymatgen, then convert it to this class to add Hubbard, then initialise the StructureData node from this. This would be then a three-step process which I don't like. However, if we allow automatic input serialization python class -> StructureData, this will be then the two-step process which we usually do (ASE-> StructureData). Probably we should provide this also for ASE atoms (maybe we already do it somewhere, @superstar54 ?).

ASE just released

Another option would be to contribute to ASE (they just released) and add methods there. Or even build a new class on top of it which just extend with the properties we need to add, like Hubbard.

Bit on custom properties

For the custom properties, I don't see providing much support for modifying them consistently as indeed a custom property can be something new, e.g. some parameter concerning three sites together (like a plane in the cell?) and so we cannot know a priori how to deal with this.

Comment:

I think the solutions are then:

1 - immutability with detached utilities (so basically the immutability) 2 - immutability with additional python class (methods are the above utilities) 3 - mutability with all the methods in there 4 - some other option?

superstar54 commented 3 weeks ago

Happy to join this hot discussion! :laughing: Based on my experiences with WorkGraph (especially PythonJob and REST API), I would like to give my two cents.

1) All AiiDA data should be immutable, whether stored or not. This will prevent unpredictable behavior from user modifications. 2) Based on 1), AiiDA data is just a container of raw Python data. The AiiDA data should not contain any method to edit the attributes of the raw Python data. The only data one can read from AiiDA data is its raw Python data, not any attribute of the raw Python data, to prevent the user from editing the attributes, e.g., one can not read structure.pbc, one should use structure.value.pbc. structure.value is the raw Python data. 3) A consistent and straightforward API across all data types. For accessing raw Python data, the method value should be used uniformly across data types, replacing more specific methods like List.get_list() and Dict.get_dict() with List.value and Dict.value, respectively. For initializing the AiiDA Data, we should pass the raw Python data as a whole into the AiiDA Data class, instead of append_atom, set_kpoints_mesh etc. 4) Based on above, I suggest creating a Python data class for the structure with all methods to edit and read its attributes, or use an existing one, like ase.Atoms or pymatgen.Structure.

khsrali commented 3 weeks ago

@mikibonacci thanks for your the good example

we see that, in the hubbard property, all the elements involving the third site (i.e. of index=2) are removed automatically. This thanks to the delete_atom method, which internally checks for hubbard interactions, which are no more there.

Maybe this is a naive suggestion. But, to be fair, that could also be done if the structure is immutable, just need to do everything behind the scenes and return a new instance.

We could either:

a) Allow mutable when unstored,

but then we should always raise ModificationNotAllowedError on a flag like copy=False when data is stored. and the default value is copy=False, unless explicitly set on True

new_structure = structure.delete_atom(index = 2, copy=True)
new_structure.hubbard

and if they wrongly do

structure.delete_atom(index = 2)
structure.hubbard

This would raise: ModificationNotAllowedError: Cannot modify immutable object; please set copy=True to return a new instance

b) Keep it immutable,

then return a new instance for all operations like this delete, append, etc.. and we advertise that this is the right way to do it:

new_structure = structure.delete_atom(index = 2)
new_structure.hubbard