aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
434 stars 187 forks source link

Define purpose of __repr__, __str__ for AiiDA nodes #3087

Open ltalirz opened 5 years ago

ltalirz commented 5 years ago

@chrisjsewell 's graphviz PR #2596 needs to know which "short description" should be shown for a given AiiDA node.

It makes sense that this information be provided by the node itself rather than e.g. by the graph visualization tool.

Status quo

There are (at least) two built-in python functions used for this:

Their implementations in AiiDA often differ - see below:

In [1]: a= Int(30)

In [19]: a
Out[19]: <Int: uuid: 15f11fcf-02fa-4bbf-b3ef-ee1139f3206c (unstored) value: 30>

In [2]: print(a)
uuid: 15f11fcf-02fa-4bbf-b3ef-ee1139f3206c (unstored) value: 30

In [3]: from aiida.orm import Computer

In [4]: qb=QueryBuilder()

In [5]: qb.append(Computer)
Out[5]: <aiida.orm.querybuilder.QueryBuilder at 0x111300080>

In [6]: l=qb.all()

In [7]: l
Out[7]:
[[<Computer: localhost (localhost), pk: 1>],
 [<Computer: localhost2 (localhost), pk: 2>],
 [<Computer: daint-s888 (daint.cscs.ch), pk: 3>],
 [<Computer: fidis (fidis.epfl.ch), pk: 4>]]

In [9]: print(l[0][0])
localhost (localhost), pk: 1

In [10]: from aiida.orm import StructureData

In [11]: qb=QueryBuilder()

In [12]: qb.append(StructureData)
Out[12]: <aiida.orm.querybuilder.QueryBuilder at 0x111358ba8>

In [14]: s=qb.first()

In [17]: print(s)
uuid: da407675-294f-4e73-8a00-5ea5c37601d0 (pk: 2)

In [20]: qb=QueryBuilder()

In [21]: qb.append(User)
Out[21]: <aiida.orm.querybuilder.QueryBuilder at 0x11138c208>

In [22]: u=qb.first()

In [24]: u[0]
Out[24]: <aiida.orm.users.User at 0x111398f98>

In [25]: print(u[0])
leopold.talirz@epfl.ch

Proposal

The python docs say on __repr__

This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.

So, I believe a value like <Int: value: 30, uuid: 15f11fcf-02fa-4bbf-b3ef-ee1139f3206c (unstored) > makes sense here.

The python docs say on __str__

[...] the “informal” or nicely printable string representation of an object. [...] a more convenient or concise representation can be used.

Do we even need to show the UUID here? This could be the description that Chris uses, e.g. value: 30.

Even if we decide to show the UUID in both cases, I think we should factor out the value: 30 part so that it can be reused by __str__, by __repr__ and by @chrisjsewell .

@sphuber @giovannipizzi Please let me know your take on this.

CasperWA commented 3 years ago

This came up again and some discussion is present in the PR #4999 between myself and @sphuber.

sphuber commented 3 years ago

The typical "consensus" online on what the __repr__ and __str__ should represent seems to boil down to:

__str__ should return a human-readable description of the object and __repr__ is intended for developers during debugging

This seems to be the pragmatic outcome of the original intended use, which is described by the official Python docs:

object.repr(self) Called by the repr() built-in function to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines repr() but not str(), then repr() is also used when an “informal” string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.

object.str(self) Called by str(object) and the built-in functions format() and print() to compute the “informal” or nicely printable string representation of an object. The return value must be a string object. This method differs from object.repr() in that there is no expectation that str() return a valid Python expression: a more convenient or concise representation can be used. The default implementation defined by the built-in type object calls object.repr().

The most important part in the __repr__ description for me is "If at all possible". I think that in the majority of cases, it will be impossible to construct a string that when passed to eval will properly reconstruct the object, especially for our ORM classes. The reason is that they are compound objects, i.e., they take objects as argument that themselves are complex objects. For example, just the Node does not just take a simple value, but also the backend, user and other objects that themselves are difficult, if not impossible, to reconstruct from a string. Luckily, however, all our ORM objects have a unique identifier, their UUID. Since the main goal of the __repr__ is to be as unambiguous as possible, such that in debugging the object can be properly identified, using the UUID seems very natural to me. That's why I propose, that as a basis for the __repr__ of ORM classes, we use the form

f'<{object.__class__}:{object.uuid}>'

Having this simple form will also make it easy to consistently apply it to all the ORM classes.

Then, for the __str__ it is a bit more complicated because I think there is a trade off to be made between two goals:

  1. Give it enough information for it to be useful
  2. Give it not too much information for it not to be too noisy when used in

The reason that sparked this conversation again is the suggestion to define the __str__ for all nodes classes such that in CLI command outputs, we can simply call f'Some node {node}' to get the desired representation. If we only ever rely on the __str__ and not use ad-hoc formatting in the output, ORM classes will always be represented in an identical way in the output of all CLI commands, which would be very valuable. If this is the main use case though, I would argue that in that case we should keep the information at a minimum. To see why, consider the example of verdi group create. At the end of the command, it will probably print something like:

echo.echo_success(f'Created the group {group}.')

If we only consider goal 1, where we would put as much information as possible, we would get something like:

Success: Created the group `Group<ec4f4420-af4a-41c2-baa0-8df32f28ecbd|123123|user: default@localhost| my_group| 20210506 12:34:56>`.

Now I would argue that this is too much information and is not actually useful. When you get many of these representations in a piece of output, it will become overwhelming and unreadable. A much better output would be:

Success: Created the group `Group<my_group>`

Since the label is unique for groups, this unambiguously identifies it and should give all the information that is needed. If more is needed, the label can be easily used in verdi group show <LABEL>. The one downside here is that not all ORM classes have a unique label, for example Node, and in that case using the label will be much less effective. In that case, I would use the pk instead. The UUID could also work, but since it is quite long it can be a bit noisy. The only advantage of the UUID is that it is cross-database compatible, but I would argue that for CLI output this is not important whatsoever. To me then it seems that using the pk as the base identifier in output seems to be the best choice and so as a default, I would propose the following implementation for __str__:

def __str__(self):
    if self.is_stored:
        return f'{self.__class__.__name__<self.pk>}
    return f'{self.__class__.__name__<self.uuid>}

If we use this consistently for all classes, I think this would bring a lot of value to the uniformness in command outputs.

The one thing that we can still consider is to have some room for some additional custom information based on the particular sub class. It can be argued that for, for example, base Data types, it is very useful to also show the value. For example for Int, having Int<140> 5 maybe a lot more useful than simply Int<140> and does not contain too much information. However, the exact formatting may not be so easy (especially when it is used in a sentence "Created the node Int<140> 5 successfully" just looks weird.) and the concept is not really applicable to all classes, not even all base data types. Case in point, how do you do this for a Dict node? Again, I am thinking that consistency across all node classes is worth more than having a slight informational gain in some cases.

TLDR: I propose the following two implementations for all ORM classes:

def __repr__(self):
    if self.is_stored:
        return f'<{self.__class__}:{self.pk}>'
    return f'<{self.__class__}:{self.uuid}>'

def __str__(self):
    if self.is_stored:
        return f'{self.__class__.__name__}<{self.pk}>'
    return f'{self.__class__.__name__}<{self.uuid}>'

Which would look like:

group = Group('some-label)
group
<aiida.orm.groups.group.Group:ec4f4420-af4a-41c2-baa0-8df32f28ecbd>
print(group)
Group<ec4f4420-af4a-41c2-baa0-8df32f28ecbd>

group.store()
group
<aiida.orm.groups.group.Group:123>
print(group)
Group<123>
CasperWA commented 3 years ago

This seems like a good compromise. I'd probably add a bit more information about what parameter we're showing in the __str__ output, e.g., Group<pk=123> or Group<pk:123>, and instead of using the PK for __repr__ I'd always go with the UUID instead. Even when stored. Either that or it's both PK and UUID when stored (for local simplicity when debugging or similar). And I like the current (unstored) string for entities as well, so it could be something like <aiida.orm.groups.group.Group | unstored | ec4f4420-af4a-41c2-baa0-8df32f28ecbd> and then <aiida.orm.groups.group.Group | 123 | ec4f4420-af4a-41c2-baa0-8df32f28ecbd>?

Concerning the extra information for certain Node types, especially the base Node types, I think we should be flexible here and be "friendly". But only with the __str__ representation. Essentially mirroring what the base Python types do as well for this, and try to mimick that output - but within the framework that we've already set up here. So an example for Int could be Int<pk=123 | value=5> or something? For Dict it would be Dict<pk=321 | value={'my_key': 'my-value', 'my_nested-dict': {'see me nesting': 'I\'m a bird'}}>. While this makes for a large output, it also makes a lot of sense in the context of the Node type to have everything. One can even argue to just print the exact same as would be printed for a dict, but I personally like the reminder that this is not a dict but an AiiDA Dict node...

giovannipizzi commented 3 years ago

As an addition, that might help (maybe already mentioned?): note that nodes also have a get_description() method that returns a much longer description. Long information might be put there (and we can advertise it more as it's not a standard feature). Also, as discussed elsewhere (I don't remember where...) we could/should also add some methods so that if people are using the classes in jupyter, they see some nicer HTML output where we can use colors, fonts, etc. - this would work without them having to know much, and it's nicer graphically.

sphuber commented 3 years ago

I'd probably add a bit more information about what parameter we're showing i

Don't think this is necessary. The beauty of only ever showing the PK or UUID is that this will always make it implicitly, yet unambiguously, clear which is being shown because they are guaranteed to have different formats.

For Dict it would be Dict<pk=321 | value={'my_key': 'my-value', 'my_nested-dict': {'see me nesting': 'I\'m a bird'}}>. While this makes for a large output, it also makes a lot of sense in the context of the Node type to have everything.

Again, I would explicitly not do this. I think that we should define __str__ such that we can always use it effectively in CLI commands. Once you start adding these long outputs, that simply is no longer the case. Just for example, take verdi node comment, where you probably want to say at the end something like:

echo.echo_success(f'Added comment to {node}')

All I want to see here is

Success: Added comment to Node<15123>

If this happens to be an output_parameters Dict node from a calcjob, with your suggestion you would get

Success: Added comment to Node<pk=321 | value={"absolute_magnetization": 2.58,"absolute_magnetization_units": "Bohrmag / cell""atomic_charges_units": "e""atomic_magnetic_moments_units": "Bohrmag / cell""beta_real_space": false,"charge_density": "./charge-density.dat""constraint_mag": 0,"convergence_info": {"scf_conv": {"convergence_achieved": true,"n_scf_steps": 20,"scf_error": 1.1066269391433e-11}    },"creator_name": "pwscf""creator_version": "6.5""degauss": 0.136056917253,"dft_exchange_correlation": "PBE""do_magnetization": true,"do_not_use_time_reversal": false,"energy": -8959.052366436,"energy_accuracy": 2.993252179566e-10,"energy_accuracy_units": "eV""energy_ewald": -4799.3964595658,"energy_ewald_units": "eV""energy_hartree": 1739.8835448164,"energy_hartree_units": "eV""energy_one_center_paw": -2072.2925053599,"energy_one_center_paw_units": "eV""energy_one_electron": -2881.2005051896,"energy_one_electron_units": "eV""energy_smearing": -0.00081552516201448,"energy_smearing_units": "eV""energy_threshold": 1.21e-12,"energy_units": "eV""energy_xc": -946.04562574798,"energy_xc_units": "eV""estimated_ram_per_process": 422.38,"estimated_ram_per_process_units": "MB""estimated_ram_total": 844.77,"estimated_ram_total_units": "MB""exit_status": 0,"fermi_energy": 19.3607,"fermi_energy_units": "eV""fft_grid": [60,60,60    ],"forces_units": "ev / angstrom""format_name": "QEXSD""format_version": "19.03.04""has_dipole_correction": false,"has_electric_field": false,"init_wall_time_seconds": 84.8,"inversion_symmetry": true,"lattice_symmetries": [],"lda_plus_u_calculation": false,"lkpoint_dir": false,"lsda": true,"magnetization_angle1": [0.0,0.0    ],"magnetization_angle2": [0.0,0.0    ],"monkhorst_pack_grid": [23,23,23    ],"monkhorst_pack_offset": [0,0,0    ],"no_time_rev_operations": false,"non_colinear_calculation": false,"number_of_atomic_wfc": 26,"number_of_atoms": 2,"number_of_bands": 20,"number_of_bands_down": 20,"number_of_bands_up": 20,"number_of_bravais_symmetries": 48,"number_of_electrons": 32.0,"number_of_k_points": 364,"number_of_species": 2,"number_of_spin_components": 2,"number_of_symmetries": 48,"occupations": "smearing""q_real_space": false,"rho_cutoff": 14694.147063324,"rho_cutoff_units": "eV""scf_iterations": 20,"smearing_type": "mv""smooth_fft_grid": [32,32,32    ],"spin_orbit_calculation": false,"spin_orbit_domag": true,"starting_magnetization": [-0.25,0.25    ],"stress_units": "GPascal""symmetries": [{"symmetry_number": 0,"t_rev": "0"},{"symmetry_number": 1,"t_rev": "0"},{"symmetry_number": 2,"t_rev": "0"},{"symmetry_number": 3,"t_rev": "0"},{"symmetry_number": 4,"t_rev": "0"},{"symmetry_number": 5,"t_rev": "0"},{"symmetry_number": 6,"t_rev": "0"},{"symmetry_number": 7,"t_rev": "0"},{"symmetry_number": 8,"t_rev": "0"},{"symmetry_number": 9,"t_rev": "0"},{"symmetry_number": 10,"t_rev": "0"},{"symmetry_number": 11,"t_rev": "0"},{"symmetry_number": 12,"t_rev": "0"},{"symmetry_number": 13,"t_rev": "0"},{"symmetry_number": 14,"t_rev": "0"},{"symmetry_number": 15,"t_rev": "0"},{"symmetry_number": 16,"t_rev": "0"},{"symmetry_number": 17,"t_rev": "0"},{"symmetry_number": 18,"t_rev": "0"},{"symmetry_number": 19,"t_rev": "0"},{"symmetry_number": 20,"t_rev": "0"},{"symmetry_number": 21,"t_rev": "0"},{"symmetry_number": 22,"t_rev": "0"},{"symmetry_number": 23,"t_rev": "0"},{"symmetry_number": 32,"t_rev": "0"},{"symmetry_number": 33,"t_rev": "0"},{"symmetry_number": 34,"t_rev": "0"},{"symmetry_number": 35,"t_rev": "0"},{"symmetry_number": 36,"t_rev": "0"},{"symmetry_number": 37,"t_rev": "0"},{"symmetry_number": 38,"t_rev": "0"},{"symmetry_number": 39,"t_rev": "0"},{"symmetry_number": 40,"t_rev": "0"},{"symmetry_number": 41,"t_rev": "0"},{"symmetry_number": 42,"t_rev": "0"},{"symmetry_number": 43,"t_rev": "0"},{"symmetry_number": 44,"t_rev": "0"},{"symmetry_number": 45,"t_rev": "0"},{"symmetry_number": 46,"t_rev": "0"},{"symmetry_number": 47,"t_rev": "0"},{"symmetry_number": 48,"t_rev": "0"},{"symmetry_number": 49,"t_rev": "0"},{"symmetry_number": 50,"t_rev": "0"},{"symmetry_number": 51,"t_rev": "0"},{"symmetry_number": 52,"t_rev": "0"},{"symmetry_number": 53,"t_rev": "0"},{"symmetry_number": 54,"t_rev": "0"},{"symmetry_number": 55,"t_rev": "0"}    ],"symmetries_units": "crystal""time_reversal_flag": true,"total_force": 0.0,"total_force_units": "ev / angstrom""total_magnetization": 0.0,"total_magnetization_units": "Bohrmag / cell""total_number_of_scf_iterations": 20,"volume": 21.839617220476,"wall_time": "      1h13m ","wall_time_seconds": 4380.0,"wfc_cutoff": 1224.512255277,"wfc_cutoff_units": "eV"
}>

How is this useful? I know I picked an extreme example, but that is just the point. There is no way to make this work for all node types. All we really need is the UUID or pk. There are all sorts of other commands to figure out more information, should you need to, using that identifier.

That being said, I am not saying there shouldn't be some other method that provides richer information. Like @giovannipizzi mentioned, there is the get_description that exists for certain classes that might be a better solution for this. Or even better, indeed some way to provide any number of rich formats with detailed information. The difference is that these should be called explicitly. I think the default display (as provided by __str__) should be as consistent and as well-suited for inline use in text as possible.