Open ltalirz opened 5 years ago
This came up again and some discussion is present in the PR #4999 between myself and @sphuber.
The typical "consensus" online on what the __repr__
and __str__
should represent seems to boil down to:
__str__
should return a human-readable description of the object and__repr__
is intended for developers during debugging
This seems to be the pragmatic outcome of the original intended use, which is described by the official Python docs:
object.repr(self) Called by the repr() built-in function to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines repr() but not str(), then repr() is also used when an “informal” string representation of instances of that class is required. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
object.str(self) Called by str(object) and the built-in functions format() and print() to compute the “informal” or nicely printable string representation of an object. The return value must be a string object. This method differs from object.repr() in that there is no expectation that str() return a valid Python expression: a more convenient or concise representation can be used. The default implementation defined by the built-in type object calls object.repr().
The most important part in the __repr__
description for me is "If at all possible". I think that in the majority of cases, it will be impossible to construct a string that when passed to eval
will properly reconstruct the object, especially for our ORM classes. The reason is that they are compound objects, i.e., they take objects as argument that themselves are complex objects. For example, just the Node
does not just take a simple value, but also the backend
, user
and other objects that themselves are difficult, if not impossible, to reconstruct from a string. Luckily, however, all our ORM objects have a unique identifier, their UUID. Since the main goal of the __repr__
is to be as unambiguous as possible, such that in debugging the object can be properly identified, using the UUID seems very natural to me. That's why I propose, that as a basis for the __repr__
of ORM classes, we use the form
f'<{object.__class__}:{object.uuid}>'
Having this simple form will also make it easy to consistently apply it to all the ORM classes.
Then, for the __str__
it is a bit more complicated because I think there is a trade off to be made between two goals:
The reason that sparked this conversation again is the suggestion to define the __str__
for all nodes classes such that in CLI command outputs, we can simply call f'Some node {node}'
to get the desired representation. If we only ever rely on the __str__
and not use ad-hoc formatting in the output, ORM classes will always be represented in an identical way in the output of all CLI commands, which would be very valuable. If this is the main use case though, I would argue that in that case we should keep the information at a minimum. To see why, consider the example of verdi group create
. At the end of the command, it will probably print something like:
echo.echo_success(f'Created the group {group}.')
If we only consider goal 1, where we would put as much information as possible, we would get something like:
Success: Created the group `Group<ec4f4420-af4a-41c2-baa0-8df32f28ecbd|123123|user: default@localhost| my_group| 20210506 12:34:56>`.
Now I would argue that this is too much information and is not actually useful. When you get many of these representations in a piece of output, it will become overwhelming and unreadable. A much better output would be:
Success: Created the group `Group<my_group>`
Since the label
is unique for groups, this unambiguously identifies it and should give all the information that is needed. If more is needed, the label can be easily used in verdi group show <LABEL>
. The one downside here is that not all ORM classes have a unique label, for example Node
, and in that case using the label will be much less effective. In that case, I would use the pk instead. The UUID could also work, but since it is quite long it can be a bit noisy. The only advantage of the UUID is that it is cross-database compatible, but I would argue that for CLI output this is not important whatsoever. To me then it seems that using the pk as the base identifier in output seems to be the best choice and so as a default, I would propose the following implementation for __str__
:
def __str__(self):
if self.is_stored:
return f'{self.__class__.__name__<self.pk>}
return f'{self.__class__.__name__<self.uuid>}
If we use this consistently for all classes, I think this would bring a lot of value to the uniformness in command outputs.
The one thing that we can still consider is to have some room for some additional custom information based on the particular sub class. It can be argued that for, for example, base Data
types, it is very useful to also show the value
. For example for Int
, having Int<140> 5
maybe a lot more useful than simply Int<140>
and does not contain too much information. However, the exact formatting may not be so easy (especially when it is used in a sentence "Created the node Int<140> 5 successfully" just looks weird.) and the concept is not really applicable to all classes, not even all base data types. Case in point, how do you do this for a Dict
node? Again, I am thinking that consistency across all node classes is worth more than having a slight informational gain in some cases.
TLDR: I propose the following two implementations for all ORM classes:
def __repr__(self):
if self.is_stored:
return f'<{self.__class__}:{self.pk}>'
return f'<{self.__class__}:{self.uuid}>'
def __str__(self):
if self.is_stored:
return f'{self.__class__.__name__}<{self.pk}>'
return f'{self.__class__.__name__}<{self.uuid}>'
Which would look like:
group = Group('some-label)
group
<aiida.orm.groups.group.Group:ec4f4420-af4a-41c2-baa0-8df32f28ecbd>
print(group)
Group<ec4f4420-af4a-41c2-baa0-8df32f28ecbd>
group.store()
group
<aiida.orm.groups.group.Group:123>
print(group)
Group<123>
This seems like a good compromise.
I'd probably add a bit more information about what parameter we're showing in the __str__
output, e.g., Group<pk=123>
or Group<pk:123>
, and instead of using the PK for __repr__
I'd always go with the UUID instead. Even when stored. Either that or it's both PK and UUID when stored (for local simplicity when debugging or similar). And I like the current (unstored)
string for entities as well, so it could be something like <aiida.orm.groups.group.Group | unstored | ec4f4420-af4a-41c2-baa0-8df32f28ecbd>
and then <aiida.orm.groups.group.Group | 123 | ec4f4420-af4a-41c2-baa0-8df32f28ecbd>
?
Concerning the extra information for certain Node
types, especially the base Node
types, I think we should be flexible here and be "friendly". But only with the __str__
representation. Essentially mirroring what the base Python types do as well for this, and try to mimick that output - but within the framework that we've already set up here.
So an example for Int
could be Int<pk=123 | value=5>
or something?
For Dict
it would be Dict<pk=321 | value={'my_key': 'my-value', 'my_nested-dict': {'see me nesting': 'I\'m a bird'}}>
. While this makes for a large output, it also makes a lot of sense in the context of the Node
type to have everything. One can even argue to just print the exact same as would be printed for a dict
, but I personally like the reminder that this is not a dict
but an AiiDA Dict
node...
As an addition, that might help (maybe already mentioned?): note that nodes also have a get_description()
method that returns a much longer description. Long information might be put there (and we can advertise it more as it's not a standard feature). Also, as discussed elsewhere (I don't remember where...) we could/should also add some methods so that if people are using the classes in jupyter, they see some nicer HTML output where we can use colors, fonts, etc. - this would work without them having to know much, and it's nicer graphically.
I'd probably add a bit more information about what parameter we're showing i
Don't think this is necessary. The beauty of only ever showing the PK or UUID is that this will always make it implicitly, yet unambiguously, clear which is being shown because they are guaranteed to have different formats.
For
Dict
it would beDict<pk=321 | value={'my_key': 'my-value', 'my_nested-dict': {'see me nesting': 'I\'m a bird'}}>
. While this makes for a large output, it also makes a lot of sense in the context of theNode
type to have everything.
Again, I would explicitly not do this. I think that we should define __str__
such that we can always use it effectively in CLI commands. Once you start adding these long outputs, that simply is no longer the case. Just for example, take verdi node comment
, where you probably want to say at the end something like:
echo.echo_success(f'Added comment to {node}')
All I want to see here is
Success: Added comment to Node<15123>
If this happens to be an output_parameters
Dict node from a calcjob, with your suggestion you would get
Success: Added comment to Node<pk=321 | value={"absolute_magnetization": 2.58,"absolute_magnetization_units": "Bohrmag / cell""atomic_charges_units": "e""atomic_magnetic_moments_units": "Bohrmag / cell""beta_real_space": false,"charge_density": "./charge-density.dat""constraint_mag": 0,"convergence_info": {"scf_conv": {"convergence_achieved": true,"n_scf_steps": 20,"scf_error": 1.1066269391433e-11} },"creator_name": "pwscf""creator_version": "6.5""degauss": 0.136056917253,"dft_exchange_correlation": "PBE""do_magnetization": true,"do_not_use_time_reversal": false,"energy": -8959.052366436,"energy_accuracy": 2.993252179566e-10,"energy_accuracy_units": "eV""energy_ewald": -4799.3964595658,"energy_ewald_units": "eV""energy_hartree": 1739.8835448164,"energy_hartree_units": "eV""energy_one_center_paw": -2072.2925053599,"energy_one_center_paw_units": "eV""energy_one_electron": -2881.2005051896,"energy_one_electron_units": "eV""energy_smearing": -0.00081552516201448,"energy_smearing_units": "eV""energy_threshold": 1.21e-12,"energy_units": "eV""energy_xc": -946.04562574798,"energy_xc_units": "eV""estimated_ram_per_process": 422.38,"estimated_ram_per_process_units": "MB""estimated_ram_total": 844.77,"estimated_ram_total_units": "MB""exit_status": 0,"fermi_energy": 19.3607,"fermi_energy_units": "eV""fft_grid": [60,60,60 ],"forces_units": "ev / angstrom""format_name": "QEXSD""format_version": "19.03.04""has_dipole_correction": false,"has_electric_field": false,"init_wall_time_seconds": 84.8,"inversion_symmetry": true,"lattice_symmetries": [],"lda_plus_u_calculation": false,"lkpoint_dir": false,"lsda": true,"magnetization_angle1": [0.0,0.0 ],"magnetization_angle2": [0.0,0.0 ],"monkhorst_pack_grid": [23,23,23 ],"monkhorst_pack_offset": [0,0,0 ],"no_time_rev_operations": false,"non_colinear_calculation": false,"number_of_atomic_wfc": 26,"number_of_atoms": 2,"number_of_bands": 20,"number_of_bands_down": 20,"number_of_bands_up": 20,"number_of_bravais_symmetries": 48,"number_of_electrons": 32.0,"number_of_k_points": 364,"number_of_species": 2,"number_of_spin_components": 2,"number_of_symmetries": 48,"occupations": "smearing""q_real_space": false,"rho_cutoff": 14694.147063324,"rho_cutoff_units": "eV""scf_iterations": 20,"smearing_type": "mv""smooth_fft_grid": [32,32,32 ],"spin_orbit_calculation": false,"spin_orbit_domag": true,"starting_magnetization": [-0.25,0.25 ],"stress_units": "GPascal""symmetries": [{"symmetry_number": 0,"t_rev": "0"},{"symmetry_number": 1,"t_rev": "0"},{"symmetry_number": 2,"t_rev": "0"},{"symmetry_number": 3,"t_rev": "0"},{"symmetry_number": 4,"t_rev": "0"},{"symmetry_number": 5,"t_rev": "0"},{"symmetry_number": 6,"t_rev": "0"},{"symmetry_number": 7,"t_rev": "0"},{"symmetry_number": 8,"t_rev": "0"},{"symmetry_number": 9,"t_rev": "0"},{"symmetry_number": 10,"t_rev": "0"},{"symmetry_number": 11,"t_rev": "0"},{"symmetry_number": 12,"t_rev": "0"},{"symmetry_number": 13,"t_rev": "0"},{"symmetry_number": 14,"t_rev": "0"},{"symmetry_number": 15,"t_rev": "0"},{"symmetry_number": 16,"t_rev": "0"},{"symmetry_number": 17,"t_rev": "0"},{"symmetry_number": 18,"t_rev": "0"},{"symmetry_number": 19,"t_rev": "0"},{"symmetry_number": 20,"t_rev": "0"},{"symmetry_number": 21,"t_rev": "0"},{"symmetry_number": 22,"t_rev": "0"},{"symmetry_number": 23,"t_rev": "0"},{"symmetry_number": 32,"t_rev": "0"},{"symmetry_number": 33,"t_rev": "0"},{"symmetry_number": 34,"t_rev": "0"},{"symmetry_number": 35,"t_rev": "0"},{"symmetry_number": 36,"t_rev": "0"},{"symmetry_number": 37,"t_rev": "0"},{"symmetry_number": 38,"t_rev": "0"},{"symmetry_number": 39,"t_rev": "0"},{"symmetry_number": 40,"t_rev": "0"},{"symmetry_number": 41,"t_rev": "0"},{"symmetry_number": 42,"t_rev": "0"},{"symmetry_number": 43,"t_rev": "0"},{"symmetry_number": 44,"t_rev": "0"},{"symmetry_number": 45,"t_rev": "0"},{"symmetry_number": 46,"t_rev": "0"},{"symmetry_number": 47,"t_rev": "0"},{"symmetry_number": 48,"t_rev": "0"},{"symmetry_number": 49,"t_rev": "0"},{"symmetry_number": 50,"t_rev": "0"},{"symmetry_number": 51,"t_rev": "0"},{"symmetry_number": 52,"t_rev": "0"},{"symmetry_number": 53,"t_rev": "0"},{"symmetry_number": 54,"t_rev": "0"},{"symmetry_number": 55,"t_rev": "0"} ],"symmetries_units": "crystal""time_reversal_flag": true,"total_force": 0.0,"total_force_units": "ev / angstrom""total_magnetization": 0.0,"total_magnetization_units": "Bohrmag / cell""total_number_of_scf_iterations": 20,"volume": 21.839617220476,"wall_time": " 1h13m ","wall_time_seconds": 4380.0,"wfc_cutoff": 1224.512255277,"wfc_cutoff_units": "eV"
}>
How is this useful? I know I picked an extreme example, but that is just the point. There is no way to make this work for all node types. All we really need is the UUID or pk. There are all sorts of other commands to figure out more information, should you need to, using that identifier.
That being said, I am not saying there shouldn't be some other method that provides richer information. Like @giovannipizzi mentioned, there is the get_description
that exists for certain classes that might be a better solution for this. Or even better, indeed some way to provide any number of rich formats with detailed information. The difference is that these should be called explicitly. I think the default display (as provided by __str__
) should be as consistent and as well-suited for inline use in text as possible.
@chrisjsewell 's graphviz PR #2596 needs to know which "short description" should be shown for a given AiiDA node.
It makes sense that this information be provided by the node itself rather than e.g. by the graph visualization tool.
Status quo
There are (at least) two built-in python functions used for this:
__str__
and__repr__
Their implementations in AiiDA often differ - see below:
Proposal
The python docs say on
__repr__
So, I believe a value like
<Int: value: 30, uuid: 15f11fcf-02fa-4bbf-b3ef-ee1139f3206c (unstored) >
makes sense here.The python docs say on
__str__
Do we even need to show the UUID here? This could be the description that Chris uses, e.g.
value: 30
.Even if we decide to show the UUID in both cases, I think we should factor out the
value: 30
part so that it can be reused by__str__
, by__repr__
and by @chrisjsewell .@sphuber @giovannipizzi Please let me know your take on this.