aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
431 stars 187 forks source link

Formalize the mapping between nested process port namespaces to flat link labels #2688

Closed sphuber closed 5 years ago

sphuber commented 5 years ago

The Process and ProcessNode duality forces us to define a mapping between the nested port namespaces of the process and the flat link label hierarchy of the corresponding node in the provenance graph. A process can define a nested port namespace through its spec:

class NestedProcess(Process):

    @classmethod
    def define(cls, spec):
        super(NestedProcess, cls).define(spec)
        spec.input('name.spaced.input')
        spec.output('name.spaced.output')

If we print the inputs port namespace, we get the following:

In [6]: print(NestedProcess.spec().inputs)
{
    "_attrs": {
        "default": [], 
        "dynamic": false, 
        "help": null, 
        "required": "True", 
        "valid_type": "None"
    },
    "name": {
        "_attrs": {
            "default": [], 
            "dynamic": false, 
            "help": null, 
            "required": "True", 
            "valid_type": "None"
        }, 
        "spaced": {
            "_attrs": {
                "default": [], 
                "dynamic": false, 
                "help": null, 
                "required": "True", 
                "valid_type": "None"
            }, 
            "input": {
                "name": "input", 
                "non_db": "False", 
                "required": "True"
            }
        }
    }
}

When constructing this process, or any process, the inputs are expected to map exactly onto this namespace. So for example, inputs = {'name': {'spaced': {'input': 1}}} would be valid. The same principles go for outputs and they should be registered in such a way that it maps one-to-one on the port namespace. For the duration of the process, the inputs and outputs stay organized in this nested hierarchy. However, finally the process needs to be represented in the provenance graph by a node, and the inputs and outputs should be linked up. The link label, which is the only thing that can be used to represent the namespacing, is a string and therefore necessarily one-dimensional. The nested namespace therefore needs to be mapped onto this flat space.

Currently, the namespaces are just concatenated with underscores. So the input node in the example would get the link label name_spaced_input. However, now it becomes indistinguishable from an input that was passed to a process with the name name_spaced_input in the top level namespace.

We should clearly define the rules for the mapping between nested namespaces and flat link labels. This should also keep in mind that if there is any risk of overlapping labels when expanding namespaces. For example, in the current situation, what would happen for the following input port namespace:

class NestedProcess(Process):

    @classmethod
    def define(cls, spec):
        super(NestedProcess, cls).define(spec)
        spec.input('name.spaced.input')
        spec.input('name_spaced_output')

This is currently legal as far as the process spec is concerned. The first is a doubly nested namespace and the latter is a single port in the top level namespace. However, the link labels that will be generated will overlap, making it impossible to link both inputs to the node.

The rules for links should also obey the rules of valid python variable names, as they are used in certain methods for provenance graph traversal, see #2568

Mentioning @giovannipizzi for discussion

giovannipizzi commented 5 years ago

My suggestion:

  1. use double underscore for namespace separation: spec.input('name.spaced.input') becomes name__spaced__input
  2. disallow double underscores from within any port or namespace name (single underscores are allowed except for what is stated below)
  3. disallow a port name starting or ending with an underscore

Something like (didn't test):

def is_valid_name(name):
    if "__" in name:
        return False
    if name.starstwith("_"):
        return False
    if name.endswith("_"):
        return False
    if not(isidentifier(name)): # check e.g. here: https://stackoverflow.com/questions/12700893
        return False
    return True

Also: should we allow any python identifier (including Unicode like, I believe, in python 3)? Personally I would limit only to [a-zA-Z0-9_] with the python additional rules (don't start with a digit, don't be a python keyword like global, from, as, ...).

Number 3. is actually more restrictive than needed - one could either a) just disallow names starting with underscore, OR b) disallow names ending with underscore. In this case, the only thing that can happen is to find three underscores in a row, and knowing which rule between a) and b) we have chosen, reconstruct the links and namespaces. However, I don't think people are using names starting or ending with underscore, and starting with an underscore has a special meaning in python (e.g. node.inputs.<TAB> would not complete these - so I would for now disallow both of them, and in the future we can also partially lift this requirement.

I believe that with the three rules above, the mapping namespaces->link label is an injective function (of course it's not surjective, e.g. you cannot get a label like a______b) but being injective one can always, given a valid link label, invert it to know it if was a variable, a namespace, ...

sphuber commented 5 years ago

I like it and would also vote to, for now, disallow port names starting or ending with an underscore.

The only problem I can see for now is how to deal with a data migration. It will be possible to write a migration that does the right thing 100%. Even though WorkChains are the only processes in 0.12 that had namespacing and so will be the only ones that are affected, many of them will have used underscores in port names. The calculation jobs are not released yet, so as soon as we formalize this, we can send an email to the developers notifying them about the change.

Whether unmigrated data will be a problem depends on what code we will write that depends on a correct interpretation. Some examples I can think of now:

sphuber commented 5 years ago

The new verdi process show command may instead of:

nbands_factor                    1355  Float
relax
final_scf                        1357  Bool
max_meta_convergence_iterations  1352  Int
base
code                             3     Code
parameters                       1347  Dict
kpoints_distance                 1354  Float
pseudo_family                    1346  Str
options                          1345  Dict
max_iterations                   1351  Int
relaxation_scheme                1344  Str
meta_convergence                 1342  Bool
volume_convergence               1353  Float
bands
code                             3     Code
parameters                       1347  Dict
kpoints_distance                 1349  Float
pseudo_family                    1346  Str
options                          1345  Dict
max_iterations                   1350  Int
scf
code                             3     Code
parameters                       1347  Dict
kpoints_distance                 1356  Float
pseudo_family                    1346  Str
options                          1345  Dict
max_iterations                   1348  Int
clean_workdir                    1343  Bool
structure                        106   StructureData

produce something like:

nbands_factor                        1355  Float
relax
    final_scf                        1357  Bool
    max_meta_convergence_iterations  1352  Int
    base
        code                         3     Code
        parameters                   1347  Dict
        kpoints_distance             1354  Float
        pseudo_family                1346  Str
        options                      1345  Dict
        max_iterations               1351  Int
    relaxation_scheme                1344  Str
    meta_convergence                 1342  Bool
    volume_convergence               1353  Float
bands
    code                             3     Code
    parameters                       1347  Dict
    kpoints_distance                 1349  Float
    pseudo_family                    1346  Str
    options                          1345  Dict
    max_iterations                   1350  Int
scf
    code                             3     Code
    parameters                       1347  Dict
    kpoints_distance                 1356  Float
    pseudo_family                    1346  Str
    options                          1345  Dict
    max_iterations                   1348  Int
clean_workdir                        1343  Bool
structure                            106   StructureData