aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
437 stars 192 forks source link

Duplicate CREATE links when importing old export file #4125

Open greschd opened 4 years ago

greschd commented 4 years ago

While importing an old export file (export version 0.3), the migrations run through without error, but the actual import raises:

Error: an exception occurred while importing the archive export-migrated.tar.gz
Traceback (most recent call last):
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/cmdline/commands/cmd_import.py", line 65, in _try_import
    import_data(file_to_import, group, **kwargs)
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/tools/importexport/dbimport/__init__.py", line 73, in import_data
    return import_data_dj(in_path, group=group, silent=silent, **kwargs)
  File "/home/a-dogres/.virtualenvs/aiida_data_import/lib/python3.7/site-packages/aiida/tools/importexport/dbimport/backends/django/__init__.py", line 590, in import_data_dj
    source.uuid, link_type, link['label']
aiida.tools.importexport.common.exceptions.ImportValidationError: Node<c4e6207a-c205-47d8-9b52-97c94d08657d> already has an outgoing LinkType.CREATE link with label "parameters"

This is due to the "unique pair" check, here: https://github.com/aiidateam/aiida-core/blob/develop/aiida/tools/importexport/dbimport/backends/django/__init__.py#L656

I can't quite remember, was it allowed in a previous version of AiiDA to have multiple outgoing link labels of the same name? If so, what would be a good way to resolve this issue?

In this particular case I can just patch the code to rename the links (since I don't care too much about their names), but in general that might be dangerous.

The export I'm trying to import here can be found at https://polybox.ethz.ch/index.php/s/I1se1WGAiP2iLYX, but it's large and also suffers from #3450 (which I fixed with a sed on the data.json) - so it's not great for testing on.

giovannipizzi commented 4 years ago

Pinging @CasperWA for comments as he's been working a lot with the migrations of export files

@greschd could you please quickly check all the outgoing links (type and label pairs) and report them here? In the past there was the possibility of both a return and create outgoing links from the same node. However, the migration tools should detect this specific case and not complain, so it's maybe something else

greschd commented 4 years ago

From the data.json, I checked the links where (input, label, type) is not unique:

I think the code producing these issues is here, although I'm not 100% sure this is the right commit version: https://github.com/greschd/aiida-tbextraction/blob/034f8a6ceaa300118064f1eb287037e310f3b45f/aiida_tbextraction/fp_run/_helpers/_inline_calcs.py

The labels of the offending links are {'bands', 'kpoints', 'parameters', 'result', 'wannier_parameters'}.

Also, I was using a sketchy development version of AiiDA at the time - it's entirely possible this is some edge case related to that.

CasperWA commented 4 years ago

We've tried to tackle the most general patches of these, but to get just right for a particular graph it ususally takes a bit of manual labour, since the desired outcome may vary from owner to owner, e.g., you don't care about the labels, others might.

Since this migration is most likely due to migrating from v0.3 to v0.4 of the export version, we can try to do this the "hard" way, by migrating to only this version first, then going over the export data.json file and adapt whatever you may need / try to implement your use case in the migration.

There might also be a log file somewhere with the list of links affected (perhaps in the folder where from where you ran the migrate/export command?

greschd commented 4 years ago

Yeah, the problem here is that the links aren't affected by the migration at all - so they also don't appear in the migration-<id>.log files.

It appears to me the "inline calculation" produced duplicates of the output nodes, which are both linked with the same label.

I'll give it a quick try at reproducing this with a clean 1.0.0a1. If that doesn't work I think we can close this - I can get around it for my own use case, and unless someone else reports the same problem it might just as well be due to the wonky setup I had at the time.

greschd commented 4 years ago

Found the culprit... image it's caching.

In 1.0.0a1, running an InlineCalculation with caching enabled produced two output links, one to the "initial" output and one to a cloned duplicate.

If this is the only place where this can occur I'd be tempted to mark this wontfix - it's an alpha release, and I was pretty much the only one using caching back then.

However, this made me remember how you could have duplicate CREATE links in previous versions: Outputs did not have to be unique, but then you needed to access them with linkname_<pk> to get a specific one:

In [17]: inline_calc.get_outputs_dict()
Out[17]: 
{u'res': <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>,
 u'res_15': <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>,
 u'res_16': <Data: uuid: c4005890-1bb5-47f4-b1fb-eeb85cfe0144 (pk: 16)>}

In [18]: inline_calc.out.res
Out[18]: <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>

In [19]: inline_calc.out.res_15
Out[19]: <Data: uuid: 52bed73b-cabc-49ce-aae1-588479454123 (pk: 15)>

In [20]: inline_calc.out.res_16
Out[20]: <Data: uuid: c4005890-1bb5-47f4-b1fb-eeb85cfe0144 (pk: 16)>

The link labels don't contain the _<pk> suffix:

In [28]: for link in inline_calc.dbnode.output_links.all():
    ...:     print(link, link.label)
    ...:     
(<DbLink: InlineCalculation (14) --> Float (16)>, u'res')
(<DbLink: InlineCalculation (14) --> Float (15)>, u'res')

Here's the link to the relevant documentation in 0.12.3 (the "note" at the very bottom): https://aiida.readthedocs.io/projects/aiida-core/en/v0.12.3/working_with_aiida/resultmanager.html

If memory serves me right, this could happen in any calculation / workchain / whatever, and the caching issue isn't actually creating an invalid link state as far as the old AiiDA version is concerned.

Maybe a way to fix this would be explicitly adding the _<pk> suffix to these link labels - with the "bare" link referring to the lowest-pk descendant (I think this was how it was decided, but that would need checking in the old code).

giovannipizzi commented 4 years ago

Hi! Great that you tracked this! The _pk was added only in python when using node.out. as, for outgoing links of data nodes, uniqueness is not required (as you noted also above).

Anyway - I agree. Let's set this to wontfix unless someone else has similar issues!

greschd commented 4 years ago

Sorry, I wasn't being totally clear: Since the caching / inline calcs are in fact not the only place where this can occur I would suggest not to mark this as wontfix.

The uniqueness was not required in the previous version, but now you can only have one CREATE link with the same label. I think there should be some workaround for this, since it is "normal" use of the previous AiiDA version.

P.S.: Even if we decide to mark it wontfix, the bug label would still apply IMO.

greschd commented 4 years ago

Hmm, I think I misunderstood something: Duplicate outgoing links with the same label still appear to be valid:

In [10]: from aiida.orm import CalcJobNode

In [11]: from aiida.orm import Data

In [12]: calc = CalcJobNode()

In [13]: d1 = Data()

In [14]: d2 = Data()

In [15]: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

In [16]: d2.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

So.. is this just an extra validation in the import that actually shouldn't be there?

greschd commented 4 years ago

Ha, just figured out how I tricked the link validation in the example above: Links only show up in get_outgoing after the target has been stored

In [16]: calc = CalcJobNode()

In [17]: calc.store()
Out[17]: <CalcJobNode: uuid: 6e3b970f-cc00-465e-a6c5-67507a3a7740 (pk: 49831)>

In [18]: d1 = Data()

In [19]: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test')

In [20]: calc.get_outgoing(link_type=LinkType.CREATE, link_label_filter='test', only_uuid=True).all()
Out[20]: []

In [21]: d1.store()
Out[21]: <Data: uuid: a776a906-dbfc-48d8-8b56-ac2627b07a28 (pk: 49832)>

In [22]: calc.get_outgoing(link_type=LinkType.CREATE, link_label_filter='test', only_uuid=True).all()
Out[22]: [LinkTriple(node='a776a906-dbfc-48d8-8b56-ac2627b07a28', link_type=<LinkType.CREATE: 'create'>, link_label='test')]

This is relevant here because validate_links uses get_outgoing to check for existing links, see https://github.com/aiidateam/aiida-core/blob/develop/aiida/orm/utils/links.py#L163

I'm not sure if

Tagging @sphuber for comment.

sphuber commented 4 years ago

I am confused. I thought this behavior was not allowed and therefore your example should be a bug. If you take your example and then store the nodes, there will be no exception and we have two links that violate the uniqueness constraints:

In [4]: from aiida.orm import CalcJobNode 
   ...: from aiida.orm import Data 
   ...: calc = CalcJobNode() 
   ...: d1 = Data() 
   ...: d2 = Data() 
   ...: d1.add_incoming(calc, link_type=LinkType.CREATE, link_label='test') 
   ...: d2.add_incoming(calc, link_type=LinkType.CREATE, link_label='test') 
   ...: calc.store()                                                                                                                                                                                 
Out[4]: <CalcJobNode: uuid: 94def715-33ab-4065-b750-aca73266d0f0 (pk: 11081)>

In [5]: calc.get_outgoing().all()                                                                                                                                                                                
Out[5]: []

In [6]: d1.store()                                                                                                                                                                                               
Out[6]: <Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>

In [7]: calc.get_outgoing().all()                                                                                                                                                                                
Out[7]: [LinkTriple(node=<Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>, link_type=<LinkType.CREATE: 'create'>, link_label='test')]

In [8]: d2.store()                                                                                                                                                                                               
Out[8]: <Data: uuid: e24e1f7f-6337-469b-a624-c1d13aa1c8eb (pk: 11083)>

In [9]: calc.get_outgoing().all()                                                                                                                                                                                
Out[9]: 
[LinkTriple(node=<Data: uuid: 023c9af7-328d-43ca-8ef5-05f4f0bb773c (pk: 11082)>, link_type=<LinkType.CREATE: 'create'>, link_label='test'),
 LinkTriple(node=<Data: uuid: e24e1f7f-6337-469b-a624-c1d13aa1c8eb (pk: 11083)>, link_type=<LinkType.CREATE: 'create'>, link_label='test')]

both links are stored but that should not have been allowed. That can be shown by calling calc.outputs which expects the labels to be unique:

In [10]: calc.outputs.test                                                                                                                                                                                       
---------------------------------------------------------------------------
MultipleObjectsError                      Traceback (most recent call last)
<ipython-input-10-6f9a623c1bc3> in <module>
----> 1 calc.outputs.test

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/managers.py in __getattr__(self, name)
     81         """
     82         try:
---> 83             return self._get_node_by_link_label(label=name)
     84         except NotExistent:
     85             # Note: in order for TAB-completion to work, we need to raise an

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/managers.py in _get_node_by_link_label(self, label)
     62         if self._incoming:
     63             return self._node.get_incoming(link_type=self._link_type).get_node_by_label(label)
---> 64         return self._node.get_outgoing(link_type=self._link_type).get_node_by_label(label)
     65 
     66     def __dir__(self):

~/code/aiida/env/dev/aiida-core/aiida/orm/utils/links.py in get_node_by_label(self, label)
    296                 else:
    297                     raise exceptions.MultipleObjectsError(
--> 298                         'more than one neighbor with the label {} found'.format(label)
    299                     )
    300 

MultipleObjectsError: more than one neighbor with the label test found

Weirdly, when looking at the tests in tests/orm/test_node.py we find the following test:

    def test_node_outdegree_unique_triple(self):
        """Test that the validation of links with outdegree `unique_triple` works correctly

        The example here is a `CalculationNode` that has two outgoing CREATE links with the same label, but to different
        target nodes. This is legal and should pass validation.
        """
        creator = CalculationNode().store()
        data_one = Data()
        data_two = Data()

        # Verify that adding two create links with the same link label but to different target is allowed from the
        # perspective of the source node (the CalculationNode in this case)
        data_one.add_incoming(creator, link_type=LinkType.CREATE, link_label='create')
        data_two.add_incoming(creator, link_type=LinkType.CREATE, link_label='create')
        data_one.store()
        data_two.store()

        uuids_outgoing = set(node.uuid for node in creator.get_outgoing().all_nodes())
        uuids_expected = set([data_one.uuid, data_two.uuid])
        self.assertEqual(uuids_outgoing, uuids_expected)

which tests exactly the example here and claims this should be find. However, the link validation code in aiida.orm.utils.links.validate_link we see:

    link_mapping = {
        LinkType.CALL_CALC: (WorkflowNode, CalculationNode, 'unique_triple', 'unique'),
        LinkType.CALL_WORK: (WorkflowNode, WorkflowNode, 'unique_triple', 'unique'),
        LinkType.CREATE: (CalculationNode, Data, 'unique_pair', 'unique'),
        LinkType.INPUT_CALC: (Data, CalculationNode, 'unique_triple', 'unique_pair'),
        LinkType.INPUT_WORK: (Data, WorkflowNode, 'unique_triple', 'unique_pair'),
        LinkType.RETURN: (WorkflowNode, Data, 'unique_pair', 'unique_triple'),
    }

clearly stating that CalculationNode -> Data should be unique_pair so target node + link label should be unique. I think the test is just wrong and it is not something we want to allow and there is a bug in the validation which I haven't found yet. Will report later.

greschd commented 4 years ago

bug in the validation

The validation relies on the outgoing node being stored because it uses get_outgoing.

greschd commented 4 years ago

I agree that the test is probably just wrong, since it's also pretty impossible to get into this situation with "normal" user code (workchains, calcjobs, etc.).