aiidateam / aiida-testing

A pytest plugin to simplify testing of AiiDA plugins.
MIT License

First Implementation of export cache functionality #29

Closed broeder-j closed 4 years ago

broeder-j commented 4 years ago

Hi, I put this here for review, @PhilippRue and @greschd, since my time to work on this is sparse. The code might not be totally polished yet; I left both fixture versions (run_with_cache, with_export_cache) side by side. Maybe we should do a first merge, so that we can continue to work on smaller things one at a time.

After the Slack discussion: the run_with_cache fixture was split, so that one can use export_cache and load_cache on their own. There are simple tests for these. In order for the run_with_cache fixture test to work, I monkeypatched Code._get_objects_to_hash and CalcJob._get_objects_to_hash (since the hash included the computer uuid); there is a fixture for this. Thanks @greschd for this suggestion.
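A minimal sketch of that monkeypatch, mirroring what the hash_code_by_entrypoint fixture does for the Code class (the real fixture also patches the CalcJob hashing; details may differ):

import pytest
from aiida.orm import Code

@pytest.fixture
def hash_code_by_entrypoint(monkeypatch):
    """Make Code hashes independent of the computer they are configured on."""
    def mock_objects_to_hash(self):
        # hash only the input plugin entry point, not the computer
        return [self.get_attribute(key='input_plugin')]
    monkeypatch.setattr(Code, '_get_objects_to_hash', mock_objects_to_hash)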

PhilippRue has solved this issue differently, by giving the aiida-core and mock-code fixtures session scope instead of function scope and by importing computers and codes. In aiida-kkr he has working examples using run_with_cache. This may not be a solution that suits all developers, though.

Furthermore, I have also implemented a second fixture to run with exports, where you have to specify which export to use (feature suggested by @greschd).

Known issue: sometimes the caching does not work when running (which causes the tests to fail), which I do not understand, since the hash of the imported and the running calcjob is the same. @greschd, maybe you can spot the reason for this.

I left both fixture versions in, since I would like the export name to be chosen automatically by default. Fixes #20.
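As an illustration of the automatic-naming idea (not what the PR implements), the archive name could be derived from the test name via pytest's built-in request fixture:

import pytest

@pytest.fixture
def export_cache_name(request):
    """Derive the export archive name from the requesting test's name."""
    # e.g. 'test_run_with_cache' -> 'test_run_with_cache.tar.gz'
    return '{}.tar.gz'.format(request.node.name)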

broeder-j commented 4 years ago

OK, the calcjob hash on the CI is different...

greschd commented 4 years ago

Thanks! I will give this a more thorough review later, but regarding the first point: We can definitely merge this even if it's not 100% ready yet, since the same is true for the package as a whole.

Let me know if you want me to take a look at the CI issue.

broeder-j commented 4 years ago

Also, I am not sure how to resolve these pre-commit mypy issues:

tests/export_cache/test_export_cache.py:23: error: Class cannot subclass 'WorkChain' (has type 'Any')
tests/export_cache/test_export_cache.py:23: error: Base type WorkChain becomes "Any" due to an unfollowed import
aiida_testing/export_cache/_fixtures.py:315: error: Need type annotation for 'union_pk' (hint: "union_pk: Set[<type>] = ...")
Found 3 errors in 2 files (checked 13 source files)

greschd commented 4 years ago

I've added comments on how to fix the mypy issues - but I'm also not 100% convinced that mypy is a net benefit here, given that I also have limited experience with it so far.
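For reference, a sketch of the kind of fixes the error messages themselves suggest (the actual review comments may differ). The "unfollowed import" error behind the WorkChain issue is usually silenced via mypy.ini rather than in code (see the example entry further below); the missing annotation follows mypy's own hint:

from typing import Set

# Explicit annotation for _fixtures.py:315, following mypy's hint
# (assuming the collected pks are integers):
union_pk: Set[int] = set()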

This depends on whether you (and the other contributors) want to get acquainted with mypy, or if it's mostly a nuisance.

Tagging @PhilippRue @ltalirz for comment.

ltalirz commented 4 years ago

Up to you, guys. I've never used mypy, but now that Python 3.5 is the lowest Python version we support (and 3.6 will be, come September), I would say it's only a matter of time until type hints enter aiida-core (they would certainly be welcome!).

Since there seem to be a couple of different flavors (+ changes again in 3.6), I'd appreciate it if someone who has experience would write a short AEP on how we should go about this.

greschd commented 4 years ago

Well, then maybe aiida-testing is a decent testing ground for that, too.

With "different flavors", do you mean external libraries? In the standard, I'm only aware of the typing module - and newer Python versions just add to that (and where hints are allowed).

ltalirz commented 4 years ago

Right, I just meant the different places for type hints - there is PEP 483, 484, 526, 3107 etc. - not external libraries. I guess it's quite straightforward if you've used it once.

greschd commented 4 years ago

Oh, I see. The way I understand it so far, these PEPs mostly build upon each other, so there's no competition between flavors, just different levels of what's possible.

broeder-j commented 4 years ago

Concerning mypy: it would be nice if one could ignore all types injected from external library imports, which is the only thing I find annoying. Also, the type annotations make things less readable, in my opinion. Still, I see the benefits.

greschd commented 4 years ago

The external library imports can be ignored by adding an entry in https://github.com/aiidateam/aiida-testing/blob/develop/mypy.ini for each import that should be ignored.

Ignoring all external imports is explicitly discouraged by mypy, because it can hide actual errors.
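For illustration, one such entry could look like this (the module pattern is just an example):

[mypy-aiida.*]
ignore_missing_imports = True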

broeder-j commented 4 years ago

OK, regarding the differing hash on the CI, these are the objects to hash:

 [{'withmpi': False, 'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}, 'append_text': '', 'parser_name': 'diff', 'prepend_text': '', 'output_filename': 'patch.diff', 'scheduler_stderr': '_scheduler-stderr.txt', 'scheduler_stdout': '_scheduler-stdout.txt', 'mpirun_extra_params': [], 'environment_variables': {}, 'import_sys_environment': True, 'custom_scheduler_commands': ''}, {'file2': 'f744e4551095439a50a78488b89bf85e6b9bd347d5f99bbfd2fde64954cf91c5', 'file1': 'f6d1c358fb7a03d0831faf710b39ea945911fddab259be85ed0d25a51e62c992', 'parameters': '57b7298bdbd24a13add102c387767d645f9c8e0a86a9d0b1ab6492bda59e0758', 'code': 'bce6cfbda7e548b4af0f27cf5cd39a1245b21f3f6c85a0b5a9278ccc7f6837cc'}]
[{'withmpi': False, 'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}, 'append_text': '', 'parser_name': 'diff', 'prepend_text': '', 'output_filename': 'patch.diff', 'scheduler_stderr': '_scheduler-stderr.txt', 'scheduler_stdout': '_scheduler-stdout.txt', 'mpirun_extra_params': [], 'environment_variables': {}, 'import_sys_environment': True, 'custom_scheduler_commands': ''}, {'file2': 'a403c9809fc94ab203da6536a184e194a7856d50e73369345e26bb9ab1f39e5a', 'file1': 'ee69f1c4e4caefadf18fec511d6c780dbfdbcc7ad8b8bfcb4a954b3851b0f654', 'parameters': 'a5045df6d3d8b067897494f3facf13b340316569c13ac747b1c611fbf7e19725', 'code': 'bce6cfbda7e548b4af0f27cf5cd39a1245b21f3f6c85a0b5a9278ccc7f6837cc'}]

The parameters and the single file have different hashes on the CI and on my computer. @greschd, why? Is the absolute path included in the hash for single file nodes, or is the repo location included for all nodes?

greschd commented 4 years ago

Hmm, I'm not sure about this..

I think the abspath is not responsible, because you can .clone() a SinglefileData locally and get the same hash. The repository location shouldn't be in the hash, but I have never done testing across different AiiDA installations / profiles.

In principle there could be a problem with uppercase / lowercase file names (depending on the file system) due to the Folder hashing, which is implemented here: https://github.com/aiidateam/aiida-core/blob/develop/aiida/common/hashing.py#L260

I guess a helper to figure out which parts of the hash differ would be quite convenient right now (see https://github.com/aiidateam/aiida-core/issues/2549), but for now: can you also print _get_objects_to_hash of the files and parameters?
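A quick way to do that (assuming a loaded aiida-core 1.x profile, e.g. inside the failing test):

from aiida.orm import Dict, QueryBuilder, SinglefileData

# Print what enters the hash of every SinglefileData and Dict node,
# so the CI output can be diffed against a local run:
for cls in (SinglefileData, Dict):
    for node, in QueryBuilder().append(cls).all():
        print(node.pk, node._get_objects_to_hash())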

greschd commented 4 years ago

Just tried creating the same SinglefileData on two different systems (with different repo locations), and I'm also getting the same hash.
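Such a check can be done along these lines (a sketch; filename and content are arbitrary):

import io
from aiida.orm import SinglefileData

# Identical content should give identical hashes if the repository
# location does not enter the hash:
node = SinglefileData(file=io.BytesIO(b'identical content'), filename='file1.txt').store()
print(node.get_hash())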

PhilippRue commented 4 years ago

On my system, all tests pass with these changes (compared to commit 97cc9c6, "type def and ignore for mypy, reverted previous change, excluded aiida version from calcjob hash"):

diff --git a/aiida_testing/export_cache/_fixtures.py b/aiida_testing/export_cache/_fixtures.py
index c8a4362..13ef977 100644
--- a/aiida_testing/export_cache/_fixtures.py
+++ b/aiida_testing/export_cache/_fixtures.py
@@ -159,7 +159,8 @@ def load_cache(hash_code_by_entrypoint):
         # currently this should only cache CalcJobNodes
         qub = QueryBuilder()
         qub.append(ProcessNode)  # query for all ProcesNodes
-        qub.append(Code)
+        # this breaks it:
+        #qub.append(Code)
         to_hash = qub.all()
         for node1 in to_hash:
             node1[0].rehash()
@@ -222,7 +223,8 @@ def hash_code_by_entrypoint(monkeypatch):
         """
         Return a list of objects which should be included in the hash of a Code node
         """
-        return [self.get_attribute(key='input_plugin'), self.get_computer_name()]
+        return [self.get_attribute(key='input_plugin')]
+        #return [self.get_attribute(key='input_plugin'), self.get_computer_name()]

     def mock_objects_to_hash_calcjob(self):
         """
diff --git a/tests/export_cache/caches/diff_workchain.tar.gz b/tests/export_cache/caches/diff_workchain.tar.gz
index ee313ac..795c096 100644
Binary files a/tests/export_cache/caches/diff_workchain.tar.gz and b/tests/export_cache/caches/diff_workchain.tar.gz differ
diff --git a/tests/export_cache/test_export_cache.py b/tests/export_cache/test_export_cache.py
index 98eabab..86c22c7 100644
--- a/tests/export_cache/test_export_cache.py
+++ b/tests/export_cache/test_export_cache.py
@@ -111,7 +111,8 @@ def test_mock_hash_codes(mock_code_factory, clear_database, hash_code_by_entrypo
         ignore_files=('_aiidasubmit.sh', 'file*')
     )
     objs = mock_code._get_objects_to_hash()
-    assert objs == [mock_code.get_attribute(key='input_plugin'), mock_code.get_computer_name()]
+    assert objs == [mock_code.get_attribute(key='input_plugin')]
+    #assert objs == [mock_code.get_attribute(key='input_plugin'), mock_code.get_computer_name()]

 @pytest.mark.timeout(60, method='thread')
@@ -153,7 +154,7 @@ def test_run_with_cache(
     #Test if cache was used?
     diffjob = node.get_outgoing().get_node_by_label('CALL')
     cache_src = diffjob.get_cache_source()
-    calc_hash_s = '833070207922ee3dccbc477db7bbbf871d9df68bef2624604144f76b0edc6f0a'
+    calc_hash_s = '0c048c64551bd0b099ed82a90ff2c36833a3f1d0620712e7059367a67dac5c91'
     calc_hash = diffjob.get_hash()
     assert calc_hash == calc_hash_s
     assert cache_src is not None
@@ -200,7 +201,7 @@ def test_with_export_cache(
     #Test if cache was used?
     diffjob = node.get_outgoing().get_node_by_label('CALL')
     cache_src = diffjob.get_cache_source()
-    calc_hash_s = '833070207922ee3dccbc477db7bbbf871d9df68bef2624604144f76b0edc6f0a'
+    calc_hash_s = '0c048c64551bd0b099ed82a90ff2c36833a3f1d0620712e7059367a67dac5c91'
     calc_hash = diffjob.get_hash()
     assert calc_hash == calc_hash_s
     assert cache_src is not None

PhilippRue commented 4 years ago

Note that changing the monkeypatch for the code hash is needed if one first imports an export file that contains the code, because the imported code gets the "(Imported)" addition to its name.

broeder-j commented 4 years ago

I now exclude the version from most AiiDA data types. I had to monkeypatch, because the version is doubled in there; now it works. ;-) Or do we want the tests to fail if the aiida version changes? If I have done it right, the version check is still in place for package-dependent data types which do not inherit from Dict, RemoteData, or SinglefileData. Maybe this can be done smarter; also, base types and arrays are not patched yet. Or one could patch Node in general to always exclude the version.

greschd commented 4 years ago

Hmm, maybe ignoring the (aiida or plugin) version can be a separate fixture? That way, a plugin developer can decide - if you always want to have the fixture, all you need to do is create a "dummy" fixture which uses it, and has autouse=True set.
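A sketch of the suggested pattern, where ignore_version_in_hash stands in for the hypothetical separate fixture:

import pytest

@pytest.fixture(autouse=True)
def always_ignore_version(ignore_version_in_hash):
    """Opt in to version-agnostic hashing for every test in this plugin."""
    yield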

PhilippRue commented 4 years ago

Hmm, maybe ignoring the (aiida or plugin) version can be a separate fixture? That way, a plugin developer can decide - if you always want to have the fixture, all you need to do is create a "dummy" fixture which uses it, and has autouse=True set.

I agree; right now the hash_code_by_entrypoint fixture does more than just change the hashing for the code (e.g. it changes it for SinglefileData as well).

PhilippRue commented 4 years ago

Also, we might want to provide an option for with_export_cache to overwrite the output file. Currently the export file is only written if no previous file is found.

This could be helpful for debugging/developing workflows (see https://groups.google.com/forum/?hl=en#!topic/aiidausers/y-ekFBlLgsg from mailing list).
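A hypothetical usage sketch (the flag name and signature are illustrative, not necessarily what got implemented; DiffWorkChain and its inputs are assumed to be set up elsewhere):

from aiida.engine import run_get_node

def test_diff_workchain(with_export_cache):
    # ... build DiffWorkChain inputs here (omitted) ...
    # overwrite=True would force re-writing the archive even if it exists
    with with_export_cache('diff_workchain.tar.gz', overwrite=True):
        result, node = run_get_node(DiffWorkChain, **inputs)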

broeder-j commented 4 years ago

Also, we might want to provide an option for with_export_cache to overwrite the output file. Currently the export file is only written if no previous file is found.

This could be helpful for debugging/developing workflows (see https://groups.google.com/forum/?hl=en#!topic/aiidausers/y-ekFBlLgsg from mailing list).

Done

greschd commented 4 years ago

@broeder-j @PhilippRue I think we can do the following: We merge this PR now - I will go over the functionality and try to implement the "node selection by what ran in the context manager" discussed above. After that I will create a new PR (possibly with some interface changes...), and have the two of you review it / test it on your plugins.

Does that work for you?

PhilippRue commented 4 years ago

@broeder-j @PhilippRue I think we can do the following: We merge this PR now - I will go over the functionality and try to implement the "node selection by what ran in the context manager" discussed above. After that I will create a new PR (possibly with some interface changes...), and have the two of you review it / test it on your plugins.

Does that work for you?

sounds good 👍

greschd commented 4 years ago

Since it seemed to be incompatible with the changes in #30, I merged it into a separate branch for now (to keep the tests running on develop).