Matgenix / jobflow-remote

jobflow-remote is a Python package to run jobflow workflows on remote resources.
https://matgenix.github.io/jobflow-remote/

Does atomate2 (or any other package specific to a job) need to be installed on the JFR server #41

Open utf opened 7 months ago

utf commented 7 months ago

My setup is as follows:

I thought I would only need to install atomate2 on the HPC remote and the local computer, since all atomate2-specific code is only executed on those machines. However, when trying to run a workflow in this configuration, I get an error on checkout:

jf job info 1
The selected project is val from config file /home/alex/.jfremote/val.yaml
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ created_on = '2023-12-08 11:15'                                                                                                 │
│      db_id = 1                                                                                                                  │
│      index = 1                                                                                                                  │
│   metadata = {}                                                                                                                 │
│       name = 'relax'                                                                                                            │
│    parents = []                                                                                                                 │
│   priority = 0                                                                                                                  │
│     remote = {                                                                                                                  │
│                  'step_attempts': 2,                                                                                            │
│                  'retry_time_limit': '2023-12-08 11:26',                                                                        │
│                  'error': Traceback (most recent call last):                                                                    │
│                File "/home/alex/src/jobflow-remote/src/jobflow_remote/jobs/jobcontroller.py", line 1876, in lock_job_for_update │
│                  yield lock                                                                                                     │
│                File "/home/alex/src/jobflow-remote/src/jobflow_remote/jobs/runner.py", line 259, in advance_state               │
│                  states_methods[state](lock)                                                                                    │
│                File "/home/alex/src/jobflow-remote/src/jobflow_remote/jobs/runner.py", line 266, in upload                      │
│                  job_doc = JobDoc(**doc)                                                                                        │
│                            ^^^^^^^^^^^^^                                                                                        │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/pydantic/main.py", line 164, in __init__                │
│                  __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)                │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 263, in validate_monty_v2          │
│                  return cls._validate_monty(__input_value)                                                                      │
│                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                      │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 242, in _validate_monty            │
│                  new_obj = MontyDecoder().process_decoded(__input_value)                                                        │
│                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                        │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 482, in process_decoded            │
│                  return cls_.from_dict(data)                                                                                    │
│                         ^^^^^^^^^^^^^^^^^^^^                                                                                    │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 199, in from_dict                  │
│                  decoded = {k: MontyDecoder().process_decoded(v) for k, v in d.items() if not k.startswith("@")}                │
│                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 199, in <dictcomp>                 │
│                  decoded = {k: MontyDecoder().process_decoded(v) for k, v in d.items() if not k.startswith("@")}                │
│                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 444, in process_decoded            │
│                  obj = self.process_decoded(d["@bound"])                                                                        │
│                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        │
│                File "/home/alex/miniconda3/lib/python3.11/site-packages/monty/json.py", line 477, in process_decoded            │
│                  mod = __import__(modname, globals(), locals(), [classname], 0)                                                 │
│                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                 │
│              ModuleNotFoundError: No module named 'atomate2'                                                                    │
│                                                                                                                                 │
│              }                                                                                                                  │
│      state = 'CHECKED_OUT'                                                                                                      │
│ updated_on = '2023-12-08 11:21'                                                                                                 │
│       uuid = '741295d6-f2f9-4107-967e-6d86615eb300'                                                                             │
│     worker = 'cx'                                                                                                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Is this the intended behaviour?

ml-evs commented 7 months ago

I had this question too. I think the answer from @gpetretto / @davidwaroquiers at the time was that the runner is also intended to be run locally (i.e., the JFR server is just for the database). I find this a little counter-intuitive and think we need to do a better job of separating compute environments from workflow environments.

gpetretto commented 7 months ago

I should say that we usually don't have a separation between the "local" machine and the one running the daemon. So up to now we have always had only two entities: 1) the machine where the daemon runs, which is also used to submit calculations, and 2) the HPC system.

I guess the problem you are raising is not so trivial. First, it does not happen during the checkout (since the state is CHECKED_OUT, the checkout has already completed). The problem arises when uploading. See:

           File "/home/alex/src/jobflow-remote/src/jobflow_remote/jobs/runner.py", line 266, in upload                     
              job_doc = JobDoc(**doc) 

The problem is that uploading requires resolving the job references, so the Job gets deserialized, and this in turn tries to deserialize the function associated with it. The Job object is used in a few other places in the runner, because in all those cases it uses the Job's methods to perform the standard jobflow actions. One way to avoid this would be to not deserialize the Job, or at least part of it, and reimplement part of jobflow's functionality so that it can work directly on the (partially) serialized version of the Job. Anyway, the Job is used quite a few times, so I would need to check carefully which functionalities are actually needed. Do you have other suggestions for avoiding the Job deserialization?
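To illustrate, here is a simplified sketch (a hypothetical helper, not monty's actual code) of why the upload step fails: every serialized object carries an `@module` key, and decoding it imports that module, so a package that is missing on the daemon machine raises `ModuleNotFoundError` even though the job never executes there.

```python
import importlib


def sketch_decode(obj):
    """Recursively decode a serialized document, importing the module
    named in each ``@module`` key (roughly what MontyDecoder does).
    If that package is not installed on the machine running the daemon,
    this raises ModuleNotFoundError."""
    if isinstance(obj, dict):
        if "@module" in obj:
            # This import is what produces "No module named 'atomate2'"
            importlib.import_module(obj["@module"])
        return {k: sketch_decode(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sketch_decode(v) for v in obj]
    return obj


# A job document whose function lives in a package missing on the daemon
# machine (module path is illustrative)
doc = {"function": {"@module": "atomate2.vasp.jobs.core",
                    "@callable": "RelaxMaker"}}
try:
    sketch_decode(doc)
except ModuleNotFoundError as exc:
    print(exc)  # prints: No module named 'atomate2'
```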

utf commented 7 months ago

Ok, I think this should be possible to fix. It is relatively easy to get the references without needing to deserialise the whole thing; I might already have some code in jobflow to do that. I'll have a look at where the full job object is used. Perhaps we can get away without deserialising any attributes of the job.

gpetretto commented 7 months ago

If you already have this part I will give it a try. Since the object is often deserialized in the Runner to access the content as properties, rather than navigating through nested dictionaries, I will need to go through the points where this happens.

I would also add one potential downside of splitting the machines where the jf command and the daemon are executed. In some cases I check if the Runner is active before proceeding with an action, e.g.: https://github.com/Matgenix/jobflow-remote/blob/138a7b1763d12927e0e741d550e9dd9e750f588b/src/jobflow_remote/cli/job.py#L263 https://github.com/Matgenix/jobflow-remote/blob/d6ec17cb4f14cbbf7ed211101e8dd4fe1f723d8e/src/jobflow_remote/cli/admin.py#L59 This is not a mandatory check, but the idea is to help the user avoid running actions that may lead to inconsistencies.

utf commented 7 months ago

Here is the code that can find references. You can pass a serialised dict to this and it will return OutputReference objects:

https://github.com/materialsproject/jobflow/blob/8ffcf525688f261bd12bd4ffff129bf0fc0a9101/src/jobflow/core/reference.py#L357
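As a sketch of the idea (not jobflow's actual implementation): references can be collected from the raw serialized dict by matching the `@module`/`@class` markers of `OutputReference`, without decoding anything else.

```python
def find_reference_uuids(obj, found=None):
    """Walk a serialized job document and collect the uuids of any
    OutputReference entries, identified purely by their @module/@class
    markers -- no other objects get deserialized or imported."""
    if found is None:
        found = []
    if isinstance(obj, dict):
        if (obj.get("@module") == "jobflow.core.reference"
                and obj.get("@class") == "OutputReference"):
            found.append(obj["uuid"])
        else:
            for v in obj.values():
                find_reference_uuids(v, found)
    elif isinstance(obj, (list, tuple)):
        for v in obj:
            find_reference_uuids(v, found)
    return found


# Illustrative serialized input containing one reference
doc = {"structure": {"@module": "jobflow.core.reference",
                     "@class": "OutputReference",
                     "uuid": "741295d6", "attributes": []},
       "other": [1, 2, {"nested": "value"}]}
print(find_reference_uuids(doc))  # prints: ['741295d6']
```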

It is quite convenient to have a centralised server running the daemon. I suppose a future update could have the daemon also run a server that clients can connect to.

gpetretto commented 7 months ago

I have worked on avoiding deserializing everything that requires external packages in the Runner. There were several points I needed to modify, as it relied heavily on the object's attributes and methods, in some cases just for convenience, in others to avoid reimplementing functionality. However, even using find_and_get_references directly, deserialization happens down the line here: https://github.com/materialsproject/jobflow/blob/8ffcf525688f261bd12bd4ffff129bf0fc0a9101/src/jobflow/core/reference.py#L173 Of course this is done to be able to access the object's attributes when resolving references, and it is an important part of jobflow. I have tested that, in my case, this change seems to still work and solves the issue:

--- a/src/jobflow/core/reference.py
+++ b/src/jobflow/core/reference.py
@@ -170,7 +170,6 @@ class OutputReference(MSONable):
         data = cache[self.uuid][index]

         # decode objects before attribute access
-        data = MontyDecoder().process_decoded(data)

         # re-cache data in case other references need it
         cache[self.uuid][index] = data
@@ -180,7 +179,7 @@ class OutputReference(MSONable):
             data = (
                 data[attr]
                 if attr_type == "i" or isinstance(data, dict)
-                else getattr(data, attr)
+                else data.get(attr)
             )

         return data

I tested it with a couple of workflows (including atomate2) but I am not sure if there are cases I am not considering. In addition, this will break the standard managers, which I suppose already expect the references to be deserialized when resolved. So there should be a deserialize: bool argument in the method to switch the deserialization on/off. If you think this is an acceptable solution I can open a PR for jobflow.
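A sketch of what the proposed switch could look like (names and the attribute-tuple format are taken loosely from jobflow's `OutputReference.resolve`; this is not the real implementation):

```python
def resolve(data, attributes, deserialize=True):
    """Sketch of OutputReference.resolve with a deserialize switch.

    With deserialize=False the cached output stays a plain dict:
    stored keys can still be reached, but nothing gets imported or
    instantiated, so missing packages are never a problem.
    """
    if deserialize:
        # jobflow would run MontyDecoder().process_decoded(data) here,
        # importing e.g. pymatgen -- exactly the step the Runner would
        # like to skip (omitted in this sketch)
        pass
    for attr_type, attr in attributes:
        # "i" marks index access; attribute access on a still-serialized
        # dict becomes plain key access
        data = (data[attr] if attr_type == "i" or isinstance(data, dict)
                else data.get(attr))
    return data


# Stand-in for a job's cached, still-serialized output document
output = {"energy": -10.5,
          "structure": {"@module": "pymatgen.core.structure",
                        "@class": "Structure", "volume": 100.0}}
print(resolve(output, [("a", "structure"), ("a", "volume")],
              deserialize=False))  # prints: 100.0
```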

I think these changes in jobflow-remote are beneficial independently of the centralised server approach, since they will also reduce the Runner's work by avoiding a lot of (de)serialization. However, this comes at the price of working with dictionaries instead of objects and having reimplemented some of jobflow's functionality, which can make the implementation a bit more fragile with respect to changes in jobflow. For this reason, before a final decision about this point is taken, I have put these latest changes in another branch (https://github.com/Matgenix/jobflow-remote/tree/references). Can you test it and check if this meets your requirements?

davidwaroquiers commented 7 months ago

Sounds good to me. I would say such an addition to jobflow would be reasonable with the deserialize switch argument. What do you think @utf ?

utf commented 7 months ago

Thanks so much for looking into this. I think adding a deserialise option sounds like a good solution and should work in most cases.

This won't work if the user tries to do something like job.output.structure.volume, where volume is a dynamically calculated property: resolving it requires pymatgen to be installed to initialise the structure. In that case, there isn't any option other than installing pymatgen on the server.
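To illustrate with a toy class (standing in for a pymatgen-like object; not the real Structure API): a derived property exists only on the live object, never in its serialized dict, so dict access alone cannot resolve it.

```python
class ToyStructure:
    """Stand-in for an MSONable object whose 'volume' is computed,
    not stored, so it never appears in the serialized dict."""

    def __init__(self, a):
        self.a = a  # lattice parameter

    @property
    def volume(self):
        # derived at access time from the stored data
        return self.a ** 3

    def as_dict(self):
        # only stored fields survive serialization
        return {"a": self.a}


s = ToyStructure(3.0)
print(s.volume)                    # prints: 27.0 -- works on the live object
print(s.as_dict().get("volume"))   # prints: None -- absent from the dict
```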

davidwaroquiers commented 7 months ago

To keep things flexible (your structure.volume example will not work if jobflow-remote always avoids deserialization), we should have a config option in jobflow-remote to say "I always want objects to be deserialized (or not)". What should the default be? I would say to avoid deserialization, but I am not so sure. Consider when (in the future) some jobs (e.g. supercell generation, refine_structure or similar) are actually executed on the jobflow-remote server (instead of being submitted to a cluster queue): then you would actually need things to be deserialized.

gpetretto commented 7 months ago

Indeed, I did not think of derived properties. As @davidwaroquiers said, there is no way to tell beforehand whether the deserialization is needed. I see two possibilities:

  1. Add an option to the runner configuration, as proposed by David. I am not sure what the default should be, since not deserializing could be beneficial for all users, but if one gets an error it may not be obvious where it is coming from.
  2. I can propose a slightly different change to jobflow:

    --- a/src/jobflow/core/reference.py
    +++ b/src/jobflow/core/reference.py
    @@ -170,18 +170,13 @@ class OutputReference(MSONable):
         data = cache[self.uuid][index]
    
         # decode objects before attribute access
    -        data = MontyDecoder().process_decoded(data)
    
         # re-cache data in case other references need it
         cache[self.uuid][index] = data
    
         for attr_type, attr in self.attributes:
             # i means index else use attribute access
    -            data = (
    -                data[attr]
    -                if attr_type == "i" or isinstance(data, dict)
    -                else getattr(data, attr)
    -            )
    +            data = data[attr]
    
         return data

If it fails because of a KeyError, TypeError or IndexError, it will then try to resolve the references with deserialization. Again, it is difficult to say whether this is beneficial or not: depending on how often the fallback happens, it can be either a waste of time or an advantage for the Runner.
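Option 2's fallback could be sketched roughly like this (hypothetical helper names, not jobflow's actual API):

```python
def resolve_with_fallback(data, attributes, deserialize_fn):
    """Try plain dict/item access on the serialized data first; only
    if that fails (e.g. a derived property that is not stored in the
    dict) fall back to full deserialization via deserialize_fn."""
    try:
        out = data
        for _attr_type, attr in attributes:
            out = out[attr]
        return out
    except (KeyError, TypeError, IndexError):
        # Deserialize (importing whatever packages are needed) and
        # retry with real attribute access on the live object
        obj = deserialize_fn(data)
        for _attr_type, attr in attributes:
            obj = getattr(obj, attr)
        return obj


# Illustrative fallback: builds a live object exposing a derived property
def fake_deserialize(d):
    class Live:
        volume = d["a"] ** 3  # derived, never stored in the dict
    return Live()


# Stored key resolves without any deserialization
print(resolve_with_fallback({"energy": -1.0}, [("a", "energy")],
                            fake_deserialize))  # prints: -1.0
# Derived property triggers the deserialization fallback
print(resolve_with_fallback({"a": 3.0}, [("a", "volume")],
                            fake_deserialize))  # prints: 27.0
```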

What do you think?

davidwaroquiers commented 1 month ago

This can be closed as it was fixed in #47, right? @gpetretto

gpetretto commented 1 month ago

I don't think this is completely solved. Quoting from the PR:

A machine only running the daemon still cannot avoid the presence of flows packages. https://github.com/materialsproject/jobflow/pull/512 needs to be merged before enabling it.

Some time has passed, but I think we never decided how to handle the case of derived properties. The point is that there are cases where it is mandatory to deserialize, or the references cannot be resolved. The options should still be those mentioned above: https://github.com/Matgenix/jobflow-remote/issues/41#issuecomment-1857596692