utf opened 7 months ago
I had this question too, I think the answer from @gpetretto / @davidwaroquiers at the time was that the runner is also intended to be run locally (i.e., the JFR server is just for the database). I find this a little counter-intuitive and think we need to do better separating compute environments from workflow environments.
I should say that we usually don't have a separation between the "local" machine and the one running the daemon. So up to now we have always had only two entities: 1) the machine where the daemon runs, which is also used to submit calculations, and 2) the HPC system.
I guess the problem that you are raising is not so trivial. First, it does not happen during the check_out (since the state is CHECKED_OUT, the checkout has already happened). The problem arises when uploading. See:

```
File "/home/alex/src/jobflow-remote/src/jobflow_remote/jobs/runner.py", line 266, in upload
    job_doc = JobDoc(**doc)
```
The problem is that uploading requires resolving the job references, so the `Job` gets deserialized, and thus it tries to deserialize the function associated with it. The `Job` object is also used at a few other points in the runner, because in all these cases it uses `Job`'s methods to perform the standard jobflow actions.

One way to avoid this would be to not deserialize the `Job`, or at least part of it, and reimplement part of jobflow's functionality so that it can work directly on the (partially) serialized version of the `Job`. Anyway, the `Job` is used quite a few times, so I would need to properly check which functionalities are exactly needed.
Do you have other suggestions to avoid the Job deserialization?
Ok, I think this should be possible to fix. It is relatively easy to get the references without needing to deserialise the whole thing. I might already have some code in jobflow to do that. I'll have a look at where the full job object is used. Perhaps we can get away without deserialising any attributes of the job.
If you already have this part I will give it a try. Since the object is often deserialized in the Runner to access its content as properties rather than by navigating through nested dictionaries, I will need to go through the points where this happens.
I would also add one potential downside of splitting the machines where the `jf` command and the daemon are executed. In some cases I check if the Runner is active before proceeding with an action, e.g.: https://github.com/Matgenix/jobflow-remote/blob/138a7b1763d12927e0e741d550e9dd9e750f588b/src/jobflow_remote/cli/job.py#L263 https://github.com/Matgenix/jobflow-remote/blob/d6ec17cb4f14cbbf7ed211101e8dd4fe1f723d8e/src/jobflow_remote/cli/admin.py#L59 Anyway, this is not a mandatory check, but the idea is to help the user avoid running actions that may lead to inconsistencies.
Here is the code that can find references. You can pass a serialised dict to this and it will return `OutputReference` objects:
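The original snippet is not preserved in this thread. As a stand-in, here is a minimal, hypothetical sketch of the idea: scan a serialised job document for dicts that carry the `OutputReference` serialisation marker, without deserialising anything else. The function name and document layout are illustrative, not jobflow's actual API.

```python
# Hypothetical sketch: collect serialised OutputReference entries from a job
# document without decoding any other objects. jobflow exposes similar logic;
# the names and document shape here are assumptions for illustration.

def find_reference_dicts(obj):
    """Recursively collect dicts that look like serialised OutputReferences."""
    found = []
    if isinstance(obj, dict):
        if obj.get("@class") == "OutputReference":
            found.append(obj)
        else:
            for value in obj.values():
                found.extend(find_reference_dicts(value))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            found.extend(find_reference_dicts(item))
    return found

# A toy serialised job document containing one nested reference
doc = {
    "function_args": [
        {
            "@module": "jobflow.core.reference",
            "@class": "OutputReference",
            "uuid": "abc-123",
            "attributes": [],
        },
        {"plain": "value"},
    ]
}

refs = find_reference_dicts(doc)
```

The key point is that only the `@class` marker is inspected, so no external package that defines the serialised objects needs to be importable on the machine running the scan.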
It is quite convenient to have a centralised server running the daemon, I suppose a future update could have the daemon also run a server that can be connected to.
I have worked on avoiding deserializing everything that requires external packages for the Runner. There were several points that I needed to modify, as the code relied heavily on the object's attributes and methods, in some cases just for convenience, in others to avoid reimplementing functionality.
However, even using `find_and_get_references` directly, deserialization happens down the line here:
https://github.com/materialsproject/jobflow/blob/8ffcf525688f261bd12bd4ffff129bf0fc0a9101/src/jobflow/core/reference.py#L173
Of course this is done to be able to access the object's attributes when resolving references, and it is an important part of jobflow. I have tested that in my case this change seems to still work and solves the issue:
```diff
--- a/src/jobflow/core/reference.py
+++ b/src/jobflow/core/reference.py
@@ -170,7 +170,6 @@ class OutputReference(MSONable):
         data = cache[self.uuid][index]

         # decode objects before attribute access
-        data = MontyDecoder().process_decoded(data)

         # re-cache data in case other references need it
         cache[self.uuid][index] = data
@@ -180,7 +179,7 @@
             data = (
                 data[attr]
                 if attr_type == "i" or isinstance(data, dict)
-                else getattr(data, attr)
+                else data.get(attr)
             )

         return data
```
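To illustrate the effect of skipping the decode step: once `MontyDecoder` is not applied, the cached data stays a plain dict, so attribute-style access has to go through the mapping interface. A toy example (the dict layout mimics an MSONable serialisation, but is only illustrative):

```python
# Once the MontyDecoder step is skipped, the cached output stays a plain
# dict, so getattr() no longer applies; key access must be used instead.
# The dict below imitates a serialised pymatgen Structure for illustration.

serialised = {
    "@module": "pymatgen.core.structure",
    "@class": "Structure",
    "lattice": {"a": 3.0},
}

# getattr(serialised, "lattice") would raise AttributeError on a dict;
# mapping access retrieves the still-serialised value instead.
lattice = serialised.get("lattice")
```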
I tested it with a couple of workflows (including atomate2) but I am not sure if there are cases that I am not considering.
In addition, this will break the standard managers, which I suppose already expect the references to be deserialized when resolved. So there should be a `deserialize: bool` argument in the method to switch the deserialization on/off. If you think this is an acceptable solution I can open a PR for jobflow.
I think that these changes in jobflow-remote are beneficial independently of the centralised server approach, since they will also help reduce the `Runner`'s work by avoiding a lot of (de)serializations. However, this comes at the price of working with dictionaries instead of objects, and of having reimplemented some of jobflow's functionality. This can make the implementation a bit more fragile with respect to changes in jobflow. For this reason, before a final decision about this point is taken, I have put these latest changes in another branch (https://github.com/Matgenix/jobflow-remote/tree/references). Can you test it and check if this meets your requests?
Sounds good to me. I would say such an addition to jobflow would be reasonable with the `deserialize` switch argument. What do you think @utf?
Thanks so much for looking into this. I think adding a deserialise option sounds like a good solution and should work in most cases.
This won't work if the user tries to do something like `job.output.structure.volume`, which is a dynamically calculated property and will require having pymatgen installed to initialise the structure. In that case, there isn't any other option but to install pymatgen on the server.
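The problem with derived properties can be shown with a toy class standing in for pymatgen's `Structure` (the class and its fields are invented for illustration): the property is computed at runtime, so it never appears as a key in the serialised dict.

```python
# Why derived properties defeat dict-only access: "volume" below is computed
# by property code at runtime, so it is never stored in the serialised dict.
# ToyStructure is a hypothetical stand-in for pymatgen's Structure.

class ToyStructure:
    def __init__(self, a):
        self.a = a

    @property
    def volume(self):
        # derived on demand, never serialised
        return self.a ** 3

    def as_dict(self):
        return {"a": self.a}

s = ToyStructure(3.0)
d = s.as_dict()

live = s.volume           # works on the live object
stored = d.get("volume")  # None: only deserialisation can recover it
```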
To make things flexible (your example with `structure.volume` will not work if jobflow-remote always avoids deserialization), we should have a config option in jobflow-remote to say "I want to always serialize/deserialize objects (or not)". What should be the default? I would say to avoid deserialization, but I am not so sure. Consider when (in the future) some jobs (e.g. supercell generation, `refine_structure` or similar) are actually executed on the jobflow-remote server instead of being submitted to the queue of a cluster. Then you would actually need things to be deserialized.
Indeed, I did not think of derived properties. As @davidwaroquiers said, there is no way to tell beforehand whether the deserialization is needed or not. I see two possibilities:
I can propose a slightly different change to jobflow:
```diff
--- a/src/jobflow/core/reference.py
+++ b/src/jobflow/core/reference.py
@@ -170,18 +170,13 @@ class OutputReference(MSONable):
         data = cache[self.uuid][index]

         # decode objects before attribute access
-        data = MontyDecoder().process_decoded(data)

         # re-cache data in case other references need it
         cache[self.uuid][index] = data

         for attr_type, attr in self.attributes:
             # i means index else use attribute access
-            data = (
-                data[attr]
-                if attr_type == "i" or isinstance(data, dict)
-                else getattr(data, attr)
-            )
+            data = data[attr]

         return data
```
If it fails because of a `KeyError`, `TypeError` or `IndexError`, it will try to resolve the references with deserialization. Again, it is difficult to say whether this is beneficial or not: depending on how often this happens, it can be either a waste of time or an advantage for the Runner.
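A minimal sketch of that fallback strategy, assuming a generic `deserialise` callable standing in for the expensive `MontyDecoder` path (all names here are illustrative, not jobflow's actual implementation):

```python
# Hypothetical sketch of the try-then-fall-back resolution described above:
# first attempt plain item access on the raw serialised data, and only
# deserialise when that fails with the expected errors.

def resolve(data, attributes, deserialise):
    try:
        resolved = data
        for _attr_type, attr in attributes:
            resolved = resolved[attr]  # works for dict keys and list indices
        return resolved
    except (KeyError, TypeError, IndexError):
        # fall back to full deserialisation, e.g. for derived attributes
        obj = deserialise(data)
        for _attr_type, attr in attributes:
            obj = obj[attr] if isinstance(obj, dict) else getattr(obj, attr)
        return obj

# Toy deserialiser with a derived property, to exercise both branches
class _Toy:
    def __init__(self, d):
        self.a = d["a"]

    @property
    def volume(self):
        return self.a ** 3

fast = resolve({"a": 3.0}, [("i", "a")], _Toy)       # no deserialisation
slow = resolve({"a": 3.0}, [("a", "volume")], _Toy)  # falls back
```

The cost trade-off mentioned above shows up directly here: the second call pays for constructing the object, the first one does not.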
What do you think?
This can be closed as it was fixed in #47, right? @gpetretto
I don't think this is completely solved. Quoting from the PR:

> A machine only running the daemon still cannot avoid the presence of flows packages. https://github.com/materialsproject/jobflow/pull/512 needs to be merged before enabling it.
Some time has passed, but I think we never decided how to handle the case of derived properties. The point is that there are cases where it is mandatory to deserialize, otherwise the references cannot be resolved. The options should still be those mentioned above: https://github.com/Matgenix/jobflow-remote/issues/41#issuecomment-1857596692
My setup is as follows:
I thought I would only need to install atomate2 on the HPC remote and the local computer, since all code specific to atomate2 will only get executed on those machines. However, when trying to run a workflow in this configuration, I get an error on checkout.
Is this the intended behaviour?