Hi @JaGeo,
thanks for reporting this. We have not used jobflow-remote for such large calculations yet, but I was somewhat expecting that new issues would surface once it gets used for different use cases.
Can you be more explicit about where the issues happen? Where do the jobs fail? Do you have a stacktrace? Can you also share a simple script that would allow us to reproduce the problem?
Have you encountered similar problems? Would one need to adapt how atomate2 workflows are written or are there other possible solutions? Locally, in my setup, for example, the MongoDB database could be used directly, and a JSONStore would not be needed.
You mean that MongoDB is reachable from the cluster nodes, and thus getting inputs and inserting outputs directly from the Jobs is possible? While I think it could be technically possible, I am a bit reluctant to go down this road, since having things happen in different ways may lead to bugs and make the code more difficult to maintain and extend. Let us first see if there is anything we can do to optimize the usage of the Store.
I will make an example script to reproduce later on today.
@gpetretto , here it is.
The structure I uploaded is nonsense but should illustrate the problem, as it creates 24000 displacements (I therefore left out the optimization part of the phonon run). The supercell creation fails in jobflow-remote but not in jobflow. The errors reported via Slurm are memory errors.
Thanks! I will have a look at it. Just to be sure, when you say running with jobflow, you mean using run_locally or with fireworks?
run_locally with my default setup (MongoDB). In jobflow-remote, I used both a local worker and a remote worker for that part. Both get killed. On the local machine, however, this part finished when I used run_locally.
I assume fireworks should work as well, but I don't have a good MongoDB setup on the cluster where I wanted to run it.
Thanks. Actually, I could have deduced that from your script if I had looked before asking.
One potential issue with fireworks is that when you create the detours, a single FWAction containing the as_dict of all the detour Fireworks has to be inserted into a launches collection. This will very likely go beyond the 16MB limit. To handle these kinds of situations I had added an option so that if the action is too big it gets moved to GridFS: https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/core/launchpad.py#L1512. If I remember correctly, I had seen this being triggered with a few thousand "normal" structures. I would be curious to see if other problems arise with such a large number of very big structures in that case as well.
One problem is that I suppose run_locally never needs to (de)serialize the FWAction/jobflow's Response. Just calling as_dict on such a huge object could require a considerable amount of additional memory. If that proves to be an issue, there is not much we can do to avoid it, whether in FW or in jobflow-remote.
I will test and see if I can pinpoint the issue. One more question: how much memory was allocated on the node?
I tested it on one 850 GB node and it failed.
It's also not failing while creating the Response but while creating the structures, a step before. Of course, I could combine both steps.
I have done some tests on this workflow and (de)serialization in general.
As a first comment, I should correct what I said above concerning the execution of jobflow alone: even when running with a MemoryStore, jobflow still serializes the output by running jsanitize on the response of the job.
The second element is that I tried to check the timings of the generate_phonon_displacements job when running it with run_locally. When generating the 24000 structures on my laptop, most of the time seems to be spent running jsanitize (possibly multiple times) on this huge object. In my case the job was killed after an hour and a half due to memory limits.

At this point I think there are different problems with flows involving large Jobs that should be addressed in the different packages. I will start to highlight them here, but it may be worth opening issues in the respective repositories to check what would be possible.
Concerning jobflow, I think that it will always play badly with large data, because (de)serialization is expensive, especially the JSON one. Whether you are storing the data in a file or in a DB, converting all the objects to dictionaries and then transferring the data is going to take time and require a large amount of memory.
A minimal optimization would be to at least make sure that jsanitize is called only once on a dictionary (at the moment I think it is called at least twice: once in Job.run and a second time in JobStore.update). This will not improve the situation much, but it would at least avoid pointless work.
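To give a rough idea of this serialization cost, a small self-contained sketch like the following could be used to time jsanitize on many replicated cells (cell size and count are just placeholders, assuming pymatgen and monty are installed):

```python
# Rough, illustrative benchmark of jsanitize on many big cells.
import time

from monty.json import jsanitize
from pymatgen.core import Lattice, Structure

# Build a biggish cell and replicate it to mimic many displaced supercells.
base = Structure(Lattice.cubic(5.43), ["Si", "Si"], [[0, 0, 0], [0.25, 0.25, 0.25]])
supercell = base * (8, 8, 8)  # 1024 atoms
structures = [supercell.copy() for _ in range(100)]

start = time.perf_counter()
sanitized = jsanitize(structures, strict=True)  # converts MSONable objects to plain dicts
print(f"jsanitize on {len(structures)} cells took {time.perf_counter() - start:.1f} s")
```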
Aside from this, I suspect that obtaining further improvements would require ad hoc solutions, with other kinds of (de)serialization or changes in the logic. For example, big data might be stored directly in files that could be dumped to the file system and read in chunks. This would avoid loading all the data into memory at once, or allow access to only the required information.
For these specific kinds of workflows, involving relatively fast calculations and many big cells, it is possible that any workflow manager would add more overhead than it is worth. Considering the timings above, keep in mind that after generate_phonon_displacements the output has to be deserialized by run_phonon_displacements, which will return a response with 24000 new jobs, each of which has to be serialized and added to the DB. And for each of the 24000 jobs the inputs need to be deserialized and the outputs serialized to be inserted into the JobStore. Given that all these (de)serializations may take hours, I am wondering if it would not be much faster to convert the phonon flow for force fields into a single Job that runs all the calculations. I suppose this could be a bit disappointing, but it would avoid wasting computational resources and may even be faster in terms of time to result. While the use of jobflow-remote will definitely add more cost, the issues will still be there with any other manager (even run_locally, given the timings above), so such a discussion would be more suitable in the atomate2 repository.
Again, dealing with large data might require ad-hoc solutions. (see also https://github.com/materialsproject/atomate2/issues/515)
Coming to what can be done to improve the situation for jobflow-remote, it is definitely true that the JSONStore likely adds a lot of unnecessary overhead. However, the use of JSONStore is not a strict constraint. I have implemented a minimal version of the Store that avoids using mongomock and provides all the minimal required functionalities. From some tests I think this will give a much smaller overhead. I pushed the change here: https://github.com/Matgenix/jobflow-remote/tree/store
@JaGeo could you test it to see if this allows you to run the flow? Would you need a merge of this branch with the one for MFA?
The current implementation uses orjson, but from a few tests it seems that when deserializing a big JSON file the amount of memory used is larger than with other JSON libraries. However, since there is no strict need to interact with other software outside jobflow-remote, any compatible format should be fine. I made an implementation for msgpack, which could be more efficient. Other libraries like msgspec could also be considered. I made some tests with these as well, but I don't have a conclusive answer with respect to the memory and time required for very large files. Also, should very large files be the target? Or should we optimize for smaller file sizes? The choice of the (de)serialization library could be a project option that a user tunes depending on the kind of simulations that have to run.
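For illustration, a comparison along these lines could be used to check dump size and speed of the candidate backends (the payload is a placeholder, not the actual test that was run):

```python
# Illustrative comparison of two candidate serialization backends.
import time

import msgpack
import orjson

# Placeholder payload roughly mimicking many coordinate arrays.
payload = {"coords": [[i * 0.1, i * 0.2, i * 0.3] for i in range(200_000)]}

for name, dump in (("orjson", orjson.dumps), ("msgpack", msgpack.packb)):
    start = time.perf_counter()
    data = dump(payload)
    print(f"{name}: {len(data) / 1e6:.1f} MB in {time.perf_counter() - start:.2f} s")
```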
The biggest downside, though, remains that the Runner will still need to load all those data to write to and read from the final JobStore. While the Worker should have enough memory to handle the big data, this is likely not true for the machine hosting the Runner. And I don't think there is any way of avoiding this in the context of jobflow and JSON-like data.
One option already mentioned would be to allow the worker to directly interact with the JobStore, if accessible. However, I am afraid that this would have some problems of its own, for example with respect to jobs in the RUNNING state (although this was already true for atomate 1 and 2 with fireworks, so it is probably a minor issue).

On top of that, the limitations from standard jobflow listed above will still be there. So, I think that it would be worth addressing this first in jobflow and in the specific workflow. In any case I am not discarding this option.
During these tests I also realized one additional issue with the current version of jobflow-remote: if there is a datetime object in the output dictionary, it gets converted to a string during the dump to the JSONStore (and even with this minimal version). This likely needs to be addressed.
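For reference, a tiny sketch of the round-trip issue (assuming orjson is used for the dump; the field name is made up):

```python
# The datetime survives the dump as an ISO string, but is not restored on load.
import datetime

import orjson

doc = {"completed_at": datetime.datetime(2024, 1, 1, 12, 0)}
loaded = orjson.loads(orjson.dumps(doc))
print(type(loaded["completed_at"]))  # <class 'str'> instead of datetime.datetime
```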
First of all, thank you so much for looking into it! And also for the detailed answer!
Atomate2
For these specific kinds of workflows, involving relatively fast calculations and many big cells, it is possible that any workflow manager would add more overhead than it is worth. Considering the timings above, keep in mind that after generate_phonon_displacements the output has to be deserialized by run_phonon_displacements, which will return a response with 24000 new jobs, each of which has to be serialized and added to the DB. And for each of the 24000 jobs the inputs need to be deserialized and the outputs serialized to be inserted into the JobStore. Given that all these (de)serializations may take hours, I am wondering if it would not be much faster to convert the phonon flow for force fields into a single Job that runs all the calculations. I suppose this could be a bit disappointing, but it would avoid wasting computational resources and may even be faster in terms of time to result. While the use of jobflow-remote will definitely add more cost, the issues will still be there with any other manager (even run_locally, given the timings above), so such a discussion would be more suitable in the atomate2 repository. Again, dealing with large data might require ad-hoc solutions. (see also materialsproject/atomate2#515)
I agree that potentially just running the whole workflow in jobflow is faster. However, any kind of check-pointing would be missing and a restart in case of a failure would be really hard. I would therefore prefer to run the flow with a workflow manager.
Jobflow remote
Coming to what can be done to improve the situation for jobflow-remote, it is definitely true that the JSONStore likely adds a lot of unnecessary overhead. However, the use of JSONStore is not a strict constraint. I have implemented a minimal version of the Store that avoids using mongomock and provides all the minimal required functionalities. From some tests I think this will give a much smaller overhead. I pushed the change here: https://github.com/Matgenix/jobflow-remote/tree/store @JaGeo could you test it to see if this allows you to run the flow? Would you need a merge of this branch with the one for MFA?
Thank you so much, @gpetretto. Yes, I would need a merge of this branch with the one for MFA. Thanks in advance! I would love to test this.
Short additional question that is related, @gpetretto:
Is there currently a way to have a different worker for the children of a job as part of a Response object, e.g. for the run_phonon_displacements job and its static forcefield run children?
I agree that potentially just running the whole workflow in jobflow is faster. However, any kind of check-pointing would be missing and a restart in case of a failure would be really hard. I would therefore prefer to run the flow with a workflow manager.
I see your point. However, I would also suggest making a few tests. If, exaggerating, the multi-Job flow required 10x the time of the single-Job flow, it would still be way more convenient to just rerun the whole workflow for the cases that fail. This could be true even for a much lower speedup.
If you want to preserve the checkpointing, a smaller convenient change may be to merge generate_phonon_displacements and run_phonon_displacements into a single Job. This should avoid a whole serialization+deserialization of the 24000 structures, which will probably save a couple of hours.
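Just to sketch the idea (generate_displaced_structures and static_job are made-up placeholders, not the actual atomate2 makers):

```python
# Hypothetical sketch: generate the displacements and spawn the static jobs in a
# single Job, so the 24000 structures are never serialized as an intermediate output.
from jobflow import Flow, Response, job


@job
def generate_and_run_displacements(structure, supercell_matrix):
    # Placeholders for the actual displacement generation and force-field static makers.
    displaced = generate_displaced_structures(structure, supercell_matrix)
    static_jobs = [static_job(s) for s in displaced]
    return Response(replace=Flow(static_jobs, output=[j.output for j in static_jobs]))
```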
Another idea (that would require larger changes) would be to define a specific kind of file output for this use case, where you have a single file with all the structures. For example, the pymatgen Structure has an as_dataframe method, and the list of dataframes could be dumped to an HDF5 file. From a quick test, for many big cells the whole procedure seems to be more than 10x faster than serializing to JSON. This is probably not the way to go and there are likely several issues that I have not considered, but it is just to give the idea that things can be made way faster at the cost of additional implementation.
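A possible sketch of that idea (not a tested design), assuming pymatgen and pandas with PyTables are installed; the file name and keys are illustrative:

```python
# Dump a list of structures to a single HDF5 file via as_dataframe().
import pandas as pd
from pymatgen.core import Lattice, Structure

structures = [
    Structure(Lattice.cubic(5.43), ["Si", "Si"], [[0, 0, 0], [0.25, 0.25, 0.25]])
    for _ in range(10)
]

with pd.HDFStore("displacements.h5", mode="w") as store:
    for i, structure in enumerate(structures):
        df = structure.as_dataframe()  # one row per site: species and coordinates
        df["Species"] = df["Species"].astype(str)  # keep HDF5-friendly dtypes
        store.put(f"structure_{i}", df)
```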
Thank you so much, @gpetretto. Yes, I would need a merge of this branch with the one for MFA. Thanks in advance! I would love to test this.
Here is the branch: https://github.com/Matgenix/jobflow-remote/tree/store-interactive Thanks for the testing!
Short additional question that is related, @gpetretto: Is there currently a way to have a different worker for the children of a job as part of a Response object, e.g. for the run_phonon_displacements job and its static forcefield run children?
set_run_config calls jobflow's update_config. This supports delayed updates: https://github.com/materialsproject/jobflow/blob/34e3d96ceef39693c1e2ff61b52543db95de4e8d/src/jobflow/core/job.py#L1042
If a dynamically generated job matches the name filter of the set_run_config called on the Flow at creation time, the configuration should also be applied to it. Or at least it should be. I definitely tested this, but not extensively.
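A minimal sketch of that behaviour (assuming set_run_config and submit_flow can be imported from jobflow_remote with name_filter/worker keyword arguments, and that the worker names exist in the project configuration):

```python
# The dynamically generated "child" jobs match the name filter and should therefore
# run on the dedicated worker, while the parent job uses the default one.
from jobflow import Flow, Response, job
from jobflow_remote import set_run_config, submit_flow


@job
def child(i):
    return i * 2


@job
def make_children(n):
    jobs = [child(i) for i in range(n)]
    return Response(replace=Flow(jobs, output=[j.output for j in jobs]))


flow = Flow([make_children(3)])
flow = set_run_config(flow, name_filter="child", worker="big_mem_worker")
submit_flow(flow, worker="default_worker")
```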
Thank you! I will try to do it by the end of the week! I also hope, of course, that these issues help with making jobflow-remote even better and extending its use cases!
set_run_config calls jobflow's update_config. This supports delayed updates: https://github.com/materialsproject/jobflow/blob/34e3d96ceef39693c1e2ff61b52543db95de4e8d/src/jobflow/core/job.py#L1042 If a dynamically generated job matches the name filter of the set_run_config called on the Flow at creation time, the configuration should also be applied to it. Or at least it should be. I definitely tested this, but not extensively.
Just to add: this only works in one direction. If I say that run_phonon_displacements should get the standard worker, I can set the worker of the children differently, but not the other way around. Currently, there is no way to switch off the "dynamic" inheritance of the worker in jobflow_remote. It exists in jobflow, though.
Maybe an additional flag could help. In any case, I could solve my issue now and will install the new jobflow-remote version for the test ;)
I tried it, but it still needs more memory (it did not kill the server, but I needed more than 80% of the memory) and the file-writing process takes very long. It did not finish overnight. Finding a server where we can simply run the jobflow version for multiple weeks is likely still the better option.
Thanks a lot for testing this. I am sorry this did not solve your problem. While I expected it to take longer and require more memory, it seemed to have a lower overhead in my tests. Of course, I had to use a smaller test case, so maybe it scales badly at even larger sizes. I will try to do some more tests and optimization. I may ask you to test again if I manage to get something better. However, this overhead will remain an intrinsic limitation of jobflow-remote.
Finding a server where we can simply run the jobflow version for multiple weeks is likely still the better option.
Do you mean running run_locally in a single cluster job? I know I am probably getting annoying, but in that case you will still lose the checkpointing of the workflow while keeping all of its overhead. I suppose you already have one, but just in case I can share a small function that, given a pymatgen structure, an ASE calculator and a supercell size, will return a phonopy Phonon object. I don't think it will take more than 48 hours to run.
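Not the exact function mentioned above, but a minimal sketch of what it could look like (assuming phonopy, pymatgen and ASE are installed; the displacement distance and conversion details are illustrative):

```python
# Sketch: run all displacements in-process and return a Phonopy object.
import numpy as np
from ase import Atoms
from phonopy import Phonopy
from pymatgen.io.phonopy import get_phonopy_structure


def phonopy_from_forcefield(structure, calculator, supercell_matrix):
    """Compute force constants for a pymatgen structure with an ASE calculator."""
    phonon = Phonopy(get_phonopy_structure(structure), supercell_matrix=supercell_matrix)
    phonon.generate_displacements(distance=0.01)

    force_sets = []
    for supercell in phonon.supercells_with_displacements:
        atoms = Atoms(
            symbols=supercell.symbols,
            cell=supercell.cell,
            scaled_positions=supercell.scaled_positions,
            pbc=True,
        )
        atoms.calc = calculator
        force_sets.append(atoms.get_forces())

    phonon.forces = np.array(force_sets)
    phonon.produce_force_constants()
    return phonon
```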
An additional downside of the current approach in atomate2 is that the 24000 structures will occupy a considerable amount of space in the output Store. There will be the whole output of generate_phonon_displacements, plus 24000 output documents. Running many of these workflows is likely to quickly become an issue in terms of occupied space. Also, there would not be much benefit in storing each single output individually, because regenerating the list of structures or the full output for a specific Structure would have a negligible cost.
Thanks @gpetretto. We are using a GAP model for our calculations. Thus, it will take slightly longer, but yeah, that might be an approach to take.
Hi all,
I have been playing around with jobflow-remote for the last weeks now. I really like it.
One thing, however, that led to some issues is the following: I am trying to create a huge number of structures within the phonon workflow (>>4000 atoms, no symmetry, thus 24,000 structures). Of course, one could do MD; however, similar things could happen if one aims to calculate thermal conductivity. When I run this workflow with jobflow-remote, the jobs fail locally and remotely due to memory issues. There are no issues if I just run the same job in jobflow. Thus, I assume that the JSONStore and the saving process might lead to problems.
Have you encountered similar problems? Would one need to adapt how atomate2 workflows are written or are there other possible solutions? Locally, in my setup, for example, the MongoDB database could be used directly, and a JSONStore would not be needed.