tomdemeyere opened this issue 3 months ago
For the first point, it seems that this does not currently work correctly, as the environment variable is retrieved in settings.py, which is probably done before the job is launched (https://github.com/Quantum-Accelerators/quacc/issues/2411).
This one seems relatively trivial to fix if we can understand what the problem is. Quacc does correctly pick up on changes to PARSL_MPI_PREFIX, so it's not clear to me what the issue is you're encountering. Presumably, if you launch the Python script as a Slurm job and export PARSL_MPI_PREFIX before you execute the Python script, it should pick up on it just fine.
In [1]: import os
In [2]: os.environ["PARSL_MPI_PREFIX"]="cow"
In [3]: import quacc
In [4]: from quacc import QuaccSettings
In [5]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[5]: ('cow', '')
In [6]: os.environ["PARSL_MPI_PREFIX"]="sheep"
In [7]: QuaccSettings().ESPRESSO_PARALLEL_CMD
Out[7]: ('sheep', '')
For the second point, I think the Parsl checkpointing currently does not work with Quacc, as Parsl keeps complaining about the Atoms object not being hashable (to check).
This is not too surprising, as classes are a bit more complex to hash.
As noted in the Parsl docs: By default Parsl can compute sensible hashes for basic data types: str, int, float, None, as well as some more complex types: functions, and dictionaries and lists containing hashable types. Attempting to cache apps invoked with other, non-hashable, data types will lead to an exception at invocation. In that case, mechanisms to hash new types can be registered by a program by implementing the parsl.dataflow.memoization.id_for_memo function for the new type.
So, if we want to help the user out here, we need to supply a custom hashing function that users can rely on. We already have a hashing algorithm implemented for Atoms objects: it's quacc.atoms.core.get_atoms_id.
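If someone wants to try this, a minimal sketch of the registration might look like the following (untested; it assumes Parsl's singledispatch-based parsl.dataflow.memoization.id_for_memo hook and that get_atoms_id returns a string):

```python
from ase import Atoms
from parsl.dataflow.memoization import id_for_memo

from quacc.atoms.core import get_atoms_id


@id_for_memo.register(Atoms)
def _id_for_memo_atoms(atoms: Atoms, output_ref: bool = False) -> bytes:
    # id_for_memo implementations return bytes that uniquely identify the
    # object, which Parsl then folds into the app's memoization hash.
    return get_atoms_id(atoms).encode("utf-8")
```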
If possible, it should be possible to ask for retries in case of job failures. Ideally, there should be various options depending on the nature of the error, and it should be possible to customize the input parameters for retries (e.g. my meta-GGA calculation failed, so I lower the mixing_beta, etc.).
In the context of Parsl or any other workflow engine, this seems like something that would be best addressed as an improvement to the retry handlers there.
Independent of Parsl, we won't be implementing retry handling in quacc itself. This is already handled beautifully by a pre-existing dependency called Custodian. Custodian is currently used for VASP and Q-Chem runs. It supports both pre-made and custom error handlers and is quite robust. Users interested in implementing generic retry handlers for a specific code are best off implementing those features in Custodian and having the calculator call Custodian.
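For context, a typical Custodian setup looks roughly like the following (a sketch along the lines of the Custodian documentation; the specific handlers, job, and launch command are placeholders):

```python
from custodian.custodian import Custodian
from custodian.vasp.handlers import UnconvergedErrorHandler, VaspErrorHandler
from custodian.vasp.jobs import VaspJob

# Handlers scan the output for known error signatures and apply corrections;
# Custodian then reruns the job, up to max_errors corrections in total.
handlers = [VaspErrorHandler(), UnconvergedErrorHandler()]
jobs = [VaspJob(vasp_cmd=["mpirun", "vasp_std"])]

c = Custodian(handlers, jobs, max_errors=5)
c.run()
```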
This one seems relatively trivial to fix if we can understand what the problem is. Quacc does correctly pick up on changes to PARSL_MPI_PREFIX, so it's not clear to me what the issue is you're encountering. Presumably, if you launch the Python script as a Slurm job and export PARSL_MPI_PREFIX before you execute the Python script, it should pick up on it just fine.
I think it might be a little bit more complicated than that. Parsl needs to be able to handle cases where different apps have different resources, so it seems to me that the env variable is defined "per function" somewhat. I will investigate when I have more time. Note that in your example you defined the environment variable before importing Quacc, while right now it definitely happens after (when the function is called). That might be it?
As noted in the Parsl docs: By default Parsl can compute sensible hashes for basic data types: str, int, float, None, as well as some more complex types: functions and dictionaries and lists containing hashable types. Attempting to cache apps invoked with other, non-hashable, data types will lead to an exception at invocation. In that case, mechanisms to hash new types can be registered by a program by implementing the parsl.dataflow.memoization.id_for_memo function for the new type.
Yes, I think that's pretty much it. Do you think this has a place in the Quacc documentation? Can this be done automatically when Quacc starts (if Parsl is the workflow engine, of course)?
Independent of Parsl, we won't be implementing retry handling in quacc itself.
Yep, that's exactly why I opened this issue, so that we can look at what features are missing for each workflow engine (Parsl here) and improve these areas without the need to modify Quacc too much (improve the documentation maybe?). The reason for that is: we are trying to use Quacc/Parsl for a project with collaborators and we ran into these questions, which I think are important. If you feel like I am overstepping please feel free to tell me/close this issue.
I think it might be a little bit more complicated than that. Parsl needs to be able to handle cases where different apps have different resources, so it seems to me that the env variable is defined "per function" somewhat.
I see. We already have support for supplying parsl_resource_specification via the decorator, but I see that this kind of flexibility is not possible for the Parsl MPI prefix. There must be a misunderstanding about intended uses with the Parsl team, because it does not make sense to rely on an environment variable (PARSL_MPI_PREFIX) that changes on a per-app basis. Perhaps I'm missing something. I guess they just assume that all the apps are running the same code in the same parallelized way? That seems like a potential oversight worth bringing up with them.
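For reference, the per-app mechanism on the Parsl side looks roughly like the following (a sketch based on the Parsl MPI-apps documentation; it assumes an MPI-enabled HighThroughputExecutor, and the Quantum ESPRESSO command line is just an example):

```python
from parsl import bash_app


@bash_app
def run_pw(parsl_resource_specification={}, stdout="pw.out", stderr="pw.err"):
    # Parsl sets PARSL_MPI_PREFIX in the app's environment (e.g. an srun or
    # mpiexec invocation) based on the resource specification passed at call time.
    return "$PARSL_MPI_PREFIX pw.x -in espresso.pwi"


# The MPI prefix is therefore defined per app invocation, not globally:
future = run_pw(parsl_resource_specification={"num_nodes": 2, "ranks_per_node": 8})
```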
Note that in your example you defined the environment variable before importing Quacc, while right now it definitely happens after (when the function is called). That might be it?
That also seems to work okay when I test it out.
Yes, I think that's pretty much it. Do you think this has a place in the Quacc documentation? Can this be done automatically when Quacc starts (if Parsl is the workflow engine, of course)?
Definitely, I think this should be provided with the Parsl usage instructions. I don't actually know how the memo function works with Parsl, but once someone figures it out and confirms it works as expected, the details should definitely be added.
As for whether it can be done automatically: I try to set things up automatically when the workflow engine is selected, but if this is something that must be supplied when you instantiate a Parsl config, that might need to be left to the end user (but still documented). In other words, if we can automate it, absolutely, but I don't know if we can.
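For reference, checkpointing itself is enabled when the Parsl Config is instantiated, which is part of why this may have to stay in the user's hands (a sketch based on the Parsl checkpointing docs; the executor choice is arbitrary):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.utils import get_all_checkpoints

config = Config(
    executors=[HighThroughputExecutor()],
    # Write a checkpoint whenever a task finishes and reuse any existing ones
    checkpoint_mode="task_exit",
    checkpoint_files=get_all_checkpoints(),
)
```

Note that, per the Parsl docs, checkpointing only applies to apps that have app caching enabled, so that also needs to be turned on for the relevant apps.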
Yep, that's exactly why I opened this issue, so that we can look at what features are missing for each workflow engine (Parsl here) and improve these areas without the need to modify Quacc too much (improve the documentation maybe?). The reason for that is: we are trying to use Quacc/Parsl for a project with collaborators and we ran into these questions, which I think are important. If you feel like I am overstepping please feel free to tell me/close this issue.
Yes, this is all very fair. And I appreciate the comments! You are not overstepping.
Personally, I think improved retry handling is something that should be done on Parsl's side (i.e. make it easier to inject logic into the retry handler). There is technically a way to do this right now, but it's hacky. The suggestion from the Parsl team is to take advantage of the retry_handler keyword argument in Config and supply a custom function that introduces some "smart" logic for when/how to correct things. Seems very messy and prone to lots of problems though.
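For completeness, the hacky route looks roughly like the following (a sketch based on the Parsl Config documentation; the cost values and the choice of exception type are arbitrary):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor


def retry_handler(exception: Exception, task_record: dict) -> float:
    # The returned value is the "cost" added to the task's failure count:
    # a small cost lets the task be retried many times within the `retries`
    # budget, while a cost >= retries stops further attempts quickly.
    if isinstance(exception, TimeoutError):
        return 0.1
    return 1.0


config = Config(
    executors=[HighThroughputExecutor()],
    retries=2,
    retry_handler=retry_handler,
)
```

Crucially, this only decides whether to rerun the same task; there is no clean hook for modifying the task's inputs between attempts, which seems to be the main limitation being discussed here.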
What would you like to report?
Here is a list of requirements that (in my opinion) should be checked to ensure that the QuAcc/Parsl combo is production-ready.
[ ] Use the full capabilities of the software and the HPC: run flexible parallel commands without additional complexity. For now, this is done by providing the parsl_resource_specification parameter in the job decorator. This gives access to the PARSL_MPI_PREFIX environment variable that should be used to launch the job. This is critical because, without it, jobs are run fully relying on the good faith of the task executor (srun mainly) to correctly dispatch jobs to the nodes. For example, this does not work on the MMM Young cluster, which uses mpirun.
[ ] HPC time limitations should not be a burden: for this, Parsl checkpoints seem like a natural solution. They allow users to keep results in the cache and to restart a workflow only where it stopped.
[ ] If possible, it should be possible to ask for retries in case of job failures. Ideally, there should be various options depending on the nature of the error, and it should be possible to customize the input parameters for retries (e.g. my meta-GGA calculation failed, so I lower the mixing_beta, etc.).

For the first point, it seems that this does not currently work correctly, as the environment variable is retrieved in settings.py, which is probably done before the job is launched (https://github.com/Quantum-Accelerators/quacc/issues/2411).

For the second point, I think the Parsl checkpointing currently does not work with Quacc, as Parsl keeps complaining about the Atoms object not being hashable (to check).
For the third point, Parsl retry handling does not allow specifying new parameters. This makes it generally ill-suited for computational chemistry, as most users will not care much about blindly repeating the same calculations. Some Parsl team members suggested that, conceptually, if one wants to change the parameters, they should instead launch a new job; this approach will definitely be possible with built-in try-except statements after (https://github.com/Quantum-Accelerators/quacc/pull/2410), although it is somewhat limited.
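As a rough illustration of the try-except route (purely a sketch: it assumes a quacc Espresso recipe such as quacc.recipes.espresso.core.static_job that accepts input_data, that the exception actually propagates at this point in the script rather than inside a future, and that mixing_beta is the parameter to relax on failure):

```python
from ase.build import bulk

from quacc.recipes.espresso.core import static_job

atoms = bulk("Cu")

try:
    # First attempt with the "normal" settings
    result = static_job(atoms, input_data={"electrons": {"mixing_beta": 0.7}})
except Exception:
    # Relaunch the same calculation with a more conservative mixing parameter
    result = static_job(atoms, input_data={"electrons": {"mixing_beta": 0.3}})
```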
Additionally, when reading the QuAcc documentation, it is a little bit confusing which of this functionality is usable with QuAcc.
I will work on this when I have more time to do what I like (end of September 😅), if I still have access to some HPCs.