Quantum-Accelerators / quacc

quacc is a flexible platform for computational materials science and quantum chemistry that is built for the big data era.
https://quantum-accelerators.github.io/quacc/
BSD 3-Clause "New" or "Revised" License

[feature request] be able to perform continuation with Quacc #2399

Closed tomdemeyere closed 1 month ago

tomdemeyere commented 1 month ago

What new feature would you like to see?

We want to perform a simple workflow with Quacc:

It seems easy, but in fact there are plenty of things that could go wrong here. Computational chemistry is complex, and very often things will not go as expected: calculations will crash, run out of time, or simply won't converge.

In this case, the above workflow gets much more complicated when using Quacc. Users who want to be proactive can put all kinds of checks in place (try/except blocks, for instance), but with workflow engines this is often not ideal. The easiest route to avoid the issue, then, seems to be pre-filtering which calculations you want to run based on which ones are already finished.

However, this route is a little bit complicated still:

What would be ideal is that for a given project you have a given database/results folder, and then for each calculation with a given label, Quacc would check whether this calculation was already done and converged, either from the database or from the "quacc_results.json" file. Settings or keywords might manage this behavior? Other things can be implemented to keep flexibility as well.

Maybe I am mistaken here and that's not what Quacc was made for? For our practical case, though, it is really problematic, and we often end up spending a lot of time looking through Quacc's folders.

@Nekkrad @julianholland @BCAyers2000

Andrew-S-Rosen commented 1 month ago

Thank you for the note, @tomdemeyere and colleagues.

First, let me start with an unhelpful comment: most of your concerns are actually about the underlying workflow management system and not necessarily about quacc itself. In short, you are hitting pain points of Parsl. If the workflow engine were constructed well, it should be able to handle "calculations will crash, will run out of time, or simply won't converge". For instance, FireWorks/Covalent/Prefect allow you to redispatch jobs (and some have continuation logic), but Parsl does not make it easy. Other issues like caching of results so jobs aren't rerun are also done on the workflow management level. This is the problem with workflow engines --- we have hundreds of them, and they all are mediocre in various ways.

In spite of the above comment, I still think there is significant room to have quacc help the user out, and I would welcome any improvements in this regard.

There is no way to systematically label calculations in Quacc; some jobs allow passing "additional_fields", but not all of them. As a result, finding which calculation is which is not easy since you don't have label criteria. You can use atoms.info, but that's not ideal.

I agree this is an area for improvement. Many workflow tools, like FireWorks and Covalent, allow the user to add metadata to their calculations before executing them. This makes it trivial to link up to a given result that is stored. Parsl or Dask, on the other hand, do not do such a great job here. The additional_fields was sort of a way to get around this, but it's also not perfect.

I don't necessarily have a suggestion here but agree there is more to do.

Finding which calculation crashed for what reason in the sea of "quacc-xxxxx-...." folders is less than ideal. I strongly believe that the reason for the crash should be somewhere in the results dict.

Dashboards are a core part of many workflow engines and can help in this regard. Still, there is room for improvement within quacc itself for cases where a dashboard is not viable.

The sea of quacc folders will hopefully become easier to manage with https://github.com/Quantum-Accelerators/quacc/pull/2296, which is not stale despite the lack of movement you see there on the PR. The schema can only be generated, however, if the calculation finishes. The results dictionary is not really the place to store task metadata if the job fails.

So, this begs the question: where should we store that data if the user isn't using a workflow engine that can handle task metadata well? Again, this is something that a workflow engine should be providing for the user, but if we want to have a redundant solution that is fine. Covalent, FireWorks, Redun, and others make it clear to the user where things are failing and why. Parsl does not. Technically, shouldn't the errors be logged to stderr though via the terminate() function? That should ideally make it clear which calculations have failed. It won't include why the job failed, but we would have that info stored if someone creates an MR for ASE that addresses https://gitlab.com/ase/ase/-/issues/1391.

What would be ideal is that for a given project you have a given database/results folder

This can already be done by swapping out RESULTS_DIR for a specific project.
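For illustration, a minimal sketch of that per-project setup; it assumes quacc's QUACC_-prefixed environment variables are used to override settings (the settings file or in-Python configuration would work just as well):

```python
# Minimal sketch: point quacc's RESULTS_DIR at a per-project folder.
# Assumes the QUACC_RESULTS_DIR environment variable maps onto the RESULTS_DIR
# setting; it should be set before quacc is imported so it gets picked up.
import os
from pathlib import Path

project_dir = Path("~/projects/slab_study/results").expanduser()
project_dir.mkdir(parents=True, exist_ok=True)
os.environ["QUACC_RESULTS_DIR"] = str(project_dir)

import quacc  # noqa: E402  (imported only after the environment variable is set)
```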

and then for each calculation with a given label, Quacc would check whether this calculation was already done and converged, either from the database or from the "quacc_results.json" file. Settings or keywords might manage this behavior? Other things can be implemented to keep flexibility as well.

This is getting into workflow management territory. Quacc has no knowledge about a database on its own. What you are describing is similar to caching, which many workflow tools support (e.g. https://parsl.readthedocs.io/en/1.0.0/userguide/app_caching.html).
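For reference, a small generic Parsl example of what that app-level caching looks like (plain Parsl, not a quacc API):

```python
# Generic Parsl app caching: identical (function, arguments) calls reuse the
# memoized result within a run (checkpointing extends this across runs).
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

parsl.load(Config(executors=[ThreadPoolExecutor(label="local")]))


@python_app(cache=True)
def expensive_step(x):
    return x**2


print(expensive_step(3).result())  # computed
print(expensive_step(3).result())  # served from Parsl's app cache, not recomputed
```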

Maybe I am mistaken here and that's not what Quacc was made for? For our practical case, though, it is really problematic, and we often end up spending a lot of time looking through Quacc's folders.

All of these pain points are pain points of Parsl, which does a good job at launching tasks but not managing them. That's not to say any of the other supported workflow tools are better by the way. They all are painful in their own special ways, which is why I couldn't settle with just supporting one of them. Again, I'm happy to add features to quacc to address the limitations of the workflow tools themselves though.

Sorry. I understand that this is not a "fix" for your issues. The issues you bring up are important but are also non-trivial and require dedicated work from an individual to address. Sadly, that isn't going to be me at this very point in time, but I am open to addressing your concerns.

tomdemeyere commented 1 month ago

Thanks for your answer @Andrew-S-Rosen. I must say that I understand your need for balance between the work of QuAcc and the work that should be done by workflow engines. Let me try to push some ideas before giving up:

First, let me start with an unhelpful comment: most of your concerns are actually about the underlying workflow management system and not necessarily about quacc itself. In short, you are hitting pain points of Parsl. If the workflow engine were constructed well, it should be able to handle "calculations will crash, will run out of time, or simply won't converge". For instance, FireWorks/Covalent/Prefect allow you to redispatch jobs (and some have continuation logic), but Parsl does not make it easy. Other issues like caching of results so jobs aren't rerun are also done on the workflow management level. This is the problem with workflow engines --- we have hundreds of them, and they all are mediocre in various ways.

While QuAcc shouldn't shoulder the entire burden of workflow management, there are indeed some features that could enhance its functionality without overstepping its primary role. Here are some suggestions for improvements:

  1. Job annotation: Implement a consistent additional_fields parameter across all recipes. This would allow users to add custom metadata, such as unique slab names based on Miller indices and terminations (that's our case), which would then be stored in the result dictionary. This would greatly improve traceability and organisation of calculations.

  2. Error handling: Enhance QuAcc error management by catching errors from the ASE check_call function. The caught error information could be written to the database and/or in the "quacc_results.json" file, along with any other available data. Subsequently, QuAcc could raise a custom QuaccException that includes essential information like the working directory location. This would enable users to implement more sophisticated try/except statements in their workflows.

  3. Restart mode: Introduce a "RESTART_MODE" option that, when activated, would skip completed calculations and re-launch failed ones. This feature could operate by checking the database or "quacc_results.json" files. To identify which calculation is which, users could either provide a custom identifier string or QuAcc could generate a unique hash based on the function parameters. If a matching identifier is found in the database or files, the job would be either re-launched/continued (if previously failed) or skipped (if previously successful). (A rough sketch of this idea follows this list.)
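A purely hypothetical sketch of the identifier-based check from suggestion 3; neither a RESTART_MODE setting nor this quacc_results.json layout exists in QuAcc today, and the field names are made up:

```python
# Hypothetical sketch of the "RESTART_MODE" idea: build a deterministic
# identifier from the recipe name and its parameters, then skip or re-run
# based on what a previous quacc_results.json reports.
import hashlib
import json
from pathlib import Path


def job_id(recipe_name: str, params: dict) -> str:
    """Deterministic hash of a recipe call; assumes params are JSON-serializable."""
    payload = json.dumps({"recipe": recipe_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def already_converged(results_dir: Path, identifier: str) -> bool:
    """Look for a prior quacc_results.json tagged with this identifier (hypothetical keys)."""
    for results_file in results_dir.glob("**/quacc_results.json"):
        data = json.loads(results_file.read_text())
        if data.get("identifier") == identifier and data.get("converged", False):
            return True
    return False


uid = job_id("relax_job", {"slab": "Al(111)", "kpts": [4, 4, 1]})
if not already_converged(Path("results"), uid):
    ...  # dispatch the calculation, storing `uid` as metadata (see suggestion 1)
```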

But as you said, one problem is the potential overlap of these features with ones that already exist in some workflow engines. Normally there should be no need for such functionalities (especially the last one, which I agree should be taken care of by the workflow engine). But at the same time, I will come back to something you wrote:

They all are painful in their own special ways

This is very true, and after testing some of them I can say that there is always some little detail that will make a given engine unusable in a particular situation, due to a mix of technical constraints and project-specific needs. If QuAcc had the above functionalities built in, this would help alleviate the dependency on workflow-engine-specific features.

I believe the first two suggestions (job annotation and error handling) are relatively neutral additions that would enhance QuAcc's usability without significantly overstepping into workflow engine territory. While the restart mode feature might be more contentious, it could serve as a fallback option for users dealing with workflow engines that lack robust continuation logic (or when running QuAcc without a workflow engine, which is my main usage for now).

EDIT:

The results dictionary is not really the place to store task metadata if the job fails

I am not sure I agree with that; sometimes the job failing is an important part of the process. If you pick a mixing scheme and want to test its robustness for high-throughput, some jobs will surely fail, and that is an important result. Often, a job might return an error to check_call but there are still results to be treated as well. I am thinking about the case of MD, where the job might time out, but that does not mean the results are not usable.

Andrew-S-Rosen commented 1 month ago

@tomdemeyere: Thank you for your comments. Could you please do me a favor and open three specific issues, one for each topic? You can simply copy/paste your examples here but please link back to this issue. You can then close this "meta-issue". That will help ensure things hopefully get addressed.

I must say that I understand your need for balance between the work of QuAcc and the work that should be done by workflow engines. Let me try to push some ideas before giving up:

I am all ears. No need to worry about giving up. My goal at this point is to get additional clarity so we can brainstorm how to best proceed.

Job annotation: Implement a consistent additional_fields parameter across all recipes. This would allow users to add custom metadata, such as unique slab names based on Miller indices and terminations (that's our case), which would then be stored in the result dictionary. This would greatly improve traceability and organisation of calculations.

Yes, agreed. This was actually the original intention but never made its way into main. This should be done in some way. I also resorted to atoms.info at some point, but this is dangerous because there's no guarantee that the info remains when the Atoms object is mutated.

Error handling: Enhance QuAcc error management by catching errors from the ASE check_call function.... Subsequently, QuAcc could raise a custom QuaccException that includes essential information like the working directory location. This would enable users to implement more sophisticated try/except statements in their workflows.

I think the error should already be captured here and is logged with its directory. Is there another error you're referring to? It should also be clear which jobs failed --- they are written out to directories with failed- in front. I'm trying to understand a bit more what is missing here.

The caught error information could be written to the database and/or in the "quacc_results.json" file, along with any other available data.

This is actually a separate feature request: write out quacc_results.json for every job regardless of success or failure. This is interesting to consider. First, it's worth noting that it can't always be done no matter how hard we try. After all, if you submit a job regularly to the Slurm queue and it times out, there's nothing we can do once timeout is reached. But ignoring that aspect, right now the way things work is that once ASE says a job fails, we terminate before a schema is even attempted to be made. To do what you're suggesting, we would need to try to prepare a schema, write it to disk, and then terminate. Output files will also not always be able to be parsed (e.g. if the calculator crashes instantly), and this would cause the schema generation to crash.
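To make that ordering concrete, here is a hypothetical sketch of the "prepare something, write it to disk, then terminate" flow; this is not quacc's current control flow, and the keys in the partial record are made up:

```python
# Hypothetical wrapper: on failure, dump whatever metadata is available to a
# partial quacc_results.json before re-raising so the workflow engine still
# sees the error.
import json
import traceback
from pathlib import Path


def run_and_record(run_calculation, workdir: Path):
    """Run a calculation callable; on failure, write a partial record, then re-raise."""
    try:
        return run_calculation()
    except Exception as exc:
        partial = {
            "success": False,
            "error": repr(exc),
            "traceback": traceback.format_exc(),
            "dir_name": str(workdir),
        }
        (workdir / "quacc_results.json").write_text(json.dumps(partial, indent=2))
        raise
```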

There is likely going to be an easier way to do what you want. Perhaps the solution you're looking for is to write out the info not just to the logger but also to disk in that calculation's failed- directory? That won't be particularly useful until ASE captures stderr for us though.

Restart mode: Introduce a "RESTART_MODE" option that, when activated, would skip completed calculations and re-launch failed ones. This feature could operate by checking the database or "quacc_results.json" files. To identify which calculation is which, users could either provide a custom identifier string or QuAcc could generate a unique hash based on the function parameters. If a matching identifier is found in the database or files, the job would be either re-launched/continued (if previously failed) or skipped (if previously successful).

Indeed, this does have overlap with things like the retry handlers in Parsl. Most workflow engines have something along these lines, but I agree it's not always intuitive or useful as-implemented. I don't know how easy it would be to do this in quacc (we would have to serialize/deserialize functions with pickle, which gets messy...), but I'll keep this one in mind. We also can't assume that the user is running with a database. This is one I'm open to, but I admittedly don't know how to go about it nicely offhand.

To me, this one seems more like workflow engine territory. If we provide the user with enough metadata to know which jobs have failed, they can hopefully write their own scripts to rerun what is needed.
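As a concrete example of such a user-side script, a minimal sketch that collects the failed- directories mentioned above; the resubmission step is left as a placeholder since it depends on the recipes and workflow engine in use:

```python
# Minimal sketch: find quacc job directories whose names start with "failed-"
# (the prefix quacc uses for failed jobs) and collect them for resubmission.
from pathlib import Path

results_dir = Path("results")  # wherever RESULTS_DIR points for this project
failed_dirs = sorted(
    d for d in results_dir.iterdir() if d.is_dir() and d.name.startswith("failed-")
)

for d in failed_dirs:
    print(f"Failed job directory: {d}")
    # e.g. re-read the inputs stored in `d` and re-dispatch the corresponding recipe
```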

I am not sure I agree with that; sometimes the job failing is an important part of the process. If you pick a mixing scheme and want to test its robustness for high-throughput, some jobs will surely fail, and that is an important result. Often, a job might return an error to check_call but there are still results to be treated as well. I am thinking about the case of MD, where the job might time out, but that does not mean the results are not usable.

This is fair. Just to note, we of course can't have every @job run to completion without error, as errors still need to be displayed in things like workflow engine dashboards so users can troubleshoot. But I agree that things like MD timeouts are usually not errors per se.

Andrew-S-Rosen commented 1 month ago

Closing in favor of #2407, #2408, #2409. The title proposal of a restart mode is unlikely to be implemented in quacc directly.