ITISFoundation / osparc-meta-dakota

oSPARC service to run a Dakota study based on a Dakota configuration file.

Resilience to "soft" errors #5

Open JavierGOrdonnez opened 2 weeks ago

JavierGOrdonnez commented 2 weeks ago

User story

As a user, I would like to be able to iterate on a failing dakota.in file (and osparc template) until it runs successfully.

Current behaviour

Dakota Service returns an error and is not reactive to further inputs (see this line).

Although a dakota.in could be iterated on in a local setup to avoid this issue, this poses two complications:

By "soft errors" I mean everything of the form:

DakotaService: [osparc-meta-dakota:0.1.0] Traceback (most recent call last):
DakotaService: [osparc-meta-dakota:0.1.0]   File "/docker/dakota-start.py", line 55, in main
DakotaService: [osparc-meta-dakota:0.1.0]     dakota_service.start()
DakotaService: [osparc-meta-dakota:0.1.0]   File "/docker/dakota-start.py", line 120, in start
DakotaService: [osparc-meta-dakota:0.1.0]     self.start_dakota(dakota_conf, self.output0_dir_path)
DakotaService: [osparc-meta-dakota:0.1.0]   File "/docker/dakota-start.py", line 166, in start_dakota
DakotaService: [osparc-meta-dakota:0.1.0]     study.execute()
DakotaService: [osparc-meta-dakota:0.1.0] RuntimeError: Dakota aborted: Unknown error 252
...

i.e. everything that is correctly handled by the except statement mentioned above. Hard errors (e.g. the script failing somewhere else) are out of scope.

Desired behaviour

Such errors should be logged as they are now, but the script should then return to the state at line 55, i.e. DakotaService.start() is executed again.

The DakotaService object should stay the same (so that no new handshake is needed), record which input file caused the error, and only proceed to execution once a new dakota.in is sent.
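A minimal sketch of the loop described above, under stated assumptions: `run_until_success` and the `run_study` callable are hypothetical names, not part of the service code, and `run_study` stands in for whatever currently happens inside `start_dakota`. It also assumes the Dakota run can be invoked again in the same process after a failure, which has not been confirmed.

```python
import hashlib
import logging
import time
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger("DakotaService")


def run_until_success(
    conf_path: Path,
    run_study: Callable[[Path], None],  # hypothetical stand-in for the Dakota run
    poll_interval: float = 1.0,
) -> int:
    """Run the study, retrying after soft errors; return the number of attempts."""
    failed_digest: Optional[str] = None  # contents of the last dakota.in that failed
    attempts = 0
    while True:
        if not conf_path.exists():
            time.sleep(poll_interval)
            continue
        digest = hashlib.sha256(conf_path.read_bytes()).hexdigest()
        if digest == failed_digest:
            # Same file that already failed: wait for a new dakota.in.
            time.sleep(poll_interval)
            continue
        attempts += 1
        try:
            run_study(conf_path)
        except RuntimeError:
            # Soft error: log it as before, remember the failing input, keep going.
            logger.exception("Dakota run failed; waiting for a new dakota.in")
            failed_digest = digest
            continue
        return attempts
```

The object that owns the handshake lives outside this loop, so it is never recreated. Only `RuntimeError` is caught, so hard errors still propagate.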

PS: I don't know whether the sidecar repeatedly copies the Notebook output to the Dakota input, or only does so when that output has changed. That will affect how the "new" dakota.in detection should be carried out: by file information, watchdog, or file contents.
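To illustrate why this matters, here is a small sketch comparing detection by file information with detection by file contents. The helper names `stat_signature` and `content_signature` are hypothetical. If the sidecar re-copies an unchanged file, the modification time may change while the contents do not.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def stat_signature(path: Path) -> Tuple[int, int]:
    """Detection by file information: modification time (ns) and size."""
    st = path.stat()
    return (st.st_mtime_ns, st.st_size)


def content_signature(path: Path) -> str:
    """Detection by file contents: SHA-256 of the bytes on disk."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

With repeated copies of an identical file, `stat_signature` can report a "new" dakota.in that would fail in exactly the same way, whereas `content_signature` only changes when the file was actually edited. A watchdog-style observer would likely see every copy too, so it would need the same content check on top.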

wvangeit commented 2 weeks ago

Did you try this on your own machine to see whether this behavior actually works in Dakota? I changed the service code, but it seems the Dakota Python process, as I was afraid, is not able to recover from an error.

JavierGOrdonnez commented 2 weeks ago

I can test the dakota-itis wheel, but not the oSPARC service (I don't deploy locally) nor the interface with Python (which is to be handled by the ParallelRunner). I would be interested to see your setup, and maybe I can investigate it myself as well. Thank you.

wvangeit commented 2 weeks ago

It has nothing to do with the service itself though. My question was if you tried it locally with the dakota wheel. It seems dakota python can't recover from these errors. (unless you found a way around it)