Closed by mnblonsky 1 year ago
@mnblonsky, is this still an issue?
To answer the question: the HELICS runner is designed to kill the entire federation when one federate fails, so if a hang is what you're seeing, I would say you're right in calling it a bug. There is a flag you can add to the federates that kills them if one federate in the federation dies: terminate_on_error
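As a rough sketch of setting that flag, assuming the federate is configured from a JSON file: only terminate_on_error comes from this thread, the federate name and core type are placeholder values, and the exact key spelling should be checked against the HELICS configuration reference.

```python
import json

# Hypothetical federate configuration. Only "terminate_on_error" is taken
# from the discussion; the name and core type are placeholder values.
federate_config = {
    "name": "example_federate",
    "core_type": "zmq",
    "terminate_on_error": True,
}

# Serialize to the JSON text that would be saved as the federate's config file.
config_text = json.dumps(federate_config, indent=2)
print(config_text)
```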
(And sorry this comment is literally months late. I forgot to watch this repository and didn't even see this new issue.)
That looks like it will do the trick. I'll try it out soon. Thanks!
Yes, that worked, thanks @trevorhardy. This still may be an issue for other cases if you don't want to use terminate_on_error, but maybe that's not very common.
OK; good to hear.
As I've been thinking about it more, I'm not sure the death of one federate should necessarily cause the entire federation to fail (given how I understand the HELICS architecture). I'll close this out for now.
I'm running a federation with a bug in one of the federates, which causes that federate to fail. The other federates do not have any timeout check, so they get hung waiting to receive communications from the failed federate. This causes the whole federation to get hung, rather than fail. This seems like a bug, but maybe it's a feature: When 1 federate within a federation fails, should the whole federation fail?
It seems the behavior could be changed if you use process.poll() with a timeout rather than process.wait() here: https://github.com/GMLC-TDC/pyhelics/blob/796ff37a0de51beaed32ab98dd544c8aa4c954f4/helics/cli.py#L285
I haven't made a minimal working example yet, but I can make one if you think it's necessary.
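To illustrate the polling idea, here is a minimal sketch that is not taken from the pyhelics runner. It assumes the federates are plain subprocesses, and the function name run_until_failure and the stand-in commands are hypothetical. It polls each process in a loop and terminates the rest as soon as one exits with a non-zero code or an optional overall timeout expires.

```python
import subprocess
import sys
import time


def run_until_failure(commands, poll_interval=0.1, timeout=None):
    """Start every command, then poll; stop them all if any one fails.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    deadline = None if timeout is None else time.monotonic() + timeout
    try:
        while True:
            codes = [p.poll() for p in procs]  # None means still running
            failed = any(c not in (None, 0) for c in codes)
            timed_out = deadline is not None and time.monotonic() > deadline
            if failed or timed_out or all(c is not None for c in codes):
                break
            time.sleep(poll_interval)
    finally:
        # Stop whatever is still running so nothing is left hanging.
        for p in procs:
            if p.poll() is None:
                p.terminate()
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Hypothetical stand-ins for federates: one crashes, one would wait a minute.
    crashing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    waiting = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_until_failure([crashing, waiting]))
```

With a blocking process.wait() on the second process, the caller would sit for the full minute; with polling, it returns shortly after the first process fails.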
Environment