Co-simulation hangs when 1 federate fails

mnblonsky commented 1 year ago

I'm running a federation with a bug in one of the federates, which causes that federate to fail. The other federates have no timeout check, so they hang waiting for communication from the failed federate. As a result, the whole federation hangs rather than failing. This seems like a bug, but maybe it's a feature: when one federate in a federation fails, should the whole federation fail?

It seems the behavior could be changed by polling with process.poll() (or calling process.wait() with a timeout) rather than blocking on process.wait() here: https://github.com/GMLC-TDC/pyhelics/blob/796ff37a0de51beaed32ab98dd544c8aa4c954f4/helics/cli.py#L285
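
To illustrate the idea, here is a minimal sketch of a polling loop built only on the standard `subprocess` module. It is not the pyhelics runner code; the function name `run_federation` and the `poll_interval` and `grace_period` parameters are hypothetical, and the commands in the usage part are stand-ins for real federates.

```python
import subprocess
import sys
import time


def run_federation(commands, poll_interval=0.5, grace_period=5.0):
    """Launch every command; if one exits non-zero, stop the rest.

    Hypothetical helper, not part of pyhelics. Returns the list of
    return codes in the same order as `commands`.
    """
    processes = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            # poll() returns None while a process is running, else its exit code.
            codes = [p.poll() for p in processes]
            if all(code is not None for code in codes):
                break  # every process has finished on its own
            if any(code not in (None, 0) for code in codes):
                break  # at least one process failed; stop waiting
            time.sleep(poll_interval)
    finally:
        # Ask any survivors to exit, then force-kill after the grace period.
        for p in processes:
            if p.poll() is None:
                p.terminate()
        for p in processes:
            try:
                p.wait(timeout=grace_period)
            except subprocess.TimeoutExpired:
                p.kill()
                p.wait()
    return [p.returncode for p in processes]


if __name__ == "__main__":
    # Stand-ins for federates: one fails immediately, one would run for a minute.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    hanging = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_federation([failing, hanging], poll_interval=0.1))
```

With a blocking `wait()` on each process in turn, the second process here would hold the runner for the full minute; with the polling loop, the failure of the first process is noticed within one `poll_interval` and the second is terminated.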

I haven't made a minimal working example yet, but I can make one if you think it's necessary.

Environment

Operating System: Linux (HPC)
Installation: `pip install helics[cli]`
helics and pyhelics version:

helics, version v3.3.2

Python HELICS version v3.3.2

HELICS Library version 3.3.2 (2022-12-02)

{
    "buildflags": " -static-libstdc++ -static-libgcc -O3 -DNDEBUG -static-libstdc++ -static-libgcc  $<$<COMPILE_LANGUAGE:CXX>:-std=c++17>",
    "compiler": "Unix Makefiles  Linux-5.15.0-1023-azure:GNU-8.3.1",
    "cores": [
        "zmq",
        "zmqss",
        "tcp",
        "tcpss",
        "udp",
        "ipc",
        "interprocess",
        "inproc"
    ],
    "cpu": " Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz",
    "cpucount": 24,
    "cputype": "x86_64",
    "hostname": "el3",
    "memory": "192789 MB",
    "os": "Linux  3.10.0-1062.9.1.el7.x86_64  #1 SMP Fri Dec 6 15:49:49 UTC 2019",
    "version": {
        "build": "",
        "major": 3,
        "minor": 3,
        "patch": 2,
        "string": "3.3.2 (2022-12-02)"
    },
    "zmqversion": "ZMQ v4.3.4"
}

trevorhardy commented 1 year ago

@mnblonsky, is this still an issue?

To answer the question: the HELICS runner is designed to kill the entire federation when one federate fails, so if a hang is what you're seeing, I'd say you're right to call it a bug. There is a flag you can add to the federates that kills them if one federate in the federation dies: `terminate_on_error`
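
As a rough sketch of how the flag might be set, the snippet below builds a federate JSON configuration with `terminate_on_error` enabled. The helper name `make_federate_config` and the federate name `ExampleFederate` are hypothetical, and placing the flag at the top level of the config is an assumption worth checking against the HELICS configuration documentation for your version.

```python
import json


def make_federate_config(name, terminate_on_error=True):
    """Build a minimal federate config dict (hypothetical helper).

    Assumption: the flag is accepted as a top-level key in the
    federate's JSON configuration file.
    """
    return {
        "name": name,
        "terminate_on_error": terminate_on_error,
    }


# "ExampleFederate" is a placeholder name, not one from this issue.
config = make_federate_config("ExampleFederate")
print(json.dumps(config, indent=2))
```

The resulting JSON would be saved as the federate's config file and merged with whatever publications, subscriptions, and timing options the federate already uses.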

(And sorry this comment is literally months late. I forgot to watch this repository and didn't even see this new issue.)

mnblonsky commented 1 year ago

That looks like it will do the trick. I'll try it out soon. Thanks!

mnblonsky commented 1 year ago

Yes, that worked, thanks @trevorhardy. This still may be an issue for other cases if you don't want to use terminate_on_error, but maybe that's not very common.

trevorhardy commented 1 year ago

OK; good to hear.

As I've been thinking about it more, I'm not sure the death of one federate should necessarily cause the entire federation to fail (given how I understand the HELICS architecture). I'll close this out for now.

GMLC-TDC / pyhelics

Co-simulation hangs when 1 federate fails #66