Any idea what could be the cause of this?
I can't be sure without seeing more (e.g. server) logs. These can be enabled with the following command:
$ RUST_LOG=hq=debug,tako=debug RUST_BACKTRACE=full hq server start
It seems like it was some race condition where we tried to send a message to a client that was disconnecting at the same time. In any case, this should not crash the server. I changed this behavior in this PR.
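For illustration only, here is a minimal sketch of the idea behind that change. This is not the actual HyperQueue code; the channel type and function name are made up. The point is that a failed send to a client that has just disconnected is treated as an expected event and logged, instead of being unwrapped and crashing the server.

```rust
use tokio::sync::mpsc::Sender;

// Hypothetical sketch: a reply to a client may fail if the client
// disconnected in the meantime. Log it and keep the server running.
async fn reply_to_client(tx: &Sender<Vec<u8>>, reply: Vec<u8>) {
    if let Err(err) = tx.send(reply).await {
        // The receiving end (the client connection task) is already gone;
        // this is an expected race with a disconnect, not a server error.
        log::debug!("Client disconnected before the reply could be sent: {err}");
    }
}
```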
I think I now know how I managed to create this panic with the version before the patch.
Running hq dashboard
caused the crash twice (sometimes it ran fine).
Running the latest hq now: commit ccec902f6a45fb6230bf709bdf2cf4d22aad8ff7
The server doesn't panic anymore, but after I killed one worker instance, starting hq dashboard closes the dashboard after about one second and returns ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError })
in the server:
$ RUST_LOG=hq=debug,tako=debug RUST_BACKTRACE=full hq dashboard
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Worker authorization started
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Challenge verification started
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Challenge verification finished
$
$ hq server
...
2023-01-18T11:34:12Z INFO Worker 1 registered from 10.118.228.69:46330
2023-01-18T11:34:12Z WARN Worker 1 belongs to an unknown allocation 60035964
2023-01-18T11:35:09Z INFO Worker 2 registered from 10.118.228.70:33974
2023-01-18T11:35:09Z WARN Worker 2 belongs to an unknown allocation 60035967
2023-01-18T11:36:37Z INFO Worker 3 registered from 10.118.228.76:53584
2023-01-18T11:36:37Z WARN Worker 3 belongs to an unknown allocation 60036535
2023-01-18T11:36:45Z INFO Worker 4 registered from 10.118.228.79:47034
2023-01-18T11:36:45Z WARN Worker 4 belongs to an unknown allocation 60036536
2023-01-18T11:36:54Z INFO Worker 5 registered from 10.118.228.80:40800
2023-01-18T11:36:54Z WARN Worker 5 belongs to an unknown allocation 60036537
2023-01-18T11:37:01Z INFO Worker 6 registered from 10.118.228.81:54800
2023-01-18T11:37:01Z WARN Worker 6 belongs to an unknown allocation 60036538
2023-01-18T11:37:06Z INFO Worker 7 registered from 10.118.230.3:58912
2023-01-18T11:37:17Z INFO Worker 8 registered from 10.118.230.3:58928
2023-01-18T11:37:32Z INFO Worker 9 registered from 10.118.228.73:60644
2023-01-18T11:37:32Z WARN Worker 9 belongs to an unknown allocation 60036539
2023-01-18T22:14:44Z INFO Worker 7 connection closed (connection: 10.118.230.3:58912)
2023-01-18T22:15:28Z ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError })
2023-01-18T22:16:07Z ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError })
How many jobs were there in HyperQueue at the time it started to return these errors? The error seems to be caused by hitting a safety limit in HQ on the maximum message size, which prevents sending overly large messages.
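For context, the error shape in the log points at tokio's LengthDelimitedCodec, which fails with exactly this InvalidInput / LengthDelimitedCodecError when a frame exceeds its configured max_frame_length. A standalone sketch of that behavior (the 1 KiB limit below is made up for the demonstration and is not HQ's actual configuration):

```rust
use bytes::{Bytes, BytesMut};
use tokio_util::codec::{Encoder, LengthDelimitedCodec};

fn main() {
    // Deliberately tiny 1 KiB frame limit, just to trigger the error.
    let mut codec = LengthDelimitedCodec::builder()
        .max_frame_length(1024)
        .new_codec();

    // A 4 KiB payload exceeds the limit, so encoding fails with an io::Error
    // of kind InvalidInput wrapping LengthDelimitedCodecError -- the same
    // error shape as "Cannot reply to client" in the server log above.
    let oversized = Bytes::from(vec![0u8; 4096]);
    let mut buf = BytesMut::new();
    match codec.encode(oversized, &mut buf) {
        Ok(()) => println!("frame encoded ({} bytes on the wire)", buf.len()),
        Err(err) => println!("Cannot encode frame: {err:?}"),
    }
}
```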
Btw, the dashboard is highly experimental and hasn't been updated for a long time, so it's possible that it's misbehaving.
There were either 548 or 741 jobs in the queue (failed jobs were resubmitted); I don't remember exactly which.
Good to know that the dashboard is not ready for primetime.
hq server panicked after a few hours of running: