It4innovations / hyperqueue

Scheduler for sub-node tasks for HPC systems with batch scheduling
https://it4innovations.github.io/hyperqueue
MIT License

hq server panics after a few hours of running #543

Closed ghuls closed 1 year ago

ghuls commented 1 year ago

hq server panics after a few hours of running:

$ hq server start --host 10.118.228.68                                                                                                                                                                                                                      

2023-01-11T17:47:52Z INFO No online server found, starting a new server
2023-01-11T17:47:52Z INFO Saving access file as '/home/user/.hq-server/003/access.json'
+------------------+--------------------------------------+
| Server directory | /home/user/.hq-server                |
| Server UID       | QEHbE2                               |
| Host             | 10.118.228.68                        |
| Pid              | 2748566                              |
| HQ port          | 34803                                |
| Workers port     | 44075                                |
| Start date       | 2023-01-11 17:47:52 UTC              |
| Version          | 0.13.0                               |
+------------------+--------------------------------------+

2023-01-11T17:48:03Z INFO Worker 1 registered from 10.118.228.68:45078
2023-01-11T17:48:03Z WARN Worker 1 belongs to an unknown allocation 60033458
2023-01-11T17:48:32Z INFO Worker 2 registered from 10.118.228.76:56608
2023-01-11T17:48:32Z WARN Worker 2 belongs to an unknown allocation 60034375
2023-01-11T18:37:28Z INFO Worker 3 registered from 10.118.228.80:41310
2023-01-11T18:37:28Z WARN Worker 3 belongs to an unknown allocation 60034389
2023-01-11T18:37:46Z INFO Worker 4 registered from 10.118.228.81:60110
2023-01-11T18:37:46Z WARN Worker 4 belongs to an unknown allocation 60034390
2023-01-11T18:38:07Z INFO Worker 5 registered from 10.118.228.70:55548
2023-01-11T18:38:07Z WARN Worker 5 belongs to an unknown allocation 60034391
2023-01-11T18:38:28Z INFO Worker 6 registered from 10.118.228.72:54806
2023-01-11T18:38:28Z WARN Worker 6 belongs to an unknown allocation 60034392
2023-01-11T18:38:45Z INFO Worker 7 registered from 10.118.228.73:46906
2023-01-11T18:38:45Z WARN Worker 7 belongs to an unknown allocation 60034393
2023-01-11T18:39:04Z INFO Worker 8 registered from 10.118.228.77:50892
2023-01-11T18:39:04Z WARN Worker 8 belongs to an unknown allocation 60034394
2023-01-11T18:39:51Z INFO Worker 9 registered from 10.118.228.78:51762
2023-01-11T18:39:51Z WARN Worker 9 belongs to an unknown allocation 60034395
2023-01-11T21:17:07Z INFO Worker 10 registered from 10.118.228.75:33310
2023-01-11T21:17:07Z WARN Worker 10 belongs to an unknown allocation 60034406
thread 'main' panicked at 'assertion failed: tx.send(response).await.is_ok()', /software/hyperqueue/crates/hyperqueue/src/server/client/mod.rs:128:17                                                                    
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted
ghuls commented 1 year ago

Any idea what could be the cause of this?

Kobzol commented 1 year ago

I can't be sure without seeing more (e.g. server) logs. These can be enabled with the following command:

$ RUST_LOG=hq=debug,tako=debug RUST_BACKTRACE=full hq server start

It seems like it was a race condition where we tried to send a message to a client that was disconnecting at the same time. In any case, this should not crash the server. I changed this behavior in this PR.
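
For illustration, a minimal sketch (not the actual PR, and not HyperQueue's real types) of the kind of change described above, assuming the reply is sent over a tokio mpsc channel: a failed send to an already-disconnected client is logged instead of asserted on, so the server keeps running.

use tokio::sync::mpsc::Sender;

// Hypothetical stand-in for HQ's client response message type.
#[derive(Debug)]
struct Response;

async fn reply_to_client(tx: &Sender<Response>, response: Response) {
    // Old behavior: assert!(tx.send(response).await.is_ok()); panics if the
    // client disconnects concurrently. New behavior: log a warning and continue.
    if tx.send(response).await.is_err() {
        log::warn!("Cannot send response to client: connection already closed");
    }
}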

ghuls commented 1 year ago

I think I now know how I managed to create this panic with the version before the patch.

Running hq dashboard caused the crash twice (sometimes it ran fine).

ghuls commented 1 year ago

Running the latest hq now: commit ccec902f6a45fb6230bf709bdf2cf4d22aad8ff7

The server doesn't panic anymore, but after I killed one worker instance, starting hq dashboard closes the dashboard after one second and the server logs ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError }):

$ RUST_LOG=hq=debug,tako=debug RUST_BACKTRACE=full hq dashboard
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Worker authorization started
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Challenge verification started
[2023-01-18T22:21:54.320Z DEBUG tako::internal::transfer::auth] Challenge verification finished

$ 
$ hq server
...
2023-01-18T11:34:12Z INFO Worker 1 registered from 10.118.228.69:46330
2023-01-18T11:34:12Z WARN Worker 1 belongs to an unknown allocation 60035964
2023-01-18T11:35:09Z INFO Worker 2 registered from 10.118.228.70:33974
2023-01-18T11:35:09Z WARN Worker 2 belongs to an unknown allocation 60035967
2023-01-18T11:36:37Z INFO Worker 3 registered from 10.118.228.76:53584
2023-01-18T11:36:37Z WARN Worker 3 belongs to an unknown allocation 60036535
2023-01-18T11:36:45Z INFO Worker 4 registered from 10.118.228.79:47034
2023-01-18T11:36:45Z WARN Worker 4 belongs to an unknown allocation 60036536
2023-01-18T11:36:54Z INFO Worker 5 registered from 10.118.228.80:40800
2023-01-18T11:36:54Z WARN Worker 5 belongs to an unknown allocation 60036537
2023-01-18T11:37:01Z INFO Worker 6 registered from 10.118.228.81:54800
2023-01-18T11:37:01Z WARN Worker 6 belongs to an unknown allocation 60036538
2023-01-18T11:37:06Z INFO Worker 7 registered from 10.118.230.3:58912
2023-01-18T11:37:17Z INFO Worker 8 registered from 10.118.230.3:58928
2023-01-18T11:37:32Z INFO Worker 9 registered from 10.118.228.73:60644
2023-01-18T11:37:32Z WARN Worker 9 belongs to an unknown allocation 60036539
2023-01-18T22:14:44Z INFO Worker 7 connection closed (connection: 10.118.230.3:58912)
2023-01-18T22:15:28Z ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError })
2023-01-18T22:16:07Z ERROR Cannot reply to client: IoError(Custom { kind: InvalidInput, error: LengthDelimitedCodecError })
Kobzol commented 1 year ago

How many jobs were there in HyperQueue at the time it started to return these errors? The error seems to point to a safety limit on message size in HQ that prevents overly large messages from being sent.
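
For context, a minimal sketch (not HyperQueue's actual code) of one way that error can arise with tokio_util's LengthDelimitedCodec, which the log message points to: encoding a frame larger than the configured max_frame_length fails with kind InvalidInput and a LengthDelimitedCodecError instead of sending the message. The 1 KiB limit below is purely illustrative.

use bytes::{Bytes, BytesMut};
use tokio_util::codec::{Encoder, LengthDelimitedCodec};

fn main() {
    // Hypothetical limit; HQ configures its own maximum message size.
    let mut codec = LengthDelimitedCodec::builder()
        .max_frame_length(1024)
        .new_codec();

    let oversized = Bytes::from(vec![0u8; 2048]); // reply larger than the limit
    let mut buf = BytesMut::new();

    // The encoder refuses the frame instead of emitting it.
    let err = codec.encode(oversized, &mut buf).unwrap_err();
    println!("{err:?}"); // Custom { kind: InvalidInput, error: LengthDelimitedCodecError }
}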

Btw, the dashboard is highly experimental and hasn't been updated for a long time, so it's possible that it's misbehaving.

ghuls commented 1 year ago

There were 548 or 741 jobs in the queue (failed jobs were resubmitted); I don't remember exactly which.

Good to know that the dashboard is not ready for primetime.