MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

Compute managers shutting down early #784

Status: Closed (peastman closed this 7 months ago)

peastman commented 8 months ago

I'm seeing a lot of cases of compute managers shutting down. The logs report a server issue:

[2023-10-26 12:43:37 PDT]     INFO: ComputeManager: Executor local_executor has 0 active tasks and 3 open slots
[2023-10-26 12:43:37 PDT]  WARNING: ComputeManager: Acquisition of new tasks failed: Unable to refresh JWT authorization token! This is a server issue!!
[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Heartbeat failed: Unable to refresh JWT authorization token! This is a server issue!!. QCFractal server down?
[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Missed 6 heartbeats so far
[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Too many failed heartbeats, shutting down.

What could be causing this?
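To gauge how often this failure shows up in a manager log, the lines can be scanned for the JWT warning. The following is a minimal sketch that assumes only the log layout shown above (`[timestamp]  LEVEL: Component: message`). The `summarize` helper and the `LOG_LINE` pattern are hypothetical names for illustration and are not part of QCFractal.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# "[timestamp]  LEVEL: Component: message"
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+):\s+(?P<component>\w+):\s+(?P<msg>.*)$"
)

def summarize(lines):
    """Count lines per log level and collect the JWT refresh failures."""
    levels = Counter()
    jwt_failures = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # skip anything that does not follow the assumed layout
        levels[m["level"]] += 1
        if "Unable to refresh JWT authorization token" in m["msg"]:
            jwt_failures.append((m["ts"], m["msg"]))
    return levels, jwt_failures

# The five lines quoted in this issue
sample = [
    "[2023-10-26 12:43:37 PDT]     INFO: ComputeManager: Executor local_executor has 0 active tasks and 3 open slots",
    "[2023-10-26 12:43:37 PDT]  WARNING: ComputeManager: Acquisition of new tasks failed: Unable to refresh JWT authorization token! This is a server issue!!",
    "[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Heartbeat failed: Unable to refresh JWT authorization token! This is a server issue!!. QCFractal server down?",
    "[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Missed 6 heartbeats so far",
    "[2023-10-26 12:44:03 PDT]  WARNING: ComputeManager: Too many failed heartbeats, shutting down.",
]

levels, jwt_failures = summarize(sample)
print(dict(levels))
print(len(jwt_failures))
```

Run against a full log file (for example `summarize(open(path))`, where `path` is wherever the manager writes its log), this would show whether the shutdowns consistently coincide with the token refresh warning.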

bennybp commented 8 months ago

This was a bug on my part. I just pushed a fix this afternoon when I saw the same thing (#783).

In the meantime, I will increase the refresh token lifetime, which should help until the next release (which I want to make pretty soon).

bennybp commented 8 months ago

I just increased the refresh token lifetime, but currently running managers may still see this error if they run longer than 24 hours. Newly created managers should be good for a month.

peastman commented 8 months ago

Thanks! Since the longest job allowed in any queue is seven days, I think a month should be plenty.
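The reasoning here is a simple comparison of durations, which can be sketched as below. This assumes the refresh token is issued when the manager starts and is not renewed afterwards. The 24-hour value is inferred from the comment about currently running managers, and "a month" is taken as 30 days. `outlives_token` is a hypothetical helper, not a QCFractal function.

```python
from datetime import timedelta

def outlives_token(max_runtime: timedelta, refresh_lifetime: timedelta) -> bool:
    """Return True if a manager could still be running when its refresh token expires."""
    return max_runtime > refresh_lifetime

max_job = timedelta(days=7)         # longest job allowed in any queue (from the thread)
old_lifetime = timedelta(hours=24)  # assumption: inferred from "longer than 24 hours"
new_lifetime = timedelta(days=30)   # assumption: "a month" taken as 30 days

at_risk_before = outlives_token(max_job, old_lifetime)
at_risk_after = outlives_token(max_job, new_lifetime)
print(at_risk_before)  # True: a 7-day manager outlasts a 24-hour token
print(at_risk_after)   # False: a 30-day token covers a 7-day manager
```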

bennybp commented 7 months ago

Fixed and released in v0.52.