It appears that Core crashes when it cannot connect to any one of the HPCs while submitting a job. Error stack below:
1676474207QdpNk: [event] JOB_QUEUED job [1676474207QdpNk] is queued, waiting for registration
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "POST /job/1676474207QdpNk/submit HTTP/1.1" 200 945 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
1676474207QdpNk: [event] JOB_REGISTERED job [1676474207QdpNk] is registered with the supervisor, waiting for initialization
10.0.2.4 - - [15/Feb/2023:15:16:58 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:17:08 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
Error: Timed out while waiting for handshake
at Timeout._onTimeout (/job_supervisor/node_modules/ssh2/lib/client.js:695:19)
at listOnTimeout (node:internal/timers:557:17)
at processTimers (node:internal/timers:500:7)
Error: Not connected
at Client.exec (/job_supervisor/node_modules/ssh2/lib/client.js:722:11)
at /job_supervisor/node_modules/node-ssh/lib/cjs/index.js:252:24
at new Promise (<anonymous>)
at NodeSSH.execCommand (/job_supervisor/node_modules/node-ssh/lib/cjs/index.js:251:16)
at Supervisor.<anonymous> (/job_supervisor/production/src/Supervisor.js:179:55)
at step (/job_supervisor/production/src/Supervisor.js:33:23)
at Object.throw (/job_supervisor/production/src/Supervisor.js:14:53)
at rejected (/job_supervisor/production/src/Supervisor.js:6:65)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
We need better error handling for these cases so a failed connection doesn't bring down Core. Off the top of my head, we could:
Reject the job and tell the user that the HPC could not be reached.
Retry the connection with exponential backoff: if we can't connect, try again after 1 second, then 2, 4, 8, ... seconds until the job goes through.
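To make the two options concrete, here is a rough sketch of how they might combine: retry with exponential backoff, then reject the job with a readable error instead of letting the rejection escape. This is not the existing Supervisor code. `connectWithBackoff`, `submitOrReject`, the `connect` callback, and the options `maxAttempts`, `baseDelayMs`, and `wait` are all hypothetical names. Capping the number of attempts is my assumption, so that a job aimed at a permanently unreachable HPC eventually fails rather than retrying forever.

```javascript
// Hypothetical helper, not part of the current code base.
// Default wait: resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `connect` (any async function that throws on failure) and retries
// with delays of baseDelayMs, 2x, 4x, ... between attempts.
async function connectWithBackoff(connect, options = {}) {
  const { maxAttempts = 5, baseDelayMs = 1000, wait = sleep } = options;
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts - 1) {
        await wait(baseDelayMs * 2 ** attempt);
      }
    }
  }
  const error = new Error(
    `Unable to connect to HPC after ${maxAttempts} attempts: ${lastError.message}`
  );
  error.cause = lastError;
  throw error;
}

// Hypothetical caller: turns a connection failure into a rejected job
// instead of an uncaught error that takes the process down.
async function submitOrReject(connect, options) {
  try {
    const connection = await connectWithBackoff(connect, options);
    return { status: 'connected', connection };
  } catch (err) {
    return { status: 'rejected', reason: err.message };
  }
}
```

With the defaults this waits 1, 2, 4, and 8 seconds between five attempts. The `wait` option exists only so the delay can be swapped out in tests. The caller would still need to report the `rejected` status back to the user through whatever job event mechanism Core already uses.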