alexandermichels commented 1 year ago

It appears that if Core cannot connect to any single HPC when submitting a job, it will crash. Error stack below:

1676474207QdpNk: [event] JOB_QUEUED job [1676474207QdpNk] is queued, waiting for registration
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "POST /job/1676474207QdpNk/submit HTTP/1.1" 200 945 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
1676474207QdpNk: [event] JOB_REGISTERED job [1676474207QdpNk] is registered with the supervisor, waiting for initialization
10.0.2.4 - - [15/Feb/2023:15:16:58 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:17:08 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
Error: Timed out while waiting for handshake
    at Timeout._onTimeout (/job_supervisor/node_modules/ssh2/lib/client.js:695:19)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7)
Error: Not connected
    at Client.exec (/job_supervisor/node_modules/ssh2/lib/client.js:722:11)
    at /job_supervisor/node_modules/node-ssh/lib/cjs/index.js:252:24
    at new Promise (<anonymous>)
    at NodeSSH.execCommand (/job_supervisor/node_modules/node-ssh/lib/cjs/index.js:251:16)
    at Supervisor.<anonymous> (/job_supervisor/production/src/Supervisor.js:179:55)
    at step (/job_supervisor/production/src/Supervisor.js:33:23)
    at Object.throw (/job_supervisor/production/src/Supervisor.js:14:53)
    at rejected (/job_supervisor/production/src/Supervisor.js:6:65)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

We need better error handling for these cases so that a failed connection doesn't bring down Core. Off the top of my head, we could:

Reject the job and tell the user that the HPC couldn't be reached.
Add exponential backoff on the connection: if we can't connect, try again after 1 second, then 2 seconds, 4, 8, ... until the job can go through.
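
To make the two options concrete, here is a minimal TypeScript sketch that combines them: retry the connection with exponential backoff, and if it still fails, reject the job with a message instead of letting the error propagate. All names here (`withBackoff`, `submitOrReject`, `BackoffOptions`, `SubmitResult`) are hypothetical and not taken from the Core codebase. The `connect` callback stands in for whatever call currently opens the SSH connection. Unlike the "until the job can go through" wording above, the sketch caps the number of attempts so a permanently unreachable HPC cannot block a job forever.

```typescript
// Hypothetical sketch; none of these names exist in cybergis-compute-core.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface BackoffOptions {
  maxAttempts: number; // total attempts before giving up (should be >= 1)
  baseDelayMs: number; // wait before the second attempt; doubles each time
  sleep?: Sleep;       // injectable so the delays can be skipped in tests
}

// Run `action`, retrying on failure with delays of base, 2*base, 4*base, ...
async function withBackoff<T>(
  action: () => Promise<T>,
  opts: BackoffOptions
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await action();
    } catch (err) {
      lastError = err;
      // Do not sleep after the final failed attempt.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(opts.baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}

type SubmitResult = { ok: true } | { ok: false; reason: string };

// Option 1 and option 2 together: back off, then reject the job cleanly
// rather than letting the connection error escape and crash the process.
async function submitOrReject(
  connect: () => Promise<void>,
  opts: BackoffOptions
): Promise<SubmitResult> {
  try {
    await withBackoff(connect, opts);
    return { ok: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, reason: `could not connect to HPC: ${message}` };
  }
}
```

With `baseDelayMs: 1000` this gives the 1, 2, 4, 8, ... second schedule described above. The `reason` string is what would be surfaced to the user when the job is rejected.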

alexandermichels commented 9 months ago

#105 is currently being tested to solve this issue.

alexandermichels commented 6 months ago

@JTSIV1 This is solved by https://github.com/cybergis/cybergis-compute-core/pull/108, correct? Is the only thing left to do to have the SDK catch the error?

cybergis / cybergis-compute-core

[Bug] Compute Crashes if it Can't Connect to HPC #85
