cybergis / cybergis-compute-core

Apache License 2.0

[Bug] Compute Crashes if it Can't Connect to HPC #85

Closed alexandermichels closed 6 months ago

alexandermichels commented 1 year ago

It appears that Core crashes if it cannot connect to any single HPC when submitting a job. Log output and error stack below:

1676474207QdpNk: [event] JOB_QUEUED job [1676474207QdpNk] is queued, waiting for registration
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "POST /job/1676474207QdpNk/submit HTTP/1.1" 200 945 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:16:47 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1146 "-" "-"
1676474207QdpNk: [event] JOB_REGISTERED job [1676474207QdpNk] is registered with the supervisor, waiting for initialization
10.0.2.4 - - [15/Feb/2023:15:16:58 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
10.0.2.4 - - [15/Feb/2023:15:17:08 +0000] "GET /job/1676474207QdpNk HTTP/1.1" 200 1378 "-" "-"
Error: Timed out while waiting for handshake
    at Timeout._onTimeout (/job_supervisor/node_modules/ssh2/lib/client.js:695:19)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7)
Error: Not connected
    at Client.exec (/job_supervisor/node_modules/ssh2/lib/client.js:722:11)
    at /job_supervisor/node_modules/node-ssh/lib/cjs/index.js:252:24
    at new Promise (<anonymous>)
    at NodeSSH.execCommand (/job_supervisor/node_modules/node-ssh/lib/cjs/index.js:251:16)
    at Supervisor.<anonymous> (/job_supervisor/production/src/Supervisor.js:179:55)
    at step (/job_supervisor/production/src/Supervisor.js:33:23)
    at Object.throw (/job_supervisor/production/src/Supervisor.js:14:53)
    at rejected (/job_supervisor/production/src/Supervisor.js:6:65)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

We need better error handling for these cases so that a failed HPC connection doesn't bring down Core. Off the top of my head, we could:
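One possible direction, as a minimal sketch rather than the actual Supervisor code: wrap the connect and exec calls so that a failed handshake is recorded on the job instead of escaping as an unhandled rejection. Everything named here is an assumption for illustration: `runStepSafely`, the shape of `job`, and the `JOB_FAILED` event type are hypothetical, and `ssh` is any object with `connect`, `execCommand`, and `dispose` methods (loosely modeled on node-ssh, which the trace above shows in use).

```javascript
// Hypothetical helper: run one remote command for a job without letting
// a connection failure propagate and crash the process.
// `ssh` is assumed to expose connect(), execCommand(cmd), and dispose().
async function runStepSafely(job, ssh, command) {
  try {
    await ssh.connect(); // may reject, e.g. "Timed out while waiting for handshake"
    const result = await ssh.execCommand(command);
    return { ok: true, jobId: job.id, stdout: result.stdout };
  } catch (err) {
    // Record the failure on the job instead of rethrowing it.
    const message = err instanceof Error ? err.message : String(err);
    job.events.push({ type: "JOB_FAILED", message }); // hypothetical event type
    return { ok: false, jobId: job.id, error: message };
  } finally {
    // Always release the connection; ignore errors raised during cleanup.
    try {
      ssh.dispose();
    } catch (_) {
      /* nothing useful to do here */
    }
  }
}
```

With something like this, the caller would check `ok` and move on to the next job when an HPC is unreachable, so one bad connection only fails the job that needed it.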

alexandermichels commented 9 months ago

#105 is currently being tested to solve this issue.

alexandermichels commented 6 months ago

@JTSIV1 This is solved by https://github.com/cybergis/cybergis-compute-core/pull/108, correct? Is the only thing left to do to have the SDK catch the error?