cybergis / cybergis-compute-core

Apache License 2.0
7 stars 6 forks source link

Exponential backoff implementation #105

Closed JTSIV1 closed 4 months ago

JTSIV1 commented 7 months ago
alexandermichels commented 5 months ago

Hey @JTSIV1 , great job so far. We discussed the issue being that the job doesn't emit the job end event.

From looking over the code, I think maybe what we need to do is in the portion of the code that says "we waited too long give up" (this if: https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR147) is to emit the JOB_FAILED event ourselves with a message like "Failed to connect to HPC".

Not sure which of these two syntaxes used elsewhere in the supervisor should be used: (1) self.emitter.registerEvents (https://github.com/cybergis/cybergis-compute-core/blob/v2/src/Supervisor.ts#L68) or (2) this.maintainerMasterEventEmitter.emit(...) (https://github.com/cybergis/cybergis-compute-core/blob/3247c1c80d082086f712c7a02e74035abb195080/src/Supervisor.ts#L191). Spend a bit of time looking into the differences between those two and maybe @zimo-xiao can offer a bit of insight.

If this works you should also be able to remove the || exit part in this line (https://github.com/cybergis/cybergis-compute-core/pull/105/files#diff-5483e642678bd804cbce6f7ada5dbc64abb41ec9df389d9986471455ba9ac1fcR195) because the job fail event being emitted should flip the job to isEnd.

Something interesting to check with the current code: does your current "job fail" show a job fail event in the database? I'm guessing not because it seems that the Emitter is the one updating that in the DB. So one way to check if this change is working is to pull up the database and go to the events table to look for these JOB_FAILED/JOB_ENDED events.

alexandermichels commented 4 months ago

Abandoning this effort in favor of: https://github.com/cybergis/cybergis-compute-core/pull/108