CovertLab / sisyphus

Eternally execute tasks

timeout Sisyphus storage ops and Docker pulls #26

Open 1fish2 opened 4 years ago

1fish2 commented 4 years ago

Sisyphus GCS downloads got stuck in #24 due to a bug: a missing Google Client library. The worker would get stuck, never complete, and never shut itself down. [Why didn't the missing library throw an exception?]

That bug is fixed, but we ought to bound the damage that file downloads, file uploads, and Docker image pulls can cause when they stall, e.g. if the remote server is slow.

Triggers:

  1. A request from Gaia (via Kafka) to terminate the current task.
  2. A timer expires.

Approaches from easiest to most robust:

  1. In Sisyphus, check these triggers before each file transfer or Docker image pull. This is straightforward other than picking the timeout duration and whether it should be per file or total. It would handle most cases but not the bug that caused #24.
  2. In Sisyphus, do the file transfers and Docker pull in separate threads and be prepared to kill them on these triggers. The file cleanup code might need to be more careful.
  3. Make Gaia able to delete a stuck worker node, esp. once it becomes responsible for starting and stopping Sisyphus workers.
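
To make approach 2 concrete, here is a minimal sketch, written in Python purely for illustration (this thread shows none of Sisyphus's code). The names `run_bounded`, `operation`, and `terminate` are hypothetical. It assumes the terminate request from Gaia has already been translated into a `threading.Event`. Note that Python cannot forcibly kill a thread, so this sketch only stops waiting and abandons the worker, which is exactly why the file cleanup code would need more care.

```python
import threading
import time


def run_bounded(operation, timeout_seconds, terminate, poll_seconds=0.05):
    """Run operation() in a daemon thread, giving up on either trigger.

    Returns ("done", result), ("timeout", None), or ("terminated", None).
    An abandoned operation keeps running in the background until the
    process exits, so callers must not assume its files are complete.
    """
    outcome = {}

    def target():
        try:
            outcome["result"] = operation()
        except Exception as e:  # hand the failure back to the caller
            outcome["error"] = e

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    deadline = time.monotonic() + timeout_seconds

    while worker.is_alive():
        # Trigger 1: a terminate request (hypothetically set by a Kafka consumer).
        if terminate.is_set():
            return ("terminated", None)
        # Trigger 2: the timer expired.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return ("timeout", None)
        worker.join(min(poll_seconds, remaining))

    if "error" in outcome:
        raise outcome["error"]
    return ("done", outcome["result"])
```

For a Docker pull done by shelling out, running the command as a child process and killing that process on a trigger would give a real kill rather than an abandoned thread.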

1fish2 commented 4 years ago

BTW, I'm inclined to implement the first alternative, and not urgently.
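
For reference, the first alternative could look roughly like the sketch below. It is in Python only for illustration, and `Triggers`, `TaskAborted`, and `transfer_all` are hypothetical names, not existing Sisyphus code. It uses a total timeout rather than a per-file one, which is one of the two options still to be decided. It checks both triggers before each transfer, so it cannot interrupt a transfer that is already stuck.

```python
import threading
import time


class TaskAborted(Exception):
    """Raised when a trigger fires before the next step starts."""


class Triggers:
    """Tracks the two triggers: a terminate request and a total-time deadline."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + timeout_seconds
        self._terminate = threading.Event()

    def request_termination(self):
        # Would be called by whatever handles the terminate message from Gaia.
        self._terminate.set()

    def check(self):
        if self._terminate.is_set():
            raise TaskAborted("terminate requested")
        if self._clock() >= self._deadline:
            raise TaskAborted("timer expired")


def transfer_all(files, transfer, triggers):
    """Call transfer(f) for each file, checking the triggers before each one."""
    done = []
    for f in files:
        triggers.check()  # only checked between transfers, never during one
        transfer(f)
        done.append(f)
    return done
```

The same `triggers.check()` call would go immediately before a Docker image pull.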