codeamp / circuit

CodeAmp API. Built with Golang, GraphQL, GORM and Socket-IO
Apache License 2.0

Jobs have no timeout or error handling #443

Open aballman opened 5 years ago

aballman commented 5 years ago

Describe the bug

One-shot jobs have no timeout and no error checking. If a job's pod cannot come online, CodeAmp waits on it forever. The result is that a Kubernetes worker gets stuck and must be destroyed. Unfortunately, there is currently no way to know which job is running on which worker, so destroying a worker is a dangerous operation, ideally done only when no other deploys are in progress (or when all of them are stuck).