perdasilva opened this issue 4 years ago
A credential error happens with the data-puller. We cannot see this error unless we use `kubectl logs`. Usually it is fixed by deleting and restarting the pod.

Attachment: mrcnn_singlenode.toml.txt
Yeah, it seems the problem is kube2iam: https://github.com/jtblin/kube2iam/issues/136 - I'll post a PR with a workaround.
OK, so I've had a deeper look (even though I haven't been able to reproduce the issue yet). The puller is an init container. According to the k8s documentation: "If a Pod’s init container fails, Kubernetes repeatedly restarts the Pod until the init container succeeds. However, if the Pod has a restartPolicy of Never, Kubernetes does not restart the Pod." If the benchmark pod is not being restarted after the init container fails, this can only mean that the restartPolicy is Never.
@haohanchen-yagao it would be good if you could describe the pod when this issue happens, so we can verify the restart policy. If it is indeed Never, we should also update the horovod job template in the executor to set the restart policy to OnFailure.
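For reference, the restart policy can be checked with `kubectl describe pod <pod-name>` (or `kubectl get pod <pod-name> -o yaml`). A rough equivalent using the official Kubernetes Python client, with placeholder pod name and namespace, would be:

```python
# Sketch: inspect the restartPolicy and init-container state of the benchmark pod.
# Requires the official client: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

pod = core.read_namespaced_pod(name="benchmark-pod-xyz", namespace="default")  # placeholders
print("restartPolicy:", pod.spec.restart_policy)  # "Never" would explain the pod never recovering
for status in pod.status.init_container_statuses or []:
    print(status.name, status.state)  # shows how/why the data-puller init container terminated
```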
Currently, it's not necessarily clear to the user if an error has occurred along the way. We'll use this ticket to track how we can increase this visibility.
Note: Status messages should be emitted to this status topic.
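For illustration, emitting such a status message to a Kafka topic could look roughly like the following (using kafka-python; the broker address, topic name, and message fields are placeholders, not the project's actual conventions):

```python
# Sketch only: broker address, topic name, and message schema are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("status-topic", {
    "service": "executor",
    "level": "ERROR",
    "message": "Data puller failed: could not fetch credentials",
})
producer.flush()
```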
There are a few different sources for errors, including, but not limited to:
On the Python side, most (if not all) services extend the KafkaService class, which contains a utility method that can be used to emit status messages.
For the bff, there is a function that can be used to emit status events.
My suggestion is to go through each of the services and ensure that whichever method handles an event catches any errors appropriately and emits a helpful error status message.
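A rough sketch of that pattern (the class names and the `emit_status` method below are stand-ins; the actual utility method on KafkaService may be named differently):

```python
# Sketch of the suggested pattern. KafkaServiceStub stands in for the real
# KafkaService base class, and emit_status for its status-emitting utility.
class KafkaServiceStub:
    def emit_status(self, level: str, message: str) -> None:
        # The real implementation would produce this message to the status topic.
        print(f"[{level}] {message}")


class ExampleEventHandler(KafkaServiceStub):
    def handle_event(self, event: dict) -> None:
        try:
            self._process(event)
            self.emit_status("INFO", f"Event {event.get('id')} processed successfully")
        except Exception as err:
            # Surface the failure as a status message instead of leaving it
            # only in the pod logs.
            self.emit_status("ERROR", f"Failed to process event {event.get('id')}: {err}")
            raise

    def _process(self, event: dict) -> None:
        raise RuntimeError("simulated failure")  # placeholder for the real work


if __name__ == "__main__":
    try:
        ExampleEventHandler().handle_event({"id": "example-event"})
    except RuntimeError:
        pass  # the error status has already been emitted
```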