awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

[Improvement] Increase error state visibility to end users #996

Open perdasilva opened 4 years ago

perdasilva commented 4 years ago

Currently, it's not necessarily clear to the user if an error has occurred along the way. We'll use this ticket to track how we can increase this visibility.

Note: Status messages should be emitted to this status topic.

There are a few different sources for errors, including, but not limited to:

On the Python side, most (if not all) services extend the KafkaService class, which contains a utility method that can be used to emit status messages.

For the bff, there is a function that can be used to emit status events.

My suggestion is to go through each of the services and ensure that whichever method is handling an event appropriately catches any errors and emits a helpful error status message.
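To make the suggestion concrete, here is a minimal sketch of the pattern: wrap each event handler so that any exception is reported as an error status before being re-raised. Everything named here is hypothetical (`with_error_status`, `record_status`, the `"ERROR"` status string, and the event layout); the real services would pass whatever status-emitting method KafkaService or the bff actually provides, whose signature is not shown in this ticket.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical emitter signature: (event, status, message) -> None.
# In a real service this would be the existing status-emitting utility.
StatusEmitter = Callable[[Dict[str, Any], str, str], None]


def with_error_status(emit_status: StatusEmitter):
    """Decorate an event handler so failures are surfaced as status messages."""

    def decorator(handler: Callable[[Dict[str, Any]], Any]):
        @functools.wraps(handler)
        def wrapped(event: Dict[str, Any]) -> Any:
            try:
                return handler(event)
            except Exception as err:
                # Tell the user what failed, then re-raise so the service's
                # normal error handling and logging still run.
                emit_status(event, "ERROR", f"{handler.__name__} failed: {err}")
                raise

        return wrapped

    return decorator


# Stand-in emitter that records messages instead of publishing to Kafka.
emitted: List[Tuple[str, str, str]] = []


def record_status(event: Dict[str, Any], status: str, message: str) -> None:
    emitted.append((event["id"], status, message))


@with_error_status(record_status)
def handle_event(event: Dict[str, Any]) -> str:
    # Hypothetical validation step standing in for real handler logic.
    if "toml" not in event:
        raise ValueError("missing toml descriptor")
    return "ok"
```

Re-raising after emitting keeps the existing logs intact, so the status message is an addition to them rather than a replacement.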

haohanchen-aws commented 4 years ago

A credential error happens with the data-puller (see attached image). We cannot see this error unless we use kubectl logs. Usually this is fixed by deleting and restarting the pod. Attached: mrcnn_singlenode.toml.txt

perdasilva commented 4 years ago

Yeah, it seems the problem is kube2iam: https://github.com/jtblin/kube2iam/issues/136 - I'll post a PR with a workaround

perdasilva commented 4 years ago

OK - so I've had a deeper look (even though I haven't been able to reproduce the issue yet). The puller is an init container. According to the k8s documentation: "If a Pod’s init container fails, Kubernetes repeatedly restarts the Pod until the init container succeeds. However, if the Pod has a restartPolicy of Never, Kubernetes does not restart the Pod". If the benchmark pod is stuck initializing after the puller fails, this can only mean that the restartPolicy is Never.

@haohanchen-yagao it would be good if you could describe the pod when this issue happens, so we can try to verify the restart policy. If it is indeed Never, we should also update the horovod job template in the executor to set the restart policy to OnFailure.
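As a rough aid for that check, the sketch below pulls the relevant fields out of the JSON printed by `kubectl get pod <pod-name> -o json`. The field names (`spec.restartPolicy`, `status.phase`, `status.initContainerStatuses`) are from the Kubernetes Pod API; the helper name `summarize_pod` and the sample pod, including its names and the `CrashLoopBackOff` reason, are made up for illustration and are not output from the affected cluster.

```python
import json
from typing import Any, Dict, List


def summarize_pod(pod: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the restart policy and init container states from a Pod object."""
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    init_containers: List[Dict[str, Any]] = []
    for container in status.get("initContainerStatuses", []):
        # "state" holds a single key: waiting, running, or terminated.
        state = container.get("state", {})
        init_containers.append(
            {
                "name": container.get("name"),
                "state": next(iter(state), "unknown"),
                "restarts": container.get("restartCount", 0),
            }
        )
    return {
        "restartPolicy": spec.get("restartPolicy"),
        "phase": status.get("phase"),
        "initContainers": init_containers,
    }


# Illustrative input only, shaped like `kubectl get pod <pod-name> -o json`.
SAMPLE_POD_JSON = """
{
  "metadata": {"name": "example-benchmark-pod"},
  "spec": {"restartPolicy": "Never"},
  "status": {
    "phase": "Pending",
    "initContainerStatuses": [
      {
        "name": "data-puller",
        "restartCount": 0,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}}
      }
    ]
  }
}
"""

summary = summarize_pod(json.loads(SAMPLE_POD_JSON))
print(summary["restartPolicy"], summary["initContainers"])
```

If `restartPolicy` comes back as Never for an affected pod, that would support changing the horovod job template to OnFailure.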