awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

[Customer Request] Trouble Running Script Mode Benchmark #890

Open Chancebair opened 5 years ago

Chancebair commented 5 years ago

The Palo Alto team is having issues running a benchmark. The executor pod is stuck in "Pending": [image]. Logs: [image]

Toml and benchmark script here: https://drive.corp.amazon.com/personal/chanbair/Anubis

gavinmbell commented 5 years ago

If there are no appropriate resources in the AZ they have been set to use, that usually leads to hanging executions.
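One way to check whether a Pending pod is waiting on resources is to read the PodScheduled condition from the pod's JSON (as printed by kubectl get pod <pod-name> -o json). The sketch below is only an illustration: the sample pod, including its scheduler message, is made up and is not taken from this issue.

```python
def unschedulable_reason(pod: dict):
    """Return the scheduler's message if the pod is Pending and unscheduled, else None."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        # The scheduler records why it could not place the pod in this condition.
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return cond.get("message") or cond.get("reason")
    return None


# Hypothetical pod, shaped like the output of `kubectl get pod <pod-name> -o json`.
sample_pod = {
    "metadata": {"name": "example-executor"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu.",
            }
        ],
    },
}

print(unschedulable_reason(sample_pod))
```

If this returns None for a Pending pod, the pod has probably been scheduled already and is waiting on something else, such as an image pull or an init container.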

gavinmbell commented 5 years ago

The first thing I would do is have them explicitly set the AZ ID to an AZ that has the resources they are looking to use. I am not sure how that is done, but I know it is doable; other AWS-savvy engineers knew how to do that mapping.

It can be seen that they have set the AZ ID to aws_zone_id="use1-az6". Can they confirm the presence of the resources they need there?
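A rough sketch of how that mapping and check could be done with boto3, assuming credentials for the account are available. The helper names are hypothetical, and the sample responses are made up to mirror the shape of the EC2 API output. The AZ ID to AZ name mapping differs between accounts, and an offering only means the instance type is sold in that AZ, not that capacity is currently free.

```python
def zone_name_for_id(az_response: dict, zone_id: str):
    """Look up the account-specific AZ name (e.g. us-east-1b) for an AZ ID."""
    for zone in az_response.get("AvailabilityZones", []):
        if zone.get("ZoneId") == zone_id:
            return zone.get("ZoneName")
    return None


def zone_offers_type(offerings_response: dict, zone_id: str, instance_type: str) -> bool:
    """True if the instance type is offered in the given AZ ID."""
    return any(
        o.get("Location") == zone_id and o.get("InstanceType") == instance_type
        for o in offerings_response.get("InstanceTypeOfferings", [])
    )


def check_with_boto3(zone_id: str, instance_type: str, region: str = "us-east-1"):
    """Query EC2 for real data; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    zones = ec2.describe_availability_zones()
    offerings = ec2.describe_instance_type_offerings(
        LocationType="availability-zone-id",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return (
        zone_name_for_id(zones, zone_id),
        zone_offers_type(offerings, zone_id, instance_type),
    )


# Made-up responses, for illustration only.
sample_zones = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "ZoneId": "use1-az4"},
        {"ZoneName": "us-east-1b", "ZoneId": "use1-az6"},
    ]
}
sample_offerings = {
    "InstanceTypeOfferings": [
        {
            "InstanceType": "p3dn.24xlarge",
            "LocationType": "availability-zone-id",
            "Location": "use1-az6",
        }
    ]
}

print(zone_name_for_id(sample_zones, "use1-az6"))
print(zone_offers_type(sample_offerings, "use1-az6", "p3dn.24xlarge"))
```

Against a real account, check_with_boto3("use1-az6", "p3dn.24xlarge") would return the AZ name for that account and whether the instance type is offered there.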

haohanchen-aws commented 5 years ago

I think we have p3dn.24xlarge in use1-az6 (which is us-east-1b). What we're trying to do is run a single-node experiment on Anubis. The status gets stuck on "job is pending initialization" and then the job fails: [image]

The metrics on Grafana were as follows: [image] [image]

The pods looked like this: a pod gets stuck on init, then turns to error, and a new pod appears in init again (marked in red): [image]

The description of the pod is: [image] [image] [image] [image]

The p3dn.24xlarge instance has been created; it can also be found in Isengard: [image]

I'm running this in basic mode (not script mode; I was trying script mode because I thought the problem was the path to the training script). The toml file is https://drive.corp.amazon.com/personal/chehaoha

perdasilva commented 5 years ago

I would also suggest running kubectl logs <pod-name> -c benchmark
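Since the pod reportedly sticks on init before erroring, it may also help to see which init container has not completed and pull its logs the same way. A minimal sketch, assuming the pod JSON from kubectl get pod <pod-name> -o json; the pod and container names in the sample are hypothetical, not the ones Anubis actually uses.

```python
def unfinished_init_containers(pod: dict):
    """Return (container name, state summary) for init containers that have not succeeded."""
    results = []
    for cs in pod.get("status", {}).get("initContainerStatuses", []):
        state = cs.get("state", {})
        if "waiting" in state:
            results.append((cs["name"], state["waiting"].get("reason", "Waiting")))
        elif "running" in state:
            results.append((cs["name"], "still running"))
        elif "terminated" in state and state["terminated"].get("exitCode", 0) != 0:
            results.append((cs["name"], f"exit code {state['terminated']['exitCode']}"))
    return results


def log_commands(pod: dict):
    """Build one kubectl logs command per unfinished init container."""
    pod_name = pod["metadata"]["name"]
    return [
        f"kubectl logs {pod_name} -c {container}"
        for container, _ in unfinished_init_containers(pod)
    ]


# Hypothetical pod status for illustration.
sample_pod = {
    "metadata": {"name": "example-benchmark-pod"},
    "status": {
        "phase": "Pending",
        "initContainerStatuses": [
            {"name": "script-puller", "state": {"terminated": {"exitCode": 0}}},
            {"name": "data-puller", "state": {"terminated": {"exitCode": 1}}},
        ],
    },
}

for command in log_commands(sample_pod):
    print(command)
```

Only containers that are waiting, still running, or exited non-zero are listed, so a pod whose init containers all succeeded yields no commands.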