Open Chancebair opened 5 years ago
If there are no appropriate resources in the AZ they have been set to use, that usually leads to hanging executions.
The first thing I would do is have them explicitly set the AZ ID to an AZ that has the resources they are looking to use. I am not sure how that is done, but I know it is doable; other AWS-savvy engineers knew how to do that mapping.
It can be seen that they have set the AZ ID to aws_zone_id="use1-az6". Can they confirm the presence of the resources they need there?
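One way to do that check is sketched below. It is a minimal sketch, not something taken from Anubis itself: it assumes boto3 is installed and credentials for the target account are configured, and the helper names `zone_name_for_id` and `is_instance_type_offered` are made up for illustration. Note that the AZ ID to AZ name mapping differs per AWS account, and that an instance type being offered in an AZ does not guarantee capacity is available at launch time.

```python
def zone_name_for_id(ec2, zone_id):
    """Return the account-specific AZ name (e.g. us-east-1b) for an AZ ID, or None."""
    resp = ec2.describe_availability_zones(ZoneIds=[zone_id])
    zones = resp.get("AvailabilityZones", [])
    return zones[0]["ZoneName"] if zones else None


def is_instance_type_offered(ec2, instance_type, zone_id):
    """Return True if the instance type is offered in the AZ with the given ID."""
    resp = ec2.describe_instance_type_offerings(
        LocationType="availability-zone-id",
        Filters=[
            {"Name": "location", "Values": [zone_id]},
            {"Name": "instance-type", "Values": [instance_type]},
        ],
    )
    offerings = resp.get("InstanceTypeOfferings", [])
    return any(o["InstanceType"] == instance_type for o in offerings)


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# print(zone_name_for_id(ec2, "use1-az6"))
# print(is_instance_type_offered(ec2, "p3dn.24xlarge", "use1-az6"))
```

If the second call returns False, the instance type is not offered in that AZ at all and a different aws_zone_id is needed.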
I think we have p3dn.24xlarge in use1-az6 (which is us-east-1b). What we're trying to do is run a single-node experiment on Anubis. The status gets stuck on "job is pending initialization" and then fails.
The metrics on Grafana showed the following:
The pods looked like this: a pod gets stuck on init, then turns to error, and a new pod appears and goes into init again (marked in red).
The description of the pod is:
The p3dn.24xlarge instance has been created; it can also be found in Isengard.
I'm running this in basic mode (not script mode; I had tried script mode earlier because I thought the problem was the path to the training script). The toml file is https://drive.corp.amazon.com/personal/chehaoha
I would also suggest running kubectl logs on the failing pod.
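For a pod that is stuck in init, the usual commands are kubectl describe pod and kubectl logs with -c pointing at the init container. A minimal sketch of a wrapper follows; the pod, namespace, and container names in the usage comment are placeholders, not the real ones from this cluster, and the helper names are made up for illustration.

```python
import subprocess


def describe_pod_cmd(pod, namespace="default"):
    """Build the argv for `kubectl describe pod`, which shows events and init container status."""
    return ["kubectl", "describe", "pod", pod, "-n", namespace]


def pod_logs_cmd(pod, namespace="default", container=None, previous=False):
    """Build the argv for `kubectl logs`, optionally for one (init) container."""
    cmd = ["kubectl", "logs", pod, "-n", namespace]
    if container:
        cmd += ["-c", container]  # select a specific container, e.g. an init container
    if previous:
        cmd.append("--previous")  # logs from the previous, crashed instance of the container
    return cmd


def run(cmd):
    """Run a command and return its stdout and stderr as text; requires kubectl on PATH."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


# Usage (placeholder names, needs cluster access):
# print(run(describe_pod_cmd("my-benchmark-pod", "my-namespace")))
# print(run(pod_logs_cmd("my-benchmark-pod", "my-namespace", container="my-init-container")))
```

The describe output lists the init containers by name, which is what to pass as the container argument.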
The Palo Alto team is having issues running a benchmark. The executor pod is stuck in "Pending" Logs:
Toml and benchmark script here: https://drive.corp.amazon.com/personal/chanbair/Anubis