aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacing training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.
MIT No Attribution
337 stars, 184 forks

Sagemaker frozen after printing "Checkpoint> Saving in path=['./checkpoint/agent/0_Step-0.ckpt']" #53

Closed yifan-w closed 3 years ago

yifan-w commented 3 years ago

Hi there. I'm running on Ubuntu 18.04 with an Nvidia Tesla P4 GPU. I've managed to get the containers running for dr-start-training with the default configs in local mode. However, the log froze after printing this message:

Checkpoint> Saving in path=['./checkpoint/agent/0_Step-0.ckpt']

I checked the checkpoint path and none of the files had changed (neither size nor modified time), while the python process stayed at 100% CPU.
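A minimal sketch of how that check could be scripted, for anyone trying to reproduce it. The helper names `snapshot` and `changed_files` are hypothetical, and the directory you pass in is whatever checkpoint path your setup uses; nothing here is part of deepracer-for-cloud itself.

```python
import os


def snapshot(root):
    """Map every file under root to its (size in bytes, mtime in ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                # The file disappeared between listing and stat; skip it.
                continue
            state[path] = (st.st_size, st.st_mtime_ns)
    return state


def changed_files(before, after):
    """Return the sorted paths that were added, removed, or modified."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Taking one snapshot, waiting a minute or so, and taking a second one gives an empty list from `changed_files` if nothing is being written, which is what I observed by hand.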

I've traced this log message to line 61 of training_worker.py:

# save initial checkpoint
graph_manager.save_checkpoint()

but couldn't debug further from there.
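One generic way to see where a busy Python process is stuck is the standard-library `faulthandler` module. The sketch below is an assumption about how it could be wired in, not something taken from this repo: `enable_stack_dumps` is a hypothetical helper, and calling it near the start of training_worker.py is only a guess at a sensible place.

```python
import faulthandler
import signal
import sys


def enable_stack_dumps(timeout_s=None, stream=sys.stderr):
    """Make the process able to print Python tracebacks while it appears hung.

    Returns True if a SIGUSR1 handler was registered (not available on Windows).
    """
    # Dump tracebacks on fatal errors such as segfaults.
    faulthandler.enable(file=stream)

    registered = False
    if hasattr(signal, "SIGUSR1"):
        # `kill -USR1 <pid>` will then print the stack of every thread.
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
        registered = True

    if timeout_s is not None:
        # Print all stacks every timeout_s seconds until cancelled.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)

    return registered
```

With this in place, sending SIGUSR1 to the process from inside the container should show which call under `save_checkpoint()` is spinning, assuming the hang is in Python code rather than inside a native extension.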

Any idea what could be causing this problem?

Thanks

larsll commented 3 years ago

@yifan-w - for general support I would suggest heading over to deepracing.io and the Slack community.

yifan-w commented 3 years ago

> @yifan-w - for general support I would suggest heading over to deepracing.io and the Slack community.

Got it. Thanks.
