Closed yifan-w closed 3 years ago
@yifan-w - for general support I would suggest to head over to deepracing.io and the Slack community.
Got it. Thanks.
Hi there. I'm running Ubuntu 18.04 with an NVIDIA Tesla P4 GPU. I've managed to get the containers running for dr-start-training with the default configs in local mode. However, the log froze after printing this message:
I checked the checkpoint path and none of the files had changed (neither size nor modified time), while the python process stays at 100% CPU.
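For anyone who wants to repeat the checkpoint check, here is a minimal sketch of one way to do it in Python. It only compares file sizes and modification times between two snapshots of a directory. The function names are made up for illustration and are not part of the training code.

```python
import os
import time


def snapshot(path):
    """Map every file under path to its (size, mtime_ns) pair."""
    state = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                # File disappeared between listing and stat; skip it.
                continue
            state[full] = (st.st_size, st.st_mtime_ns)
    return state


def changed_files(before, after):
    """Return sorted paths that were added, removed, or modified."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))


def watch_once(path, interval_seconds=60):
    """Take two snapshots interval_seconds apart and report the differences."""
    before = snapshot(path)
    time.sleep(interval_seconds)
    return changed_files(before, snapshot(path))
```

Calling `watch_once` on the checkpoint directory returns an empty list when nothing was written during the interval, which matches what I saw.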
I've traced this log message to line 61 of training_worker.py
but couldn't debug further from there.
Any idea what could be causing this problem?
Thanks