Digging deeper, the GPU under-utilization is probably due to broken builds resulting from the rather ancient version of TensorFlow being used.
The TF binaries appear to target version 1.15, judging by major code changes in the core libraries. However, these builds also appear to target CUDA 11.4, which TF 1.15 does not support given its age.
My digging led me to https://github.com/tensorflow/tensorflow/commit/28feb4df0d4ab386946bdee1a0e5c36cc58246cf, which is a decent starting point for a hacky patch, but probably not a good long-term solution.
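For anyone else digging into this, a quick way to confirm what the shipped binary can actually do is to run something like the snippet below inside the training container (a minimal sketch against the TF 1.x API; the exact container entry point is an assumption):

```python
# Sanity-check the TensorFlow build and GPU visibility (TF 1.x API, as implied
# by the 1.15-era binaries discussed above).
import tensorflow as tf

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available(cuda_only=True))

# Log device placement; ops silently falling back to the CPU show up here.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.random.uniform([1024, 1024])
    b = tf.random.uniform([1024, 1024])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))
```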
Thanks for raising this. Happy to support alternative TF builds if they will make a significant impact; the right place for those PRs would be https://github.com/aws-deepracer-community/deepracer-simapp and https://github.com/aws-deepracer-community/deepracer-sagemaker-container.
Now - one massive reason for GPU under-utilization is the way DeepRacer / reinforcement learning is set up. First you collect a set of episodes (20 by default), and only then are the networks updated; the SageMaker-assigned GPU does anything at all only during that network update. Looking at total elapsed time, this is not a major factor. Many of us train with ancient GPUs that have plenty of VRAM, like the K40, K80 and M40, for exactly that reason -- raw GPU performance does not really make a difference, as long as you can offload most of those calculations to the GPU.
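To make that concrete, the cycle looks roughly like this (illustrative pseudocode with hypothetical names, not the actual DeepRacer / RL Coach code):

```python
NUM_EPISODES_PER_ITERATION = 20  # DeepRacer default mentioned above

def training_loop(policy, simulator, num_iterations):
    for _ in range(num_iterations):
        # Phase 1: rollout in the simulator (CPU-bound) - the GPU sits idle here.
        episodes = [simulator.collect_episode(policy)
                    for _ in range(NUM_EPISODES_PER_ITERATION)]

        # Phase 2: network update - the only phase where the
        # SageMaker-assigned GPU does real work.
        policy.update(episodes)
```

Most wall-clock time is spent in phase 1, which is why average GPU utilization stays low regardless of how fast the card is.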
Closing due to no activity.
Hi, I'm running a full local install and am running into the following log message when initializing training:

Unless I'm misinterpreting this log, it appears that the GPU is failing to be fully utilized for some computation paths. During training, GPU memory consumption does increase, but load on the GPU itself appears to be rather low as reported by nvidia-smi. The host has a 3080 Ti and is running Ubuntu 22.04 with driver 530.
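For reference, the utilization pattern can be confirmed by polling nvidia-smi while training runs (a simple illustrative loop, not part of the DeepRacer tooling):

```python
# Print GPU compute load and memory use once per second.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,utilization.gpu,memory.used",
         "--format=csv,noheader"]

while True:
    print(subprocess.check_output(QUERY, text=True).strip())
    time.sleep(1)
```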