aws-deepracer-community / deepracer-for-cloud

Creates an AWS DeepRacer training environment which can be deployed in the cloud, or locally on Ubuntu Linux, Windows or Mac.

Sagemaker Container (TensorFlow) Fails to Fully Utilize GPU #132

Closed: gosselind1 closed this issue 1 year ago

gosselind1 commented 1 year ago

Hi, I'm running a full local install and am seeing the following log message when initializing training:

## Created agent: agent
## Stop physics after creating graph
## Creating session
Creating regular session
2023-04-16 04:16:59.381165: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Identity: GPU CPU XLA_CPU XLA_GPU 
VariableV2: CPU 
Assign: GPU CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  main_level/agent/main/online/Variable (VariableV2) /device:GPU:0
  main_level/agent/main/online/Variable/Assign (Assign) /device:GPU:0
  main_level/agent/main/online/Variable/read (Identity) /device:GPU:0
  main_level/agent/main/online/Assign (Assign) /device:GPU:0

2023-04-16 04:16:59.381819: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Identity: GPU CPU XLA_CPU XLA_GPU 
VariableV2: CPU 
Assign: GPU CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  main_level/agent/main/target/Variable (VariableV2) /device:GPU:0
  main_level/agent/main/target/Variable/Assign (Assign) /device:GPU:0
  main_level/agent/main/target/Variable/read (Identity) /device:GPU:0
  main_level/agent/main/target/Assign (Assign) /device:GPU:0

Unless I'm misinterpreting this log, it appears that the GPU is not being fully utilized for some computation paths.

During training, GPU memory consumption does increase, but the load on the GPU itself appears rather low as reported by nvidia-smi.

The host has an RTX 3080 Ti and runs Ubuntu 22.04 with driver 530.
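
As a sanity check, a minimal TF 1.x snippet like the one below (an assumption on my part: run inside the Sagemaker container with the bundled TensorFlow) should confirm whether this build can see the GPU at all:

```python
# Minimal GPU visibility check for the bundled TF 1.x (assumed to be run inside
# the Sagemaker container). Prints the devices TensorFlow can see and whether
# the installed build was compiled with CUDA support.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())
for device in device_lib.list_local_devices():
    print(device.device_type, device.name)
```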

gosselind1 commented 1 year ago

Digging deeper, the GPU under-utilization is probably due to broken builds resulting from the rather ancient version of TensorFlow being used.

The TF binaries appear to target version 1.15, due to major code changes made to core libraries. However, these builds of TF also appear to target CUDA 11.4, which stock TF 1.15 does not support because of the version's age.

My digging led me to https://github.com/tensorflow/tensorflow/commit/28feb4df0d4ab386946bdee1a0e5c36cc58246cf, which is a decent starting point for a hacky patch, but probably not a good long-term solution.
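
To see which ops actually fall back to the CPU, TF 1.x can log per-op device placement. A minimal sketch with a toy graph (the variable and op names are illustrative, not the real DeepRacer graph):

```python
# Sketch: make TF 1.x log where each op is placed, to confirm which ops fall
# back to the CPU. The tiny graph below is only a stand-in for the real
# DeepRacer graph.
import tensorflow as tf

with tf.Graph().as_default():
    with tf.device("/device:GPU:0"):
        v = tf.Variable(tf.zeros([1024, 1024]), name="probe_variable")
        out = tf.matmul(v, v)

    config = tf.ConfigProto(log_device_placement=True,  # print op -> device mapping
                            allow_soft_placement=True)  # fall back to CPU instead of failing
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(out)
```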

larsll commented 1 year ago

Thanks for raising this. I'm happy to support alternative TF builds if it will make a significant impact; the right place for those PRs would be https://github.com/aws-deepracer-community/deepracer-simapp and https://github.com/aws-deepracer-community/deepracer-sagemaker-container.

Now, one major reason for GPU under-utilization is the way DeepRacer / reinforcement learning is set up: first you collect a set of episodes (20 by default), then the networks are updated. Only during the network update does the Sagemaker-assigned GPU do anything at all, and looking at total elapsed time this is not a major factor. Many of us train with ancient GPUs that have a solid amount of VRAM, like the K40, K80 and M40, for that reason: raw GPU performance does not really make a difference, as long as you can offload most of those calculations to the GPU.
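
In pseudo-code, the loop is roughly the following (heavily simplified; all names are placeholders, not the actual DeepRacer/Coach code):

```python
# Heavily simplified illustration of the loop described above: episode
# collection is simulator-bound and leaves the GPU idle; only the periodic
# network update uses the Sagemaker-assigned GPU. All names are placeholders.

EPISODES_BETWEEN_TRAINING = 20  # the default mentioned above

def training_loop(policy, simulator, iterations):
    for _ in range(iterations):
        # Phase 1: collect rollouts in the simulator; the GPU is mostly idle here.
        episodes = [simulator.run_episode(policy)
                    for _ in range(EPISODES_BETWEEN_TRAINING)]

        # Phase 2: update the networks; the only phase where the GPU does real
        # work, and usually a small share of total wall-clock time.
        policy.update(episodes)
```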

larsll commented 1 year ago

Closing due to no activity.