ChristinaLK opened 4 years ago
That is an interesting question and I don't know the answer.
I assume that TensorFlow won't be able to see the other GPUs, so GPU0 will be the first GPU that it can see.
I added the following lines to the script to see which GPUs TensorFlow can see.
```python
import tensorflow as tf

print("GPU Devices:")
print(tf.config.list_physical_devices('GPU'))
```
It is in queue and I'll report back when it is done.
The test showed that `tf.device("/gpu:0")` refers to the first GPU that HTCondor has assigned, not the first GPU on the server. So this is working correctly.
In case it is useful, here is more detail.
The code I wrote above to print the list of devices only works for TensorFlow 2.0+; for TensorFlow 1.4 it had to be changed to the following:
```python
from tensorflow.python.client import device_lib

local_device_protos = device_lib.list_local_devices()
gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
print(gpus)
```
The output from this job (12717904.0) was

```
['/device:GPU:0']
```
In the stderr file, TensorFlow says that it assigned pci bus id `0000:5e:00.0` to GPU:0.
```
...
2020-03-11 21:51:15.974258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10312 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
...
```
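If it helps anyone reproduce this check, here is a stdlib-only sketch (the helper name and regex are mine, not from TensorFlow) that pulls the PCI bus id out of a device-creation log line like the one above:

```python
import re

# A TensorFlow "Created TensorFlow device" log line, as seen in the job's stderr.
log_line = (
    "2020-03-11 21:51:15.974258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10312 MB memory) "
    "-> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, "
    "compute capability: 7.5)"
)

def extract_pci_bus_id(line):
    """Return the PCI bus id from a TF device-creation log line, or None if absent."""
    match = re.search(r"pci bus id: ([0-9a-fA-F:.]+)", line)
    return match.group(1) if match else None

print(extract_pci_bus_id(log_line))  # 0000:5e:00.0
```

That bus id is what you can then grep for in `condor_status` to see which physical GPU TensorFlow actually got.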
When we look at the GPUs on the machine, we see that this PCI bus id belongs to CUDA1, not CUDA0.
```
$ condor_status -long gitter2002.chtc.wisc.edu | grep -i 0000:5e:00.0
CUDA1DevicePciBusId = "0000:5E:00.0"
```
So, to sum up: this server has 4 GPUs (CUDA0, CUDA1, CUDA2, CUDA3). HTCondor assigned this job the second GPU (CUDA1), and TensorFlow mapped that to GPU:0.
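To make that physical-to-logical renumbering concrete, here is a stdlib-only sketch (no TensorFlow needed; the function name is mine) of how the CUDA runtime interprets `CUDA_VISIBLE_DEVICES`: only the listed physical devices are exposed, in order, as logical devices 0, 1, ...:

```python
def logical_to_physical(cuda_visible_devices):
    """Map logical GPU indices (what TF calls GPU:0, GPU:1, ...) to physical
    device indices, the way the CUDA runtime reads CUDA_VISIBLE_DEVICES.
    Returns None when the variable is unset (identity mapping, all GPUs visible)."""
    if cuda_visible_devices is None:
        return None
    return [int(d) for d in cuda_visible_devices.split(",") if d.strip()]

# HTCondor handed this job physical GPU 1 (CUDA1) on a 4-GPU server:
mapping = logical_to_physical("1")
print(mapping)     # [1]  -> logical GPU:0 is physical device 1 (CUDA1)
```

This is exactly the behavior observed in the job above: `CUDA_VISIBLE_DEVICES="1"` makes CUDA1 appear to TensorFlow as `/device:GPU:0`.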
Let me know if you need anything else.
awesome, thanks @sameerd ! I'll pass this on.
Thanks @sameerd for looking into this; Christina asked about this on my behalf. Do you recommend that users always use `tf.device` to instruct TF to only use the GPU that HTCondor has allocated to the job? I have someone whose TF log output looks like TF is trying to use all the GPUs on the machine the job landed on.
@jmvera255 `tf.device` is mainly used to determine whether computations are placed on the CPU or the GPU. Looking through the logs of a test case, TensorFlow only sees the GPUs that are allocated to its own job, and `tf.device("/gpu:0")` will be the first of these.
If someone's logs look like TensorFlow was trying to use all the GPUs, then check the `CUDA_VISIBLE_DEVICES` environment variable. TensorFlow automatically reads this variable to know which GPUs to map to `gpu:0`, so maybe this variable is incorrect? I'm not sure what else could be causing TensorFlow to use all the GPUs.
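One quick diagnostic (just a sketch, not an official CHTC recipe) is to print the variable at the top of the job script and warn if it is unset, since an unset variable means the CUDA runtime will expose every GPU on the machine:

```python
import os

def check_gpu_visibility(environ=os.environ):
    """Report which GPUs the CUDA runtime will expose to this job."""
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Unset means the CUDA runtime exposes every GPU on the machine.
        return "WARNING: CUDA_VISIBLE_DEVICES is unset; all GPUs are visible"
    return "CUDA_VISIBLE_DEVICES = " + visible

print(check_gpu_visibility())
```

If this prints the warning inside a job slot, then HTCondor (or something in the job wrapper) is not setting the variable, which would explain TensorFlow grabbing every GPU.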
https://github.com/CHTC/templates-GPUs/blob/21c9139a9c24013e84cc62f10b3deb20eec8c740/docker/tensorflow_python/test_tensorflow.py#L19
Does this line make this script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (i.e. gpu device 3 instead of gpu device 0)?
@sameerd