CHTC / templates-GPUs

Template job submissions using GPUs in CHTC

Question about device option #6

Open · ChristinaLK opened this issue 4 years ago

ChristinaLK commented 4 years ago

https://github.com/CHTC/templates-GPUs/blob/21c9139a9c24013e84cc62f10b3deb20eec8c740/docker/tensorflow_python/test_tensorflow.py#L19

Does this line make this script ALWAYS use the "first" GPU on a server? What if HTCondor has assigned you a different one (i.e. gpu device 3 instead of gpu device 0)?

@sameerd

sameerd commented 4 years ago

That is an interesting question and I don't know the answer.

I assume that TensorFlow won't be able to see the other GPUs, so GPU:0 will be the first GPU it can see.

I added the following lines to the script to see which GPUs TensorFlow can see.

print("GPU Devices:")
print(tf.config.list_physical_devices('GPU'))

The job is in the queue and I'll report back when it is done.

sameerd commented 4 years ago

The test showed that tf.device("/gpu:0") will refer to the first GPU that HTCondor has assigned and not the first GPU on the server. So this is working correctly.
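
For anyone who wants to check this in their own job, here is a minimal placement test (a sketch, assuming TensorFlow 2.x eager mode; under 1.x you would run the op inside a session):

import tensorflow as tf

# Pin a small computation to the first GPU visible to this process.
# Under HTCondor that is the first *assigned* GPU, not the first GPU on the server.
with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

print(b.device)  # reports the device that actually holds the result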

In case it is useful, here is more detail.

The code I wrote above to print the list of devices only works with TensorFlow 2.0+; for TensorFlow 1.4 it had to be changed to the following:

from tensorflow.python.client import device_lib

# List every device this TensorFlow process can see and keep only the GPUs.
local_device_protos = device_lib.list_local_devices()
gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
print(gpus)

The output from this job (12717904.0) was:

['/device:GPU:0']

In the stderr file, TensorFlow says that it assigned PCI bus ID 0000:5e:00.0 to GPU:0:

...
2020-03-11 21:51:15.974258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10312 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
...

When we look at the GPUs on the machine, we see that this PCI bus ID belongs to CUDA1, not CUDA0:

$ condor_status -long gitter2002.chtc.wisc.edu | grep -i 0000:5e:00.0
CUDA1DevicePciBusId = "0000:5E:00.0"

So, to sum up: this server has 4 GPUs (CUDA0, CUDA1, CUDA2, CUDA3). HTCondor assigned this job the second GPU, i.e. CUDA1, and TensorFlow mapped it to GPU:0.
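
If anyone wants to double-check this mapping from inside a job, the same device_lib API used above also exposes the physical description (a sketch; physical_device_desc is the field where TensorFlow records the "pci bus id" string that appears in the stderr log):

from tensorflow.python.client import device_lib

# Print each visible GPU together with its physical description,
# which includes the PCI bus id shown in the stderr excerpt above.
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.name, '->', d.physical_device_desc)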

Let me know if you need anything else.

ChristinaLK commented 4 years ago

Awesome, thanks @sameerd! I'll pass this on.

jmvera255 commented 4 years ago

Thanks @sameerd for looking into this; Christina asked about this on my behalf. Do you recommend that users always use tf.device to instruct TensorFlow to only use the GPU that HTCondor has allocated to the job? I have someone whose TensorFlow log output looks like it is trying to use all the GPUs on the machine the job landed on.

sameerd commented 4 years ago

@jmvera255 tf.device is mainly used to determine whether computations are placed on the CPU or the GPU. Looking through the logs of a test case, TensorFlow only sees the GPUs that are allocated to its own job, and tf.device("/gpu:0") will be the first of these.
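
To illustrate the distinction (another sketch, assuming TensorFlow 2.x eager mode): tf.device changes where an op runs among the devices TensorFlow can already see; it does not change which GPUs are visible.

import tensorflow as tf

# Placement, not visibility: both devices below are already visible to the job.
with tf.device("/cpu:0"):
    c = tf.range(4.0) * 2.0   # forced onto the CPU
with tf.device("/gpu:0"):
    g = tf.range(4.0) * 2.0   # forced onto the first visible GPU

print(c.device)
print(g.device)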

If someone's logs look like TensorFlow was trying to use all the GPUs, then either:

  1. The server they are running on is misconfigured. According to the HTCondor docs, it is supposed to set an environment variable called CUDA_VISIBLE_DEVICES, which TensorFlow automatically reads to know which GPUs to map to gpu:0. So maybe this variable is incorrect? (See the sketch below for a quick way to check.)
  2. They actually requested all the GPUs in the submit file?

I'm not sure what else could be causing TensorFlow to use all the GPUs.
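
If it helps with debugging possibility 1, the variable is easy to inspect from inside the job itself (a sketch; the variable name comes from the HTCondor docs, and its exact format can vary between HTCondor versions):

import os

# HTCondor is supposed to set this to the GPU(s) assigned to the job;
# TensorFlow reads it to decide which physical GPUs become gpu:0, gpu:1, ...
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))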