LiyaoTang / contrastBoundary

Contrastive Boundary Learning for Point Cloud Segmentation (CVPR2022)
MIT License

Where to set the gpu device ID? #14

Closed whuhxb closed 2 years ago

whuhxb commented 2 years ago

@LiyaoTang Hi, where do I set the GPU device ID? Thanks.

LiyaoTang commented 2 years ago

Hi,

Passing it on the command line will do, e.g. --gpu 0,1,2,3 uses the first 4 GPUs.

whuhxb commented 2 years ago

@LiyaoTang I checked with the nvidia-smi command and GPU cards with 16 GB are available, but the following error still occurs.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gpu_0/model/resnet_backbone/res1_input_conv/L2Loss: node gpu_0/model/resnet_backbone/res1_input_conv/L2Loss (defined at /export/home/hanxiaobing/Documents/PlaneNet_PlaneRCNN/DGCNN_PointNet2/SensatUrban/contrastBoundary/tensorflow_sensaturban/models/basic_operators.py:128) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1, /job:localhost/replica:0/task:0/device:XLA_GPU:2, /job:localhost/replica:0/task:0/device:XLA_GPU:3, /job:localhost/replica:0/task:0/device:XLA_GPU:4 ]. Make sure the device specification refers to a valid device. [[gpu_0/model/resnet_backbone/res1_input_conv/L2Loss]]

LiyaoTang commented 2 years ago

Could you show me the command you use to run the code, as well as the output of nvidia-smi? (The latter should also be logged at the beginning of the log.)

whuhxb commented 2 years ago

@LiyaoTang I use the following command:

srun -p DGXq -n 1 -w node20 python main.py -c config.sensaturban.conv_0 --gpu 6

srun -p DGXq -n 1 -w node20 nvidia-smi

Thu Jun 2 17:13:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   42C    P0   137W / 300W |  15648MiB / 16160MiB |     48%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   49C    P0   219W / 300W |  15664MiB / 16160MiB |     63%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   45C    P0   120W / 300W |  14428MiB / 16160MiB |     55%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   38C    P0   140W / 300W |  15652MiB / 16160MiB |     60%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   41C    P0   180W / 300W |  14404MiB / 16160MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   55C    P0   259W / 300W |   9724MiB / 16160MiB |     52%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   43C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   55C    P0   258W / 300W |  15458MiB / 16160MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
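In the nvidia-smi output above, GPU 6 is the only idle card (0MiB used, 0% utilization). As a minimal sketch of how that check could be automated, the snippet below parses the CSV form of the same information. The nvidia-smi query flags used are real, but the helper names (parse_gpu_csv, idle_gpus, query_gpus) and the idle thresholds are assumptions made for illustration and are not part of this repository. The sample rows reuse three lines of the table above.

```python
import subprocess

# Real nvidia-smi options: print one CSV row per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]

# Three rows copied from the table above, in the CSV layout produced by QUERY.
SAMPLE = "5, 9724, 16160, 52\n6, 0, 16160, 0\n7, 15458, 16160, 99\n"


def parse_gpu_csv(text):
    """Turn 'index, used, total, util' lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        # Assumes every field is a plain integer (no "[N/A]" entries).
        index, used, total, util = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "used_mib": used,
                     "total_mib": total, "util_pct": util})
    return rows


def idle_gpus(rows, max_used_mib=100, max_util_pct=5):
    """Return indices of GPUs below both (hypothetical) idle thresholds."""
    return [r["index"] for r in rows
            if r["used_mib"] <= max_used_mib and r["util_pct"] <= max_util_pct]


def query_gpus():
    """Run nvidia-smi on the current node and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Uses the sample rows so the sketch runs without a GPU.
    print(idle_gpus(parse_gpu_csv(SAMPLE)))
```

On a cluster, query_gpus() would have to run on the compute node itself (for example through srun), since the login node may see different cards.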

whuhxb commented 2 years ago

@LiyaoTang I have sent the log file via e-mail to you. Thanks.

LiyaoTang commented 2 years ago

Hi, this is a subtle point: to select GPU 6 only, you need to specify --gpu ,6 (with a leading comma).

With your command (--gpu 6), the config is set to use the first 6 GPUs (0-5); you can check this in the log.
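The distinction between a GPU count and a GPU list can be sketched as follows. This is a minimal illustration of the behaviour described in this thread (a bare number n selects the first n GPUs, while any value containing a comma is read as a list of device IDs). The function names parse_gpu_arg and visible_devices_string are hypothetical, and this is not the repository's actual argument parser.

```python
def parse_gpu_arg(value):
    """Hypothetical parser mirroring the behaviour described in this thread.

    "6"    is a GPU count: use the first 6 devices (0-5)
    ",6"   is a device list: use only device 6
    "6,7"  is a device list: use devices 6 and 7
    """
    value = value.strip()
    if "," in value:
        # Any comma marks the value as an explicit list of device IDs;
        # empty fields (as in ",6") are dropped.
        return [int(tok) for tok in value.split(",") if tok.strip()]
    # No comma: the number means "how many GPUs", counted from device 0.
    return list(range(int(value)))


def visible_devices_string(value):
    """Format the selected IDs the way CUDA_VISIBLE_DEVICES expects them."""
    return ",".join(str(i) for i in parse_gpu_arg(value))


if __name__ == "__main__":
    for arg in ["6", ",6", "6,7", "0,1,2,3"]:
        print(repr(arg), "->", parse_gpu_arg(arg))
```

Under this reading, --gpu 6 and --gpu ,6 select entirely different cards, which is why the leading comma matters when only one specific GPU is wanted.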

whuhxb commented 2 years ago

@LiyaoTang Hi, if I just want to use GPU cards 6 and 7, I should specify --gpu 6,7, right?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:5A:00.0 Off |                    0 |
| N/A   34C    P0    84W / 250W |   8723MiB / 16160MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   31C    P0    72W / 250W |  15599MiB / 16160MiB |     34%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   37C    P0    97W / 250W |  15707MiB / 16160MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:66:00.0 Off |                    0 |
| N/A   26C    P0    37W / 250W |   1349MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:B5:00.0 Off |                    0 |
| N/A   51C    P0   171W / 250W |  16117MiB / 16160MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   63C    P0   185W / 250W |  16117MiB / 16160MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   23C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   24C    P0    26W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

But with this setting, I still get the following error.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation gpu_0/model/resnet_backbone/res1_input_conv/L2Loss: node gpu_0/model/resnet_backbone/res1_input_conv/L2Loss (defined at /export/home/hanxiaobing/Documents/PlaneNet_PlaneRCNN/DGCNN_PointNet2/SensatUrban/contrastBoundary/tensorflow_sensaturban/models/basic_operators.py:128) was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device. [[gpu_0/model/resnet_backbone/res1_input_conv/L2Loss]]

Errors may have originated from an input operation. Input Source operations connected to node gpu_0/model/resnet_backbone/res1_input_conv/L2Loss: model/resnet_backbone/res1_input_conv/weights/read (defined at /export/home/hanxiaobing/Documents/PlaneNet_PlaneRCNN/DGCNN_PointNet2/SensatUrban/contrastBoundary/tensorflow_sensaturban/models/basic_operators.py:96)

It seems that the designated GPU cards are not used. I have no idea how to specify the GPU cards. Thanks.
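One detail in the error message above is worth isolating: the operation was assigned to /device:GPU:0, while the list of available devices contains only CPU, XLA_CPU and XLA_GPU entries. The sketch below extracts the device types from such a message so that the mismatch is easy to see. The helper names are hypothetical and the sample string is a shortened copy of the message above. A device list with XLA_GPU entries but no plain GPU entry is commonly reported when TensorFlow detects the cards but cannot load a matching CUDA runtime; that diagnosis is an assumption here and is not confirmed anywhere in this thread.

```python
import re

# Shortened copy of the error text quoted above.
SAMPLE_ERROR = (
    "node gpu_0/model/resnet_backbone/res1_input_conv/L2Loss was explicitly "
    "assigned to /device:GPU:0 but available devices are [ "
    "/job:localhost/replica:0/task:0/device:CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. "
    "Make sure the device specification refers to a valid device."
)


def device_types(error_text):
    """Return the set of device types listed as available in the message."""
    # Only look inside the bracketed "available devices are [ ... ]" list,
    # so the requested /device:GPU:0 before it is not counted.
    match = re.search(r"available devices are \[(.*?)\]", error_text, re.S)
    listed = match.group(1) if match else ""
    return set(re.findall(r"/device:([A-Z_]+):\d+", listed))


def has_plain_gpu(error_text):
    """True if a regular GPU device (not only XLA_GPU) is available."""
    return "GPU" in device_types(error_text)


if __name__ == "__main__":
    print(sorted(device_types(SAMPLE_ERROR)))
    print("plain GPU available:", has_plain_gpu(SAMPLE_ERROR))
```

Note that the second message lists exactly two XLA_GPU devices, which is consistent with --gpu 6,7 having restricted the visible cards; the failure is in how those cards are exposed to TensorFlow, not necessarily in which cards were selected.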

LiyaoTang commented 2 years ago

Hi, could you verify that the provided model, e.g. s3dis.conv_0, runs normally on the specified device? Did you change anything in the config file?

whuhxb commented 2 years ago

Hi @LiyaoTang,

I will give it a try. One more question: how do I obtain the ground-truth boundary for point clouds?

Thanks.