Closed arainbilal closed 4 years ago
Hello, unfortunately I have no idea why this may be going on. It seems to be a TensorFlow-related problem, though. When I first developed Bonnet I had to implement the multi-GPU training myself, but now TensorFlow has a lot of neat tools to do this automatically. I imagine that somewhere between 1.9 and 1.15 backward compatibility of some API was broken, but unfortunately I don't know the details, since I have been working with PyTorch for the last 2 years. Sorry for not being able to be of more help :/
Thanks; it certainly looks like a TF issue. I ran into a similar one when upgrading from 1.13 to 1.14, related to the device query. I am interested in investigating further and will report back if I can find the reason, or I may end up using some other version. There is a slight difference in my setup: I am using TF 1.15 compiled with CUDA 10.1, whereas this version is usually tested against CUDA 10.0. I don't see an apparent problem there, but I have a few more tests to do. I will close this issue after I have some more results or conclusions, which may be helpful for someone else.
Closing remarks: I have tested my TF/CUDA setup with DeepLab and found no problems there. This means I might have to customize a few things in Bonnet to keep backward compatibility. Since these local changes could not be generalized to other setups or Bonnet users, I am closing this issue.
Overview: This issue occurs when calling conv2d_transpose in upsample_layer under layers.py.
System config: TensorFlow 1.15.2, Ubuntu 18.04, CUDA 10.1, Bazel 0.26.1
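For context, here is a self-contained call of the same kind as the failing one: a stride-2 transposed convolution that doubles spatial resolution. This is a hedged sketch assuming the TF 1.x API (reachable as `tf.compat.v1` under TF 2.x); the shapes and the `upsample_2x` name are illustrative, not Bonnet's actual layer code.

```python
# Minimal sketch of the kind of call upsample_layer makes: a transposed
# convolution that doubles spatial resolution. Assumes the TF 1.x graph API
# via tf.compat.v1; names and shapes are illustrative, not from Bonnet.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def upsample_2x(inputs, out_channels):
    """Stride-2 transposed conv: (N, H, W, C) -> (N, 2H, 2W, out_channels)."""
    return tf.layers.conv2d_transpose(
        inputs, filters=out_channels, kernel_size=4,
        strides=2, padding="same")

x = tf.placeholder(tf.float32, [None, 16, 16, 8])
y = upsample_2x(x, out_channels=4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, {x: np.zeros((1, 16, 16, 8), np.float32)})
print(out.shape)  # (1, 32, 32, 4)
```

If a snippet like this runs on its own but fails inside the full graph, the problem is more likely in device placement than in conv2d_transpose itself.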
Source of this error: I think it has something to do with the variable-assign op being placed on the GPU. I created the following test to verify whether I can use the GPU/CPU to assign variables:
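The original test snippet is not preserved in this copy of the thread. A minimal sketch of such a device-assignment check might look like the following (assuming the TF 1.x graph API via `tf.compat.v1`; the helper name `check_assign_on_device` is illustrative, not from Bonnet):

```python
# Sketch of a device-placement check for variable assignment.
# Assumes the TF 1.x graph API (available as tf.compat.v1 under TF 2.x).
# The helper name check_assign_on_device is illustrative, not from Bonnet.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def check_assign_on_device(device="/cpu:0"):
    """Assign a variable on the given device and return the assigned value."""
    graph = tf.Graph()
    with graph.as_default():
        with tf.device(device):
            v = tf.get_variable("v", shape=[],
                                initializer=tf.zeros_initializer())
            assign_op = v.assign(3.0)
        init = tf.global_variables_initializer()
    # Disable soft placement so a bad device string fails loudly,
    # and log placements to see where each op actually lands.
    config = tf.ConfigProto(allow_soft_placement=False,
                            log_device_placement=True)
    with tf.Session(graph=graph, config=config) as sess:
        sess.run(init)
        return sess.run(assign_op)

print(check_assign_on_device("/cpu:0"))  # 3.0; try "/gpu:0" on a GPU machine
```

With `allow_soft_placement=False`, an op that cannot run on the requested device raises an error instead of silently falling back to the CPU, which is what makes this useful for isolating placement problems.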
The test code works fine. Any help/guidance to solve this issue will be appreciated. I can confirm that this issue does not occur when using TensorFlow 1.9; it only happens after upgrading to 1.15.2.