Closed: hosseinfani closed this issue 2 years ago
@VaghehDashti I changed the title to better explain the work you're doing. I'll create another issue for vBnn with negative sampling.
`Train on 313450 samples, validate on 55315 samples
Epoch 1/2
2021-10-26 23:42:07.876038: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.876698: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.876858: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.877069: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.877236: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.877347: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.877453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.878056: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-10-26 23:42:07.878085: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2021-10-26 23:42:07.880985: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.6/dist-packages/keras/engine/training.py:2470: UserWarning: Model.state_updates will be removed in a future version. This property should not be used in TensorFlow 2.0, as updates are applied automatically.
warnings.warn('Model.state_updates will be removed in a future version. '
313450/313450 - 27052s - loss: 3542.8094 - val_loss: 2542.8243
Epoch 2/2
313450/313450 - 27863s - loss: 1215.2498 - val_loss: 39.8931`
vBnn's code does not utilize the GPUs: according to the log, TensorFlow cannot load the CUDA libraries and skips registering GPU devices. Either TFL's code has issues or there is an issue with the server.
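To tell a server problem apart from a code problem, one option is to check whether the dynamic loader can open the CUDA libraries at all, independently of TensorFlow. The snippet below is a minimal sketch using only the standard library; the library names are copied from the `dso_loader` warnings in the log, and `missing_libs` is a hypothetical helper, not part of vBnn or TFL.

```python
import ctypes

# Library names copied from the dso_loader warnings in the log.
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.11",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def missing_libs(names):
    """Return the names that the dynamic loader cannot open (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # same lookup path as dlopen, honours LD_LIBRARY_PATH
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_libs(CUDA_LIBS):
        print("cannot load:", name)
```

If this prints the same names as the log, the issue is most likely the server's CUDA installation or `LD_LIBRARY_PATH` rather than the model code.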
TODO: run with 20 epochs.
The GPU utilization problem has been solved; however, an out-of-memory (OOM) error now occurs when the data is being transferred to the GPU.
55743 is the number of experts in the dblp dataset after filtering.
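A rough back-of-the-envelope estimate helps explain the OOM. The sketch below assumes the targets are stored as a dense float32 matrix with one column per expert and that the 313450 training samples from the Keras log belong to the same run; both are assumptions, and the batch sizes are only illustrative.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed for a dense n_rows x n_cols array (float32 by default)."""
    return n_rows * n_cols * itemsize


N_TRAIN = 313450   # training samples reported in the Keras log (assumed same run)
N_EXPERTS = 55743  # experts in dblp after filtering

# Size of the whole target matrix if it is moved to the GPU in one piece.
full = dense_bytes(N_TRAIN, N_EXPERTS)
print(f"full matrix: {full / 2**30:.1f} GiB")

# Size of a single mini-batch of targets (illustrative batch sizes).
for batch_size in (16, 32, 64):
    per_batch = dense_bytes(batch_size, N_EXPERTS)
    print(f"batch of {batch_size}: {per_batch / 2**20:.1f} MiB")
```

Under these assumptions the full matrix is on the order of tens of GiB, while a mini-batch takes only a few MiB, so transferring the data batch by batch (or keeping it sparse until the batch is built) may avoid the error.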
@VaghehDashti Would you please not paste the whole error log in the issue pages and simply attach it as a log file instead? Thank you.
I changed VAE.py to utilize all the GPUs, which prevents the OOM issues. However, I am now getting two different errors depending on the mini-batch size: dblp-test-16.log dblp-test-32.log (also with mini-batch size 64)
I am trying to run Radin's code on his dataset on the server and getting unexpected errors, even though I can run his code on my workstation without any errors. radin-gpu-64.log radin-gpu-4951.log radin-gpu-1.log
I think it's better to drop Radin's implementation and implement the variational neural network ourselves. See issue #75.
This is the tutorial I followed to implement variational Bayesian neural networks with PyTorch: https://joshfeldman.net/WeightUncertainty/ Weight Uncertainty in Neural Networks.zip
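For readers unfamiliar with the approach, the core idea of weight uncertainty is that each weight is sampled as w = mu + sigma * eps, with sigma = log(1 + exp(rho)) and eps drawn from a standard normal. The sketch below illustrates this for a single scalar weight in plain Python rather than PyTorch; the function names are hypothetical, and the standard normal prior is an assumption chosen for simplicity, not necessarily what our implementation uses.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps sigma positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_weight(mu, rho, rng=random):
    """Draw w = mu + sigma * eps with sigma = softplus(rho), eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + softplus(rho) * eps


def log_gaussian(w, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at w."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - (w - mu) ** 2 / (2 * sigma ** 2))


rng = random.Random(0)
mu, rho = 0.5, -3.0  # variational parameters of one weight (illustrative values)
w = sample_weight(mu, rho, rng)

# Single-sample estimate of the complexity term log q(w) - log p(w),
# assuming a standard normal prior p(w) = N(0, 1).
kl_sample = log_gaussian(w, mu, softplus(rho)) - log_gaussian(w, 0.0, 1.0)
print(w, kl_sample)
```

In a full model the same sampling is applied to every weight, and this term is added to the data likelihood loss for each mini-batch.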
Radin's code has issues reproducing the results, so we decided to code it from scratch. See #75.
We study the effect of negative sampling on variational neural nets