fani-lab / OpeNTF

Neural machine learning methods for Team Formation problem.

Reproduce Rad et al's (vBnn) baseline on all datasets #23

Closed hosseinfani closed 2 years ago

hosseinfani commented 2 years ago

We study the effect of negative sampling on variational neural nets

hosseinfani commented 2 years ago

@VaghehDashti I changed the title to better describe the work you're doing. I will create another issue for vBnn with negative sampling.

VaghehDashti commented 2 years ago

`Train on 313450 samples, validate on 55315 samples

Epoch 1/2

2021-10-26 23:42:07.876038: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.876698: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.876858: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.877069: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.877236: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.877347: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.877453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.878056: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-10-26 23:42:07.878085: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...

2021-10-26 23:42:07.880985: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py:2470: UserWarning: Model.state_updates will be removed in a future version. This property should not be used in TensorFlow 2.0, as updates are applied automatically.

warnings.warn('Model.state_updates will be removed in a future version. '

313450/313450 - 27052s - loss: 3542.8094 - val_loss: 2542.8243

Epoch 2/2

313450/313450 - 27863s - loss: 1215.2498 - val_loss: 39.8931`

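vBnn's code does not utilize the GPUs: the CUDA libraries fail to load (see the `Could not load dynamic library` warnings), and each epoch takes about 27,000 s. Either TFL's code has issues or there is an issue with the server.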
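One way to narrow this down is to check, outside TensorFlow, whether the shared libraries named in the warnings can be loaded at all. The following is a minimal sketch using only Python's standard `ctypes` module. The helper name `probe_libraries` is hypothetical, and the library list is copied from the log above.

```python
import ctypes

# Library names copied from the "Could not load dynamic library" warnings.
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.11",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def probe_libraries(names):
    """Return {name: bool} telling whether each shared library can be loaded."""
    result = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same dynamic loader search path
            result[name] = True
        except OSError:
            result[name] = False
    return result


if __name__ == "__main__":
    for name, ok in probe_libraries(CUDA_LIBS).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If every entry is reported missing, the cause is more likely the server's CUDA installation or `LD_LIBRARY_PATH` than the model code, although this check alone cannot rule out other problems.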

VaghehDashti commented 2 years ago

TODO: with 20 epochs

VaghehDashti commented 2 years ago

The problem of not being able to utilize the GPUs has been solved. However, an out-of-memory (OOM) error now occurs when the data is being transferred to the GPU.

log.txt

55743 is the number of experts in the dblp dataset after filtering.
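To get a rough sense of why moving the data to the GPU can run out of memory, here is a back-of-the-envelope sketch. It assumes (this is not confirmed by the attached log) that the targets are held as a dense float32 matrix with one row per training sample and one column per expert. The sample count is taken from the training log earlier in this thread, which may come from a different run, and `dense_bytes` and `batch_ranges` are hypothetical helper names.

```python
def dense_bytes(n_rows, n_cols, itemsize=4):
    """Bytes needed for a dense n_rows x n_cols array (itemsize=4 for float32)."""
    return n_rows * n_cols * itemsize


def batch_ranges(n_rows, batch_size):
    """Yield (start, stop) row ranges so only one mini-batch is dense at a time."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


if __name__ == "__main__":
    n_samples = 313450  # assumption: training samples from the earlier log
    n_experts = 55743   # experts in dblp after filtering
    gib = 1024 ** 3
    print(f"full matrix:     {dense_bytes(n_samples, n_experts) / gib:.1f} GiB")
    print(f"one batch of 64: {dense_bytes(64, n_experts) / gib:.4f} GiB")
```

Under that assumption, keeping the data sparse on the host and densifying one mini-batch at a time keeps the per-step footprint proportional to the batch size rather than to the whole dataset.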

hosseinfani commented 2 years ago

@VaghehDashti Would you please not paste the whole error log in the issue pages and simply attach it as a log file instead? Thank you.

VaghehDashti commented 2 years ago

I changed VAE.py to utilize all the GPUs, which prevents the OOM issue. However, I am now getting two different errors depending on the mini-batch size: dblp-test-16.log dblp-test-32.log (also with mini-batch size 64)

VaghehDashti commented 2 years ago

I am trying to run Radin's code on his dataset on the server and am getting unexpected errors. (I was able to run his code on my workstation without any errors!) radin-gpu-64.log radin-gpu-4951.log radin-gpu-1.log

VaghehDashti commented 2 years ago

I think it's better to drop Radin's implementation and implement the variational NN ourselves. See issue #75.

VaghehDashti commented 2 years ago

This is the tutorial I followed to implement variational Bayesian neural networks with PyTorch: https://joshfeldman.net/WeightUncertainty/ Weight Uncertainty in Neural Networks.zip
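For readers unfamiliar with the approach, the core idea of "Weight Uncertainty in Neural Networks" (Bayes by Backprop) is to keep a mean and a scale for every weight and to sample the weight on each forward pass. Below is a minimal pure-Python sketch of that sampling step for a single weight. It is not the tutorial's PyTorch code: as a simplifying assumption it uses a single zero-mean Gaussian prior with a closed-form KL term, and all function names are hypothetical.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real rho to a positive sigma."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_weight(mu, rho, rng):
    """Reparameterization: w = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + softplus(rho) * eps


def kl_to_gaussian_prior(mu, rho, prior_sigma=1.0):
    """Closed-form KL( N(mu, sigma^2) || N(0, prior_sigma^2) ) for one weight."""
    sigma = softplus(rho)
    return (
        math.log(prior_sigma / sigma)
        + (sigma ** 2 + mu ** 2) / (2.0 * prior_sigma ** 2)
        - 0.5
    )


if __name__ == "__main__":
    rng = random.Random(0)
    mu, rho = 0.3, -2.0  # illustrative values, not trained parameters
    draws = [sample_weight(mu, rho, rng) for _ in range(5)]
    print("sampled weights:", draws)
    print("KL penalty:", kl_to_gaussian_prior(mu, rho))
```

In a full model the training loss would combine the data likelihood with the sum of such KL terms over all weights, and `mu` and `rho` would be learned by gradient descent through the sampled weight.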

hosseinfani commented 2 years ago

We ran into issues reproducing the results with Radin's code, so we decided to implement it from scratch. See #75.