I'm having trouble running the baseline solution. I was able to synthesize the full dataset and store it on an SSD in my system. I am using an NVIDIA Titan XP (GP102).
My System Software Versions:
CUDA Version: 11.4
Python: 3.8.10
pip: 20.0.2
numpy: 1.23.5
soundfile: 0.12.1
librosa: 0.8.1
configparser: 5.3.0
pandas: 1.5.0
torch: 1.13.1
lava: 0.4.1
lava-dl: 0.3.3
h5py: 3.8.0
pillow: 9.5.0
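For reference, here is a quick way to confirm what PyTorch itself reports about the GPU and its CUDA build (a minimal check I put together, not part of the baseline script; note that torch.version.cuda is the CUDA version PyTorch was compiled against, which can differ from the system toolkit's 11.4):

import torch

# Report what this PyTorch build sees. torch.version.cuda is the CUDA
# version the wheel was built against, not the system toolkit version.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))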
Output
When I run python3 train_sdnn.py from inside the baseline_solution/sdnn_delays/ directory, the script detects my GPU and begins running. It gets all the way to loss.backward(), where it hangs forever. While it is running up to that point, I can monitor the GPU: device memory is allocated (the noisy and clean data are sent to the device) and utilization spikes to ~74%, presumably from the score, loss, and gradient calculations. As soon as it reaches the loss.backward() line, GPU utilization drops to 0%, and CPU utilization does not appear to be any higher than normal system activity. I am unsure why it hangs on loss.backward(); I let it sit there for a few days before killing the process.

Has anyone else encountered this, or does anyone know what might be happening here?
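In case it helps with a diagnosis, here is the kind of minimal diagnostic I can add the next time it hangs (a sketch on my part, not something from train_sdnn.py): launching with CUDA_LAUNCH_BLOCKING=1 python3 train_sdnn.py makes kernel launches synchronous so the hang surfaces at the offending call, and a periodic stack dump shows where Python is stuck:

import faulthandler
import sys

# Dump the tracebacks of all threads to stderr every 5 minutes, repeating
# until the process exits, so a hang leaves a visible trail of where it is.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

With that near the top of the training script, I should at least be able to report the exact frame it is stuck in.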