IntelLabs / IntelNeuromorphicDNSChallenge

Intel Neuromorphic DNS Challenge

Running `python3 train_sdnn.py` on baseline solution hangs forever #24

Open BujSet opened 11 months ago

BujSet commented 11 months ago

I am having trouble running the baseline solution. I was able to synthesize the full dataset and store it on an SSD in my system. I am using an NVIDIA Titan XP (GP102).

My System Software Versions:

- CUDA: 11.4
- Python: 3.8.10
- pip: 20.0.2
- numpy: 1.23.5
- soundfile: 0.12.1
- librosa: 0.8.1
- configparser: 5.3.0
- pandas: 1.5.0
- torch: 1.13.1
- lava: 0.4.1
- lava-dl: 0.3.3
- h5py: 3.8.0
- pillow: 9.5.0
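
For reference, this is roughly the throwaway snippet I use to confirm that this PyTorch build actually sees the card (it is just for illustration, not part of the challenge repo):

```python
# Sanity check (not from the repo): confirm PyTorch sees the GPU and
# report which CUDA toolkit this torch build was compiled against.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```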

Output

When I run `python3 train_sdnn.py` from the `baseline_solution/sdnn_delays/` directory, the script detects my GPU and begins running. It makes it all the way to `loss.backward()`, where it hangs forever. While it is running up to that point, I can monitor the GPU and see that device memory is being allocated (the noisy and clean data are sent to the device) and utilization spikes to ~74% (I'm guessing from the score, loss, and gradient calculations). Once it reaches the `loss.backward()` line, GPU utilization drops to 0% (and CPU utilization does not appear higher than normal system behavior). I am unsure why it hangs on `loss.backward()`; I let it sit there for a few days before killing the process. A small isolation test I tried is sketched below.
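
To check whether the hang is specific to the challenge code or to my CUDA/driver/PyTorch setup, I ran a minimal standalone backward pass on the GPU (my own sketch, unrelated to the lava-dl network in `train_sdnn.py`). If something like this also stalls at `backward()`, the problem is presumably in my local setup rather than in the baseline:

```python
# Minimal isolation test: forward + backward on a tiny dense network on the GPU.
# This does NOT use lava-dl; it only exercises plain PyTorch autograd on CUDA.
import torch

device = torch.device('cuda')
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 1),
).to(device)

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 1, device=device)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()           # train_sdnn.py hangs at the equivalent call
torch.cuda.synchronize()  # force any pending CUDA kernels to finish
print("backward() completed, loss =", loss.item())
```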

Has anyone else encountered this, or does anyone know what might be happening here?