DavidDiazGuerra / gpuRIR

Python library for Room Impulse Response (RIR) simulation with GPU acceleration
GNU Affero General Public License v3.0

gpuassert cuda error on larger number of source positions #13

Closed mirbagheri closed 4 years ago

mirbagheri commented 4 years ago

I'm trying to simulate RIRs in a 4 x 4 x 4 room with fully reflective walls, with a single receiver at the center of the room and 2600 source positions randomly chosen on the unit sphere centered at [2, 2, 2].

gpuRIR was installed using pip on a machine with 128 GB of CPU memory and an Nvidia Titan Xp GPU with 12 GB of memory, running Ubuntu 18.

These are the parameters I use to run gpuRIR.simulateRIR: beta = [0.] * 6, nb_img = [2] * 3, Tmax = 0.0066, Tdiff = None, mic_pattern = 'omni'.
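Roughly, the call looks like this (the sphere sampling and variable names are just illustrative):

import numpy as np
import gpuRIR

room_sz = [4, 4, 4]                 # room dimensions in meters
beta = [0.] * 6                     # reflection coefficients of the six walls
nb_img = [2] * 3                    # number of image sources in each dimension
Tmax = 0.0066                       # RIR length in seconds
fs = 44100                          # sampling frequency in Hz

pos_rcv = np.array([[2., 2., 2.]])  # single receiver at the room center

# 2600 source positions drawn uniformly on the unit sphere centered at [2, 2, 2]
n_src = 2600
v = np.random.randn(n_src, 3)
pos_src = 2. + v / np.linalg.norm(v, axis=1, keepdims=True)

rirs = gpuRIR.simulateRIR(room_sz, beta, pos_src, pos_rcv, nb_img, Tmax, fs,
                          Tdiff=None, mic_pattern='omni')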

The gpuRIR.simulateRIR function sometimes fails with the error: GPUassert: invalid argument /tmp/pip-req-build-93zmrk8r/src/gpuRIR_cuda.cu 795

The error goes away if I reduce the pos_src chunk size to 2000 x 3, even though GPU memory usage never exceeds 200 MB and the utilization rate always stays below 4-5%. Any idea what might be causing this issue?
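The chunking workaround looks roughly like this (sketch only, reusing the variables from the snippet above):

chunk_size = 2000
rirs = []
for i in range(0, pos_src.shape[0], chunk_size):
    # Simulate each chunk of source positions separately and stack the results
    rirs.append(gpuRIR.simulateRIR(room_sz, beta, pos_src[i:i + chunk_size], pos_rcv,
                                   nb_img, Tmax, fs, Tdiff=None, mic_pattern='omni'))
rirs = np.concatenate(rirs, axis=0)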

DavidDiazGuerra commented 4 years ago

I am not sure, but it could be due to the maximum number of threads that CUDA can launch at once. As for the low memory usage and utilization rate, that is because you are simulating really short RIRs. You said that you wanted to simulate "fully reflective walls", but with beta = [0.] * 6 and Tmax = 0.0066 you are simulating an anechoic room.
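For reference, if you actually wanted a reverberant room, the helper functions shown in the README can derive the simulation parameters from a target T60 (the values here are just an example):

import gpuRIR

room_sz = [4, 4, 4]
T60 = 0.5  # example target reverberation time in seconds

beta = gpuRIR.beta_SabineEstimation(room_sz, T60)  # reflection coefficients for that T60
Tdiff = gpuRIR.att2t_SabineEstimator(15, T60)      # time to switch to the diffuse reverberation model
Tmax = gpuRIR.att2t_SabineEstimator(60, T60)       # time to stop the simulation (60 dB attenuation)
nb_img = gpuRIR.t2n(Tdiff, room_sz)                # number of image sources in each dimension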

mirbagheri commented 4 years ago

Thanks for the reply. I understand the thread limit, but is there a better way to utilize GPU parallelization in scenarios like mine?

You are right! I meant the fully absorptive case.

DavidDiazGuerra commented 4 years ago

gpuRIR is designed to have optimum performance with longer reverberations, since these are usually the bottleneck of most simulations. To better exploit the GPU in scenarios like yours, it might be good to run several simulations in parallel. CUDA allows you to launch several kernels without waiting for them to finish, but modifying gpuRIR to do so would mean significant changes to its architecture. Therefore, I guess it will be easier to use CPU parallelization in your Python script to run several gpuRIR.simulateRIR calls at once. I have never done that, but I think gpuRIR should be thread-safe as long as all the threads use the same LUT and MixedPrecision modes.

Just out of curiosity, how long does it take to run 2000 anechoic simulations?

mirbagheri commented 4 years ago

Thanks for the tip! I tested the CPU parallelization as below and it's working.

import os
from multiprocessing import Pool

import numpy as np

def func(cnt):
    # gpuRIR must be imported inside the worker process (see the note below)
    import gpuRIR
    gpuRIR.activateMixedPrecision(False)
    gpuRIR.activateLUT(True)
    # Anechoic 4 x 4 x 4 room: one source at [3, 2, 2], 2000 receivers at [2, 2, 2]
    return gpuRIR.simulateRIR([4, 4, 4], [0.] * 6, np.array([[3, 2, 2]]),
                              2 * np.ones((2000, 3)),
                              [2] * 3, 0.0066, 44100, mic_pattern='omni')

if __name__ == '__main__':
    p = Pool(os.cpu_count())
    results = p.map(func, range(1000))

Note that gpuRIR should be imported inside func, otherwise you get an "Incorrect Initialization" error. This took 3.49 s ± 14.3 ms to finish with 28 Intel Xeon CPU cores.

GaetanLepage commented 2 years ago

What exactly causes the GPUassert: initialization error /tmp/pip-req-build-i673lsff/src/gpuRIR_cuda.cu 764? I am trying to use gpuRIR in a gym.vector.AsyncVectorEnv (which relies on Python's native multiprocessing library).

gpuRIR should be instantiated inside each process, right?

DavidDiazGuerra commented 2 years ago

I'm afraid I've never worked with gpuRIR (or anything using CUDA) in multiprocessing scenarios, so I cannot offer much help with this. A quick Google search suggests that you can only initialize the CUDA context once in your program, and many answers to PyTorch issues and questions point here as the solution: https://pytorch.org/docs/master/notes/multiprocessing.html#cuda-in-multiprocessing

Maybe you could use the context argument of the gym.vector.AsyncVectorEnv constructor to start your subprocesses in spawn or forkserver mode?
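I haven't tested it, but I'd expect something along these lines (env_fns is just a placeholder for your list of environment constructors):

import gym

# env_fns: list of callables creating the gpuRIR-based environments (placeholder)
# Each spawned subprocess should import and configure gpuRIR after it starts
envs = gym.vector.AsyncVectorEnv(env_fns, context="spawn")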

GaetanLepage commented 2 years ago

Thank you so much for the quick answer. I had to change the code for gym's AsyncVectorEnv a bit, but it indeed works with context=torch.multiprocessing.get_context('spawn').

Have a great day