bmcfee / resampy

Efficient sample rate conversion in python
https://resampy.readthedocs.io
ISC License
253 stars 36 forks source link

Use parallel=True by default can be problematic in multiprocessing applications #107

Closed riqiang-dp closed 2 years ago

riqiang-dp commented 2 years ago

In my use case an update from resampy 0.2.2 to 0.3.1 broke my program, since resampy is called in a multiprocessing pool. Depending on the version of other packages (I didn't find out the exact reason for the different following behavior), this would either end up making the program really slow (because n_cpu * n_cpu number of processes were created by calling multiprocessing inside multiprocessing), or emit this error: Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Can you guys set it to False by default or give some warning about running resampy in multiprocessing scenario?

bmcfee commented 2 years ago

Thanks for raising this - can you provide a bit more context about how you're using the library? (I.e., are you calling it directly, or through some other package?) Would it be easy to just add parallel=False to your call site?

I don't think a warning is necessarily appropriate, but we can certainly consider reverting the default back to parallel=False. The thinking here was that multiprocessing applications like yours are probably not the default / most common setting, but we do want to make it easy to limit multi-core usage for exactly this kind of application.

It does sound like for your application, it would actually make a lot of sense to limit numba core usage globally (and not just for resampy) - that way it would apply to all numba-parallel usage and not just our package. This can be done through a global environment variable as noted in #106 .

riqiang-dp commented 2 years ago

Thanks for the quick reply! So yes, if there's a way to auto-detect being in a subprocess, and disable it accordingly, that would also be great.

Looking at #106, I think what they are seeing is this exact same problem. I'm also using librosa, and librosa hasn't updated their call to resampy.resample to specify parallel. So I don't have the control of turning it off. Took me some time to finally get to the bottom of this...

I'll test numba.set_num_threads(1) to see if it works as well.

bmcfee commented 2 years ago

So yes, if there's a way to auto-detect being in a subprocess, and disable it accordingly, that would also be great.

That seems pretty unlikely to me. I wouldn't trust such an auto-detection to be reliable, since there are many ways in which an upstream caller might be executing in parallel. I'd rather just have a good default setting and be upfront about documentation.

Looking at #106, I think what they are seeing is this exact same problem.

That seems likely, but their example seems very complicated and not something I could easily reproduce locally.

I'll test numba.set_num_threads(1) to see if it works as well.

Please report back if it works!

riqiang-dp commented 2 years ago

Ah it doesn't work unfortunately. That line effectively sets numba threads to 1, but I still got this error: Terminating: fork() called from a process already using GNU OpenMP, this is unsafe. I'm guessing it's because numba will still try to spawn the processes even though num_threads is set to one? I don't know, this is really outside of my domain haha.

bmcfee commented 2 years ago

That's a bummer, but good to know! To the previous question: are you able to call resampy directly with parallel=False? That will force it to use the single-threaded implementation, and should also fix it for you. The catch here is that if you're using resampy through an intermediate library (eg librosa) the API isn't directly exposed yet.

riqiang-dp commented 2 years ago

Yeah you're right. I'll explore that option as well when I get time. For now the old version works well.

zhangrui-wolf commented 2 years ago

We had the same problem with librosa. It was finally traced to a problem with the dependency package resampy. I'm here to leave a comment. The easiest way to reproduce this is to use the multiprocessing library and run it in parallel.

talhaanwarch commented 2 years ago

Have same issue with huggingface https://github.com/huggingface/datasets/issues/4820