Closed avalentino closed 2 years ago
Merging #93 (0603dea) into main (ee82e25) will decrease coverage by 1.35%. The diff coverage is 77.77%.
```
@@           Coverage Diff           @@
##            main      #93    +/-  ##
==========================================
- Coverage   82.60%   81.25%   -1.36%
==========================================
  Files           4        4
  Lines          92       96       +4
==========================================
+ Hits           76       78       +2
- Misses         16       18       +2
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 81.25% <77.77%> (-1.36%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| resampy/core.py | 80.76% <77.77%> (-2.57%) | :arrow_down: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee82e25...0603dea. Read the comment docs.
Thanks for getting this one going!
After thinking this one over a bit, I'm not sure that an environment variable is the best way to go here. (If for no other reason, it's kind of a pain to set up unit tests for.)
What do you think about making this more configurable via parameters to the user-facing API? E.g.
```python
resampy.resample(..., parallel=True)  # or parallel=False
```
We could have two jit-wrapped versions of the core function on hand by replacing https://github.com/bmcfee/resampy/blob/ee82e25c7b965a47df4d17daf157b8d3a36192a0/resampy/interpn.py#L7-L8 with something like
```python
def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...

resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)
```
and then have the user-facing API select between the two. It's a little hacky, but should not present significant overhead and would certainly be easier to test and benchmark.
As an aside, we might want to consider adding cached compilation (to both) so that we don't have to pay the startup cost each time.
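To make the dispatch concrete, here is a minimal sketch of how the user-facing selection could look. The `_resample_f` body below is a simplified stand-in (the real signature lives in `resampy/interpn.py`), and the fallback decorator is only there so the sketch runs even without numba installed.

```python
import numpy as np

try:
    import numba

    def _jit(parallel):
        # Same jit options as proposed above, caching left out for now
        return numba.jit(nopython=True, nogil=True, parallel=parallel)
except ImportError:
    def _jit(parallel):
        # No-op decorator so the sketch still runs without numba
        return lambda f: f


def _resample_f(x, scale=1.0):
    # Simplified stand-in body for illustration only
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] * scale
    return out


resample_f_p = _jit(parallel=True)(_resample_f)   # parallel variant
resample_f_s = _jit(parallel=False)(_resample_f)  # sequential variant


def resample(x, scale=1.0, parallel=True):
    # The user-facing API just picks one of the two pre-wrapped variants
    f = resample_f_p if parallel else resample_f_s
    return f(x, scale)
```

Since both variants are created at import time, the selection itself adds only a single attribute lookup per call.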
Dear @bmcfee, I like your solution a lot. I will try to implement it ASAP. Just two final details:

- Can you please confirm that the default option for `resample` should be `parallel=True`?
I'd like to benchmark it on typical loads before making a decision on this.
- Do you want to run all tests for both the sequential and parallel versions, or is a dedicated test for the sequential version enough?
Nah, I think a simple numerical equivalence check for one run would be sufficient. Note that we might not get exact floating point equivalence here because fp math is not really associative or commutative, but it shouldn't drift too far from the serial implementation.
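To illustrate why exact equality would be too strict here, this small sketch compares the same float32 sum accumulated in two different orders, mimicking how a parallel reduction can reorder operations; a tolerance-based check like `np.allclose` still passes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

# Forward accumulation, standing in for the serial code path
serial_sum = np.float32(0.0)
for v in x:
    serial_sum += v

# Reversed accumulation, standing in for a reordered parallel reduction
reordered_sum = np.float32(0.0)
for v in x[::-1]:
    reordered_sum += v

# Bitwise equality may fail because float addition is not associative,
# but the two results stay numerically close:
assert np.allclose(serial_sum, reordered_sum, atol=1e-2)
```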
Thanks @avalentino ! This is looking good - any sense of how it benchmarks on typical loads?
Quick follow-up, I tried benchmarking this on some typical loads (10s, 60s, 600s signals at 44100→32000, float32) and didn't see any consistent benefit. I also tried this on both my laptop (8 cores) and a 24-core server, with no clear difference in behavior.
A couple of thoughts:
@bmcfee I made some benchmarks when I initially proposed to activate the parallelism. At the moment I do not remember the details, but the speedup on my quad-core laptop was around 3.1, which is not bad. As soon as I can access my laptop again, I will try to recover my benchmark script, re-run it on my fresh Ubuntu installation, and share it.
Regarding the possible use of `prange` in inner loops: at first glance I would say that it is not a good idea, but nothing is better than a direct test.
OK @bmcfee, it seems that the `cache` parameter has a behaviour that is unexpected to me. I should read the documentation more carefully. Meanwhile, I have just removed the option from the call to `numba.jit`, and now things seem to behave as expected.
```
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (4 physical cores)
CPU count: 8
Input size 220500
Output size: 882000
best of 5: 1780.217 [ms]
best of 5: 471.998 [ms]
speedup: 3.77165920690257
```
I have used the following script to benchmark the resample function:
```python
#!/usr/bin/env python3
import os
import re
import platform
import subprocess
import timeit

import numpy as np
import resampy


# CPU info
# https://stackoverflow.com/questions/4842448/getting-processor-information-in-python
def get_processor_name():
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/usr/sbin'
        command = ["sysctl", "-n", "machdep.cpu.brand_string"]
        return subprocess.check_output(command).decode().strip()
    elif platform.system() == "Linux":
        model_name = None
        cpu_cores = None
        with open('/proc/cpuinfo') as fd:
            cpuinfo = fd.read()
        for line in cpuinfo.split("\n"):
            if "model name" in line:
                model_name = re.sub(".*model name.*:", "", line, 1)
            elif "cpu cores" in line:
                cpu_cores = re.sub(".*cpu cores.*:", "", line, 1).strip()
            if None not in (cpu_cores, model_name):
                break
        return f'{model_name} ({cpu_cores} physical cores)'
    return ""


print('CPU:', get_processor_name())
print('CPU count:', os.cpu_count())

# setup
sr_orig = 22050.0
f0 = 440
T = 10
ovs = 4
t = np.arange(T * sr_orig) / sr_orig
x = np.sin(2 * np.pi * f0 * t)
sr_new = sr_orig * ovs
print('Input size', len(x))
print('Output size:', len(x) * ovs)

# trigger JIT compilation of both variants before timing
resampy.resample(x, sr_orig, sr_new, parallel=False)
resampy.resample(x, sr_orig, sr_new, parallel=True)

# timeit
debug = False
repeat = 5
results = {}
for label, parallel in zip(('sequential', 'parallel'), (False, True)):
    timer = timeit.Timer(
        f'resampy.resample(x, sr_orig, sr_new, parallel={parallel})',
        globals=globals().copy())
    number, _ = timer.autorange()
    number = max(3, number)
    times = timer.repeat(repeat, number)
    if debug:
        print(label, times)
    results[label] = min(times) / number * 1e3
    print(f'best of {repeat}: {results[label]:.3f} [ms]')

print('speedup:', results['sequential'] / results['parallel'])
```
> Meanwhile I have just removed the option from the call to `numba.jit` and now things seem to behave as expected.
Nice - confirming that I also see some good speedups on my laptop with a basic test load (2 runs each to ensure eliminating compilation effects):
```
In [4]: sr = 44100
In [5]: sr_new = 32000
In [6]: x = np.random.randn(120 * sr).astype(np.float32)
In [7]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.14 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.41 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
723 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
791 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
and on the 24-core server:
```
In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
298 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
5 s ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
I'm curious about how the `cache=` parameter is affecting this though, especially given this bit of the documentation:

> The following features are explicitly verified to work with caching:
> - ufuncs and gufuncs for the cpu and parallel target
> - parallel accelerator features (i.e. parallel=True)
We should probably dig into this and get a better understanding of what the trouble is.
Otherwise, I'm happy for parallel=True to be the default going forward.
It seems that the numba caching gets somewhat confused if one applies the `jit` decorator two times to the same Python function.

I made a simple test consisting of defining two identical Python functions with different names (`_resample_f_p` and `_resample_f_s`). When I apply the `jit` decorator with `parallel=True`/`False` and `cache=True`, all seems to work.

Probably it is better to leave out the caching for the moment. Maybe we could open an issue for numba regarding this strange caching behaviour.
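For reference, a minimal sketch of that two-names test: two identically written functions under distinct names, each wrapped with its own `jit(..., cache=True)` call so the cache keys them separately. The bodies are simplified stand-ins, and a no-op fallback keeps the sketch runnable without numba.

```python
import numpy as np

try:
    import numba

    def _jit(parallel):
        # cache=True works here because each variant wraps a distinct function
        return numba.jit(nopython=True, nogil=True, cache=True,
                         parallel=parallel)
except ImportError:
    def _jit(parallel):
        # No-op fallback so the sketch still runs without numba
        return lambda f: f


def _resample_f_s(x, scale):
    # Simplified stand-in body
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] * scale
    return out


def _resample_f_p(x, scale):
    # Identical body, but a distinct name and function object
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] * scale
    return out


resample_f_s = _jit(parallel=False)(_resample_f_s)
resample_f_p = _jit(parallel=True)(_resample_f_p)
```

The duplication is ugly, but it sidesteps jitting the same function object twice, which is what appears to confuse the cache.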
> It seems that the numba caching gets somewhat confused if one applies the `jit` decorator two times to the same Python function.
Ah! That makes sense, and is silly. Maybe we could just do something like

```python
resample_f_s = jit(..., parallel=False)(_resample_f)
resample_f_p = jit(..., parallel=True)(copy(_resample_f))
```

?
But otherwise agreed that caching is not necessary for this PR.
The solution using `copy.copy` seems not to work for me. I have also tried modifying the `__name__` of the copied function.
> The solution using `copy.copy` seems not to work for me.
Alright, let's punt it then. Probably this is an issue to raise with the numba folks - caching should depend on the jit parameters as well as the function body, and it seems like a bug if that's not the case.
There is indeed an open numba issue: https://github.com/numba/numba/issues/6845
For me it is complete. Please go ahead.
:shipit: thanks!
This PR enables the numba jit parallel option by default. The option can be disabled by setting an environment variable.