bmcfee / resampy

Efficient sample rate conversion in python
https://resampy.readthedocs.io
ISC License
253 stars 36 forks source link

Enable parallelism #93

Closed avalentino closed 2 years ago

avalentino commented 2 years ago

This PR enables the numba jit parallel option by default. The option can be disabled by setting an environment variable.

codecov-commenter commented 2 years ago

Codecov Report

Merging #93 (0603dea) into main (ee82e25) will decrease coverage by 1.35%. The diff coverage is 77.77%.

@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
- Coverage   82.60%   81.25%   -1.36%     
==========================================
  Files           4        4              
  Lines          92       96       +4     
==========================================
+ Hits           76       78       +2     
- Misses         16       18       +2     
Flag       Coverage Δ
unittests  81.25% <77.77%> (-1.36%) ↓

Flags with carried forward coverage won't be shown.

Impacted Files   Coverage Δ
resampy/core.py  80.76% <77.77%> (-2.57%) ↓


bmcfee commented 2 years ago

Thanks for getting this one going!

After thinking this one over a bit, I'm not sure that an environment variable is the best way to go here. (If for no other reason, it's kind of a pain to set up unit tests for.)

What do you think about making this more configurable via parameters to the user-facing API? E.g.

resampy.resample(..., parallel=True)  # or parallel=False

We could have two jit-wrapped versions of the core function on hand by replacing https://github.com/bmcfee/resampy/blob/ee82e25c7b965a47df4d17daf157b8d3a36192a0/resampy/interpn.py#L7-L8 with something like

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...

resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)

and then have the user-facing API select between the two. It's a little hacky, but should not present significant overhead and would certainly be easier to test and benchmark.

As an aside, we might want to consider adding cached compilation (to both) so that we don't have to pay the startup cost each time.
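To make the idea concrete, here is a minimal sketch of the selection logic with a toy kernel standing in for the real _resample_f (the kernel body, toy_resample, and the _jit helper are illustrative, not the actual resampy code; the numba import is guarded so the sketch runs either way, and cache=True could be added inside _jit for the cached compilation mentioned above):

```python
import numpy as np

try:
    import numba

    def _jit(parallel):
        # cache=True here would enable cached compilation as well
        return numba.jit(nopython=True, nogil=True, parallel=parallel)
except ImportError:  # no-op decorator so the sketch runs without numba
    def _jit(parallel):
        return lambda f: f

def _toy_kernel(x, factor, out):
    # toy stand-in for the real _resample_f loop
    for i in range(len(x)):
        out[i] = x[i] * factor

# jit-wrap the same Python function twice, once per parallel setting
toy_kernel_s = _jit(parallel=False)(_toy_kernel)
toy_kernel_p = _jit(parallel=True)(_toy_kernel)

def toy_resample(x, factor, parallel=True):
    out = np.empty_like(x)
    fn = toy_kernel_p if parallel else toy_kernel_s  # user-facing selection
    fn(x, factor, out)
    return out
```

The user-facing function only picks which pre-built variant to call, so the dispatch itself adds negligible overhead.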

avalentino commented 2 years ago

Dear @bmcfee, I like your solution a lot. I will try to implement it ASAP. Just two final details:

bmcfee commented 2 years ago
  • Can you please confirm that the default option for resample should be parallel=True?

I'd like to benchmark it on typical loads before making a decision on this.

  • do you want to run all tests for both the sequential and parallel versions, or is a dedicated test for the sequential version enough?

Nah, I think a simple numerical equivalence check for one run would be sufficient. Note that we might not get exact floating point equivalence here because fp math is not really associative or commutative, but it shouldn't drift too far from the serial implementation.
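The floating-point drift mentioned here can be illustrated without resampy at all: reordering a float32 reduction (as a parallel kernel effectively does) perturbs the result slightly, which is why the equivalence check should use a tolerance rather than exact equality. A small stand-alone illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

serial = float(x.sum())
# summing chunk-by-chunk mimics the reordering a parallel reduction performs
chunked = float(sum(x[i:i + 100].sum() for i in range(0, len(x), 100)))

# the two may differ in the last bits, but stay within a small tolerance --
# the kind of check proposed for comparing the serial and parallel kernels
assert np.allclose(serial, chunked, atol=1e-2)
```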

bmcfee commented 2 years ago

Thanks @avalentino ! This is looking good - any sense of how it benchmarks on typical loads?

bmcfee commented 2 years ago

Quick follow-up, I tried benchmarking this on some typical loads (10s, 60s, 600s signals at 44100→32000, float32) and didn't see any consistent benefit. I also tried this on both my laptop (8 cores) and a 24-core server, with no clear difference in behavior.

A couple of thoughts:

  1. It may be necessary to enable prange for the inner loops as well as the outer loop
  2. We might have to do a bit of digging into the compiler optimization to see if it's missing something, or if it's optimizing at all.
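For reference, enabling prange on the outer loop would look roughly like this (a toy kernel, not the actual resampy code; the numba import is guarded so the sketch runs without it):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fall back so the sketch runs without numba
    prange = range

    def njit(**kwargs):
        return lambda f: f

@njit(parallel=True, nogil=True)
def _toy_parallel(x, factor, out):
    # numba only parallelizes the outermost prange; nested prange
    # loops run serially, which is why pushing prange into inner
    # loops may not help
    for i in prange(len(x)):
        out[i] = x[i] * factor

x = np.arange(8.0)
out = np.empty_like(x)
_toy_parallel(x, 2.0, out)
```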
avalentino commented 2 years ago

@bmcfee I did some benchmarking when I initially proposed activating the parallelism. At the moment I do not remember the details, but the speedup on my quad-core laptop was around 3.1x, which is not bad. As soon as I can access my laptop again, I will try to recover my benchmark script, re-run it on my fresh Ubuntu installation, and share it.

avalentino commented 2 years ago

Regarding the possible use of prange in inner loops: at first glance I would say it is not a good idea, but nothing beats a direct test.

avalentino commented 2 years ago

OK @bmcfee, it seems that the cache parameter has a behaviour that is unexpected to me. I should read the documentation more carefully. Meanwhile, I have just removed the option from the call to numba.jit, and now things seem to behave as expected.

CPU:  Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (4 physical cores)
CPU count: 8
Input size 220500
Output size: 882000
best of 5: 1780.217 [ms]
best of 5: 471.998 [ms]
speedup: 3.77165920690257

I have used the following script to benchmark the resample function:

#!/usr/bin/env python3

import os
import timeit
import numpy as np
import resampy

# CPU info
# https://stackoverflow.com/questions/4842448/getting-processor-information-in-python
def get_processor_name():
    import platform
    import subprocess
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/usr/sbin'
        command = ["sysctl", "-n", "machdep.cpu.brand_string"]
        return subprocess.check_output(command).decode().strip()
    elif platform.system() == "Linux":
        import re
        model_name = None
        cpu_cores = None
        with open('/proc/cpuinfo') as fd:
            cpuinfo = fd.read()
            for line in cpuinfo.split("\n"):
                if "model name" in line:
                    model_name = re.sub(".*model name.*:", "", line, count=1)
                elif "cpu cores" in line:
                    cpu_cores = re.sub(".*cpu cores.*:", "", line, count=1).strip()
                if None not in (cpu_cores, model_name):
                    break
        return f'{model_name} ({cpu_cores} physical cores)'

    return ""

print('CPU:', get_processor_name())
print('CPU count:', os.cpu_count())

# setup
sr_orig = 22050.0
f0 = 440
T = 10
ovs = 4
t = np.arange(T * sr_orig) / sr_orig
x = np.sin(2 * np.pi * f0 * t)

sr_new = sr_orig * ovs
print('Input size', len(x))
print('Output size:', len(x) * ovs)

# compile
resampy.resample(x, sr_orig, sr_new, parallel=False)
resampy.resample(x, sr_orig, sr_new, parallel=True)

# timeit
debug = False
repeat = 5
results = {}
for label, parallel in zip(('sequential', 'parallel'), (False, True)):
    timer = timeit.Timer(
        f'resampy.resample(x, sr_orig, sr_new, parallel={parallel})',
        globals=globals().copy())

    number, _ = timer.autorange()
    number = max(3, number)
    # print('number:', number)

    times = timer.repeat(repeat, number)
    if debug:
        print(label, times)
    results[label] = min(times)/number * 1e3
    print(f'best of {repeat}: {results[label]:.3f} [ms]')

print('speedup:', results['sequential']/results['parallel'])
bmcfee commented 2 years ago

Meanwhile, I have just removed the option from the call to numba.jit, and now things seem to behave as expected.

Nice - confirming that I also see some good speedups on my laptop with a basic test load (2 runs each to rule out compilation effects):

In [4]: sr = 44100

In [5]: sr_new = 32000

In [6]: x = np.random.randn(120 * sr).astype(np.float32)

In [7]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.14 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.41 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
723 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
791 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and on the 24-core server:

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
298 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
5 s ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm curious about how the cache= parameter is affecting this though, especially given this bit of the documentation

The following features are explicitly verified to work with caching: ufuncs and gufuncs for the cpu and parallel target parallel accelerator features (i.e. parallel=True)

We should probably dig into this and get a better understanding of what the trouble is.

Otherwise, I'm happy for parallel=True to be the default going forward.

avalentino commented 2 years ago

It seems that numba caching gets somewhat confused if one applies the jit decorator twice to the same Python function. I made a simple test consisting of defining two identical Python functions with different names (_resample_f_p and _resample_f_s). When I apply the jit decorator with parallel=True/False and cache=True, everything seems to work. It is probably better to leave out caching for the moment. Maybe we could open an issue for numba about this strange caching behaviour.

bmcfee commented 2 years ago

It seems that numba caching gets somewhat confused if one applies the jit decorator twice to the same Python function.

Ah! That makes sense, and is silly. Maybe we could just do something like

resample_f_s = jit(..., parallel=False)(_resample_f)
resample_f_p = jit(..., parallel=True)(copy(_resample_f))

?

But otherwise agreed that caching is not necessary for this PR.

avalentino commented 2 years ago

The solution using copy.copy does not seem to work for me. I have also tried modifying the __name__ of the copied function.

bmcfee commented 2 years ago

The solution using copy.copy seems not to work for me.

Alright, let's punt it then. Probably this is an issue to raise with the numba folks - caching should depend on the jit parameters as well as the function body, and it seems like a bug if that's not the case.

avalentino commented 2 years ago

There is indeed an open numba issue: https://github.com/numba/numba/issues/6845

avalentino commented 2 years ago

From my side it is complete. Please go ahead.

bmcfee commented 2 years ago

:shipit: thanks!