Release GIL when doing the compute-intensive CRC computation

jonded94 commented 3 months ago

Unfortunately, this library captures the Python GIL for the entirety of its computation. I chose this library because it's quite fast compared to others, but this is pretty worthless if it's forcing one to use multiprocessing for a realtively mundane task such as computing a hash of a given bytebuffer.

I think this should work. On a very simply test benchmark, I saw a sizeable performance improvement using this fork (~46s vs. ~8s using 16 threads on my 13th Gen Intel(R) Core(TM) i7-1370P):

import crc32c
import concurrent.futures
import io
import time

def calc_crc(inp: io.BytesIO) -> int:
    return crc32c.crc32c(inp.getbuffer())

N_DATA = 1_000_000_000
N_COUNT = 1000

data = io.BytesIO(b"0"*N_DATA)

ts = time.perf_counter()
res = [calc_crc(data) for _ in range(N_COUNT)]
te = time.perf_counter()
print(te-ts)  # 46.33s

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    ts = time.perf_counter()
    res_threaded = list(executor.map(calc_crc, [data for _ in range(N_COUNT)]))
    te = time.perf_counter()
    print(te-ts)  # 8.02s
    assert res == res_threaded

Summary by Sourcery

Enhance the CRC computation by releasing the GIL during the compute-intensive operation, resulting in significant performance improvements in multi-threaded scenarios.

Enhancements:

Release the Global Interpreter Lock (GIL) during the compute-intensive CRC computation to improve performance in multi-threaded environments.

sourcery-ai[bot] commented 3 months ago

Reviewer's Guide by Sourcery

This pull request optimizes the CRC computation by releasing the Global Interpreter Lock (GIL) during the compute-intensive part of the operation. This change allows for better multi-threading performance, significantly reducing the computation time when using multiple threads.

File-Level Changes

Files	Changes
`_crc32c.c`	Introduced GIL release and reacquisition around the CRC computation to improve multi-threading performance.

Tips

- Trigger a new Sourcery review by commenting `@sourcery-ai review` on the pull request. - Continue your discussion with Sourcery by replying directly to review comments. - You can change your review settings at any time by accessing your [dashboard](https://app.sourcery.ai): - Enable or disable the Sourcery-generated pull request summary or reviewer's guide; - Change the review language; - You can always [contact us](mailto:support@sourcery.ai) if you have any questions or feedback.

rtobar commented 3 months ago

@jonded94 thanks for this contribution. It's been a long time since I wrote the package originally, and after so long I had assumed we were releasing the GIL already.

I agree with the change in principle; as you point out it unnecessary hinders multithreaded computation. Having said that, I'd like to have a look at the performance penalty as a function of input buffer size. Surely there's a cost associated with releasing/acquiring the GIL, which might affect performance in single-threaded cases, specially with small buffer sizes. I seem to remember binascii.crc32 does (or did) something similar.

I'll run some benchmarks locally and post results here. But again, thanks again for the contribution, most definitely a version of these changes will make it in.

rtobar commented 3 months ago

@jonded94 thanks again for taking interest in this issue, and providing the patch.

I wrote the following benchmark to measure the overhead of releasing the GIL as a function of input buffer size, both for the software and hardware modes.

import time

import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt

from crc32c import crc32c

def _run(buf, n_iter: int, release_gil: bool, sw_mode: bool) -> float:
    start = time.monotonic()
    [crc32c(buf, 0, release_gil, sw_mode) for _ in range(n_iter)]
    return time.monotonic() - start

def _run_with_bufsizes(n_iter: int) -> DataFrame:
    bufsizes = (2 ** n for n in range(1, 20))
    buffers = [b'0' * size for size in bufsizes]
    return DataFrame.from_records(
        [(len(buffer), _run(buffer, n_iter, False, sw_mode), _run(buffer, n_iter, True, sw_mode), sw_mode) for buffer in buffers for sw_mode in (False, True)],
        columns=("bufsize", "time_no_release", "time_release", "sw_mode")
    )

n_iter = 10000
df = _run_with_bufsizes(n_iter)
df["release_overhead"] = (df["time_release"] / df["time_no_release"]) - 1
sns.set_theme()
sns.catplot(df, x="bufsize", y="release_overhead", hue="sw_mode", kind="bar")
plt.show()

You'll note that the crc32c function used for this accepts two extra options to release/not-release the GIL, and force using the SW implementation or not. This is the diff on top of your changes:

--- a/_crc32c.c
+++ b/_crc32c.c
@@ -41,6 +41,8 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
        Py_buffer pbin;
        unsigned char *bin_data = NULL;
        uint32_t crc = 0U, result;
+       int release_gil = 0;
+       int sw_mode = 0;

        /* In python 3 we accept only bytes-like objects */
        const char *format =
@@ -49,17 +51,26 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
 #else
        "s*"
 #endif
-       "|I:crc32";
+       "|Ipp:crc32c";

-       if (!PyArg_ParseTuple(args, format, &pbin, &crc) )
+       if (!PyArg_ParseTuple(args, format, &pbin, &crc, &release_gil, &sw_mode) )
                return NULL;

+       crc_function the_crc_fn = (sw_mode ? _crc32c_sw_slicing_by_8 : crc_fn);
+
        bin_data = pbin.buf;
+       if (release_gil) {
        Py_BEGIN_ALLOW_THREADS
        crc ^= 0xffffffff;
-       result = crc_fn(crc, bin_data, pbin.len);
+       result = the_crc_fn(crc, bin_data, pbin.len);
        result ^= 0xffffffff;
        Py_END_ALLOW_THREADS
+       }
+       else {
+       crc ^= 0xffffffff;
+       result = the_crc_fn(crc, bin_data, pbin.len);
+       result ^= 0xffffffff;
+       }

        PyBuffer_Release(&pbin);
        return PyLong_FromUnsignedLong(result);

These are the results on my system (AMD Ryzen 7 5825U, CPython 3.12):

benchmark

As expected, the penalty is greater at smaller buffer sizes, and for the HW mode. Very roughly, and of course very specifically to my system, the penalty in HW mode decreases to ~2% at 32KB, and ~1% at 128KB. For SW those happen at somewhere around ~4KB and at ~8KB.

Also, as I remembered, the binascii.crc32 function does indeed conditionally release the GIL, see https://github.com/python/cpython/blob/cc6839a1810290e483e5d5f0786d9b46c4294d47/Modules/binascii.c#L772-L798. They put the limit at 5 KB, which more or less matches the SW mode at somewhere the ~1/2%.

In summary, I think we should restrict the releasing of the GIL to a minimum buffer size.

I'm a bit torn on the actual value though: it should be small enough that it covers most user cases, but big enough that the overhead of releasing the GIL isn't too big. A buffer size that I think would be suitable would be similar to that used for file or socket reading operations, since the data fed into this package comes likely from those. These buffer sizes are usually in the single or double-digit KBs. And given the numbers I got in my system, I'm inclined to make the cut at 32 KB. @jonded94 would you want to put that change in?

If we also wanted more flexibility, we could offer a module-level function that altered this limit, or even a flag like in the diff above that can be used on a per-call basis.

jonded94 commented 3 months ago

Thank you very much for your in-depth analysis, @rtobar ! Yes, it probably makes sense only releasing GIL after a certain threshold. I actually was unaware that it creates so much of an overhead.

TODOs:

Implement dynamic GIL release behaviour, probably with this final function header: crc32c(buf: bytes, checksum: int = 0, release_gil: bool | None = None)
- My idea would be that release_gil = None should dynamically determine whether to release the GIL (i.e. after a 32KiB threshold for example). The user can specify on its own if he knows it better by giving a boolean explicitly.
Make this package a PEP 561 (https://peps.python.org/pep-0561/) compliant package
- Because of a missing stub files, mypy will error with this message: error: Skipping analyzing "crc32c": module is installed, but missing library stubs or py.typed marker [import-untyped]
- Since there practically are only two functions in this package (with one being deprecated) I think it should be fine to include this in the PR? I thought of this because we're planning to slightly change the function header anyways, so proper type hints could be nice.

Will do this as soon as I find the time for it (probably this weekend or slightly later) :)

jonded94 commented 3 months ago

I now made this package PEP 561 compliant and added typehint stub files. This seems to have worked, now even a crc32c.crc32c(io.BytesIO(b"123").getbuffer()) properly typechecks with mypy --strict.

Also, I added a new keyword argument release_gil. This can be bool | int, and has this behaviour:

(default) -1 (or less): Automatically decide whether to release GIL or not. Will do it starting from buffer sizes >= 32KiB
0 or False: Never release GIL
1 or True: Always release GIL

jonded94 commented 3 months ago

Thanks rtobar for your in-depth review! :)

I hopefully addressed all of your concerns and pushed a bunch of new commits. Specifically, I also removed the adding of typehints from this PR and moved them into a separate one (https://github.com/ICRAR/crc32c/pull/49).

Please let me know if you see anything that still needs further refinement.

jonded94 commented 3 months ago

Implemented the suggestions from your review and squashed the changes into one commit. Should be fine now? :smile:

rtobar commented 3 months ago

Many thanks again @jonded94 for the effort and patience! I'll wait for CI to check that everything's fine and I'll merge.

I'm more than happy to publish a new release on PyPI after this, would you want to get the other PR through first?

jonded94 commented 3 months ago

would you want to get the other PR through first

Yes, surely, as we're using mypy to a broad degree in our entire codebase and this would be very helpful (I'm having to add # type: ignore[import-untyped] everywhere where I import this library right now).

Setting "sw mode" as an additional kwarg could be interesting for a later PR, after the release.

ICRAR / crc32c