Closed jonded94 closed 3 months ago
This pull request optimizes the CRC computation by releasing the Global Interpreter Lock (GIL) during the compute-intensive part of the operation. This change allows for better multi-threading performance, significantly reducing the computation time when using multiple threads.
Files | Changes |
---|---|
_crc32c.c |
Introduced GIL release and reacquisition around the CRC computation to improve multi-threading performance. |
@jonded94 thanks for this contribution. It's been a long time since I wrote the package originally, and after so long I had assumed we were releasing the GIL already.
I agree with the change in principle; as you point out it unnecessary hinders multithreaded computation. Having said that, I'd like to have a look at the performance penalty as a function of input buffer size. Surely there's a cost associated with releasing/acquiring the GIL, which might affect performance in single-threaded cases, specially with small buffer sizes. I seem to remember binascii.crc32
does (or did) something similar.
I'll run some benchmarks locally and post results here. But again, thanks again for the contribution, most definitely a version of these changes will make it in.
@jonded94 thanks again for taking interest in this issue, and providing the patch.
I wrote the following benchmark to measure the overhead of releasing the GIL as a function of input buffer size, both for the software and hardware modes.
import time
import seaborn as sns
from pandas import DataFrame
from matplotlib import pyplot as plt
from crc32c import crc32c
def _run(buf, n_iter: int, release_gil: bool, sw_mode: bool) -> float:
start = time.monotonic()
[crc32c(buf, 0, release_gil, sw_mode) for _ in range(n_iter)]
return time.monotonic() - start
def _run_with_bufsizes(n_iter: int) -> DataFrame:
bufsizes = (2 ** n for n in range(1, 20))
buffers = [b'0' * size for size in bufsizes]
return DataFrame.from_records(
[(len(buffer), _run(buffer, n_iter, False, sw_mode), _run(buffer, n_iter, True, sw_mode), sw_mode) for buffer in buffers for sw_mode in (False, True)],
columns=("bufsize", "time_no_release", "time_release", "sw_mode")
)
n_iter = 10000
df = _run_with_bufsizes(n_iter)
df["release_overhead"] = (df["time_release"] / df["time_no_release"]) - 1
sns.set_theme()
sns.catplot(df, x="bufsize", y="release_overhead", hue="sw_mode", kind="bar")
plt.show()
You'll note that the crc32c
function used for this accepts two extra options to release/not-release the GIL, and force using the SW implementation or not. This is the diff on top of your changes:
--- a/_crc32c.c
+++ b/_crc32c.c
@@ -41,6 +41,8 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
Py_buffer pbin;
unsigned char *bin_data = NULL;
uint32_t crc = 0U, result;
+ int release_gil = 0;
+ int sw_mode = 0;
/* In python 3 we accept only bytes-like objects */
const char *format =
@@ -49,17 +51,26 @@ PyObject* crc32c_crc32c(PyObject *self, PyObject *args) {
#else
"s*"
#endif
- "|I:crc32";
+ "|Ipp:crc32c";
- if (!PyArg_ParseTuple(args, format, &pbin, &crc) )
+ if (!PyArg_ParseTuple(args, format, &pbin, &crc, &release_gil, &sw_mode) )
return NULL;
+ crc_function the_crc_fn = (sw_mode ? _crc32c_sw_slicing_by_8 : crc_fn);
+
bin_data = pbin.buf;
+ if (release_gil) {
Py_BEGIN_ALLOW_THREADS
crc ^= 0xffffffff;
- result = crc_fn(crc, bin_data, pbin.len);
+ result = the_crc_fn(crc, bin_data, pbin.len);
result ^= 0xffffffff;
Py_END_ALLOW_THREADS
+ }
+ else {
+ crc ^= 0xffffffff;
+ result = the_crc_fn(crc, bin_data, pbin.len);
+ result ^= 0xffffffff;
+ }
PyBuffer_Release(&pbin);
return PyLong_FromUnsignedLong(result);
These are the results on my system (AMD Ryzen 7 5825U, CPython 3.12):
As expected, the penalty is greater at smaller buffer sizes, and for the HW mode. Very roughly, and of course very specifically to my system, the penalty in HW mode decreases to ~2% at 32KB, and ~1% at 128KB. For SW those happen at somewhere around ~4KB and at ~8KB.
Also, as I remembered, the binascii.crc32
function does indeed conditionally release the GIL, see https://github.com/python/cpython/blob/cc6839a1810290e483e5d5f0786d9b46c4294d47/Modules/binascii.c#L772-L798. They put the limit at 5 KB, which more or less matches the SW mode at somewhere the ~1/2%.
In summary, I think we should restrict the releasing of the GIL to a minimum buffer size.
I'm a bit torn on the actual value though: it should be small enough that it covers most user cases, but big enough that the overhead of releasing the GIL isn't too big. A buffer size that I think would be suitable would be similar to that used for file or socket reading operations, since the data fed into this package comes likely from those. These buffer sizes are usually in the single or double-digit KBs. And given the numbers I got in my system, I'm inclined to make the cut at 32 KB. @jonded94 would you want to put that change in?
If we also wanted more flexibility, we could offer a module-level function that altered this limit, or even a flag like in the diff above that can be used on a per-call basis.
Thank you very much for your in-depth analysis, @rtobar ! Yes, it probably makes sense only releasing GIL after a certain threshold. I actually was unaware that it creates so much of an overhead.
TODOs:
crc32c(buf: bytes, checksum: int = 0, release_gil: bool | None = None)
release_gil = None
should dynamically determine whether to release the GIL (i.e. after a 32KiB threshold for example). The user can specify on its own if he knows it better by giving a boolean explicitly.mypy
will error with this message: error: Skipping analyzing "crc32c": module is installed, but missing library stubs or py.typed marker [import-untyped]
Will do this as soon as I find the time for it (probably this weekend or slightly later) :)
I now made this package PEP 561 compliant and added typehint stub files. This seems to have worked, now even a crc32c.crc32c(io.BytesIO(b"123").getbuffer())
properly typechecks with mypy --strict
.
Also, I added a new keyword argument release_gil
. This can be bool | int
, and has this behaviour:
False
: Never release GILTrue
: Always release GILThanks rtobar for your in-depth review! :)
I hopefully addressed all of your concerns and pushed a bunch of new commits. Specifically, I also removed the adding of typehints from this PR and moved them into a separate one (https://github.com/ICRAR/crc32c/pull/49).
Please let me know if you see anything that still needs further refinement.
Implemented the suggestions from your review and squashed the changes into one commit. Should be fine now? :smile:
Many thanks again @jonded94 for the effort and patience! I'll wait for CI to check that everything's fine and I'll merge.
I'm more than happy to publish a new release on PyPI after this, would you want to get the other PR through first?
would you want to get the other PR through first
Yes, surely, as we're using mypy
to a broad degree in our entire codebase and this would be very helpful (I'm having to add # type: ignore[import-untyped]
everywhere where I import this library right now).
Setting "sw mode" as an additional kwarg could be interesting for a later PR, after the release.
Unfortunately, this library captures the Python GIL for the entirety of its computation. I chose this library because it's quite fast compared to others, but this is pretty worthless if it's forcing one to use multiprocessing for a realtively mundane task such as computing a hash of a given bytebuffer.
I think this should work. On a very simply test benchmark, I saw a sizeable performance improvement using this fork (~46s vs. ~8s using 16 threads on my 13th Gen Intel(R) Core(TM) i7-1370P):
Summary by Sourcery
Enhance the CRC computation by releasing the GIL during the compute-intensive operation, resulting in significant performance improvements in multi-threaded scenarios.
Enhancements: