Closed pjurgielewicz closed 1 year ago
Thanks for the repro case. I'll take a look at it.
I'm reading the PyTables documentation (used by pandas HDFStore) and it looks like open & close are not thread-safe even with the GIL. I know your example passes in Python 3.10 (with the GIL), but the documentation suggests that it's still not thread-safe. Most likely running without the GIL allows many more interleavings and therefore more opportunities for thread-safety issues to actually occur in a given run.
https://www.pytables.org/cookbook/threading.html
With a lock around the open/close, the repro passes (example here), but I think there are also other thread-safety issues lurking that are related to nogil. For example, PyTables is being installed from the source distribution that includes pre-Cythonized files that aren't thread-safe without the GIL [1]. I'll upload a PyTables wheel built with a compatible version of Cython, but I don't think that will fix the issues with open/close. I'll look into it a bit more and see if I can understand the issues further.
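To illustrate the shape of that workaround: a single module-level lock serializing only the open and close calls, while per-thread writes to each thread's own file proceed without it. This is a minimal sketch, not the actual repro — a plain file stands in for pd.HDFStore, since I can't assume PyTables is available, and write_chunk is a hypothetical helper:

```python
import threading

# One module-level lock serializes open/close across all threads,
# mirroring the workaround of wrapping pd.HDFStore open/close in a lock.
hdf_lock = threading.Lock()

def write_chunk(path, rows):
    # In the real case the locked sections would be
    # store = pd.HDFStore(path) and store.close();
    # a plain file stands in for the HDF5 store here.
    with hdf_lock:
        f = open(path, "w")      # "open" the store under the lock
    for r in rows:
        f.write(f"{r}\n")        # each thread writes only to its own file
    with hdf_lock:
        f.close()                # "close" the store under the lock

threads = [
    threading.Thread(target=write_chunk, args=(f"out{i}.txt", range(100)))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that only open/close are serialized; the writes themselves run concurrently, which is exactly why this lock alone may not cover the other thread-safety issues mentioned below.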
It wasn't clear to me from the issue how you are using Windows. Are you using "nogil" Python on Windows? In general, I would not recommend doing so because I haven't built compatible packages, and some packages require patches to source code.
[1] There is a patched version of Cython that is thread-safe without the GIL. You get this version when you run pip install Cython, but that doesn't help with source distributions that include pre-Cythonized files.
@colesbury Thank you for the time you spent on this. As you found out, the whole operation of the threads will probably break the system or the data due to thread-safety issues... But I will try your suggestions anyway, since I do not have a better idea than creating a localhost server using ZMQ and distributing the data to dedicated archiver processes ;)
The case I presented easily shows that there are a lot of places in Python code that heavily rely on the GIL / are far from being thread-safe. There is a long way to go for nogil accommodation, but it is certainly worth the community's effort.
Regarding the nogil fork on Windows: yes, I managed to build nogil Python directly from source along with the required packages (https://github.com/pjurgielewicz/nogil-support-installers; a step-by-step installation guide is not included there yet). I searched your repositories for packages with nogil tags and tried to build them. For some other packages, I had to either get them directly from PyPI, pin the version installed by pip, or build them from the original source (fiddling a lot with tags). It was certainly not straightforward, but I established a build pipeline that works for us. The added value is that in the lab we run Linux and Windows machines simultaneously and can see whether our Dose-3D DAQ behaves any differently (there are a few Windows quirks, but they are easily handled with simple if statements).
@pjurgielewicz I've looked a bit more and the problems go beyond PyTables and to the underlying HDF5 C library:
At present, the HDF5 library is not thread safe. To allow its use by multi-threaded applications, in the “thread safe” build, the library is equipped with a global lock that allows only one thread into the library at a time – effectively making the entire HDF5 library a giant critical region.
From https://docs.hdfgroup.org/hdf5/rfc/RFC_multi_thread.pdf
So there's no real parallelism to be had by having multiple threads (in a single process) read or write HDF5 files, even if you wrote your whole application in C or C++. So I think something like the ZMQ strategy may be necessary if you need parallelism in reading or writing HDF5 files (and not just parallelism in other parts of your system).
Regarding Windows: Can you make the linked GitHub project public (it's currently private)? For context, the scripts to build wheels for Linux (and occasionally macOS) are here. It's probably not terribly useful for Windows, except that it contains links to nogil forks when necessary.
A few months ago, I had experimented with using GitHub actions to build wheels, with the intention of better supporting macOS and maybe Windows. I may revisit that again and having your installers for Windows might help.
@colesbury
So there's no real parallelism to be had by having multiple threads (in a single process) read or write HDF5 files, even if you wrote your whole application in C or C++. So I think something like the ZMQ strategy may be necessary if you need parallelism in reading or writing HDF5 files (and not just parallelism in other parts of your system).
It is not that terrible; we were planning to use ZMQ for dynamic data distribution anyway. On the other hand, this will slow down our system measurements (I can now start parallel measurements, but without data saving ;) ).
Sorry, I forgot that I had set this repository to private. It is now public, but there is nothing more there than a bunch of installers and pre-built site-packages to be copy-pasted into the right directory. As I mentioned, I have instructions for building everything from scratch (not published yet), but I can give no guarantee that they will always work or that this is the optimal way.
Hi, I have run into this problem while trying to use multiple pandas.HDFStore() objects in the same interpreter but in separate nogil threads. The use case is pretty simple: each thread operates on its own file. The minimum reproduction example is as follows:

I tested this on Windows and Ubuntu 22.04. On Linux it always segfaults, while on Windows the script may fail silently or end up with a whole range of different exceptions/outcomes:
or
Or one of the files might end up filled with data, or even both of them end up empty.
It looks to me like an issue in the C API of the HDF package.
As a cross-check, I tested the script above in different interpreter & threading/MP configurations:
On Ubuntu I tested this with the following set of packages:
While on Windows:
It would be awesome if you (@colesbury) could look into that.