dxdc opened 3 weeks ago
@amol- @rok
Any chance you're able to take a look at this issue? It's very simple to reproduce. The simplest case is running this stripped-down script with this file:
import pyarrow.csv as pv

# setting use_threads to False does not hang python
read_options = pv.ReadOptions(encoding="big5", use_threads=True)
parse_options = pv.ParseOptions(delimiter="|")

with open("sample.txt", "rb") as f:
    table = pv.read_csv(f, read_options=read_options, parse_options=parse_options)
There is a bug with threads and pyarrow. I now have an additional file I can use for testing on my side, and I'm also willing to dig into it if you have a sense of where the issue may lie.
It seems it might have something to do with the Read operation not getting properly aborted. The TransformInputStream::Read method doesn't do anything special to handle the case where the transformer has failed, so it isn't immediately obvious where the read operation would get aborted in that case.
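The missing-cleanup pattern can be sketched in Python (a hypothetical class for illustration, not pyarrow's actual API): a transforming stream whose read aborts the underlying source when the transform raises, so no other component keeps waiting on it.

```python
import io

# Hypothetical sketch of the pattern TransformInputStream::Read would need:
# abort the underlying stream as soon as the transform function fails.
class TransformStream:
    def __init__(self, raw, transform):
        self.raw = raw              # underlying binary stream
        self.transform = transform  # e.g. a decoding function
        self.aborted = False

    def read(self, nbytes):
        chunk = self.raw.read(nbytes)
        try:
            return self.transform(chunk)
        except Exception:
            # On transform failure, abort/close the underlying stream so
            # nothing else blocks on it, then propagate the error.
            self.abort()
            raise

    def abort(self):
        self.aborted = True
        self.raw.close()


def bad_decoder(chunk):
    # 0xFF is not a valid Big5 lead byte, so this raises UnicodeDecodeError
    return chunk.decode("big5")

stream = TransformStream(io.BytesIO(b"\xff\xff\xff"), bad_decoder)
try:
    stream.read(3)
except UnicodeDecodeError:
    pass
assert stream.aborted and stream.raw.closed
```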
The following patch seemed to fix the issue for me
diff --git a/cpp/src/arrow/io/transform.cc b/cpp/src/arrow/io/transform.cc
index 3fdf5a7a9..a8c40ee53 100644
--- a/cpp/src/arrow/io/transform.cc
+++ b/cpp/src/arrow/io/transform.cc
@@ -102,7 +102,11 @@ Result<int64_t> TransformInputStream::Read(int64_t nbytes, void* out) {
     const bool have_eof = (buf->size() == 0);
     // Even if EOF is met, let the transform function run a last time
     // (for example to flush internal buffers)
-    ARROW_ASSIGN_OR_RAISE(buf, impl_->transform_(std::move(buf)));
+    auto transform_result = impl_->transform_(std::move(buf));
+    if (!transform_result.ok()) {
+      RETURN_NOT_OK(this->Abort());
+    }
+    ARROW_ASSIGN_OR_RAISE(buf, std::move(transform_result));
     avail_size += buf->size();
     avail.push_back(std::move(buf));
     if (have_eof) {
but someone more confident with the IO part of the codebase should check this in more detail.
@amol- Thanks for your quick analysis. I had a feeling the issue might be more complex within the repo, and it looks like your findings point in that direction. On my end, setting use_threads=False seems to resolve the issue, so I believe it's rooted in thread management. I did manage to reproduce the error on a particular file that would fail intermittently, but unfortunately I no longer have access to it.
> On my end, setting use_threads=False seems to resolve the issue, so I believe it's rooted in thread management.
Yes, it does have to do with threading, or rather with the threads getting hung up waiting for some async future to complete. As you can't exactly kill threads, on shutdown the thread pool has to gently ask the threads to quit, but if a thread is stuck in some syscall (like waiting on a mutex) it will never notice that it has to quit and will hang there forever. Yesterday evening I didn't have time to investigate more closely the relationship between the components involved in async reading and the thread pool, but I'll try to get back to it as soon as I have time.
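The "a thread blocked on a mutex never sees the shutdown request" behavior described above can be demonstrated with plain Python threads (a standalone sketch, unrelated to pyarrow's internal thread pool):

```python
import threading

# Hold a lock so the worker thread blocks inside the mutex wait,
# mimicking a pool thread stuck on a never-completing async future.
lock = threading.Lock()
lock.acquire()

def worker():
    with lock:  # blocks until the lock is released
        pass

t = threading.Thread(target=worker)
t.start()

# Asking the thread to finish is futile while it waits on the mutex:
t.join(timeout=0.5)
assert t.is_alive()  # still stuck; a non-daemon thread in this state
                     # would hang interpreter shutdown forever

lock.release()  # only releasing the lock lets the worker finish
t.join()
assert not t.is_alive()
```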
Summary
When using pyarrow.csv.read_csv with ReadOptions(use_threads=True) and encountering a UnicodeDecodeError, Python hangs indefinitely during the shutdown process. This issue occurs consistently across multiple Python versions and pyarrow versions.

NOTE: I originally reported this here #43741 but now I have a working file that can be tested.

I hope that someone familiar with the internals of the pyarrow.csv module, particularly with the threading and shutdown procedures, can help identify and resolve this issue.
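For diagnosing where the interpreter is stuck, the standard faulthandler module can dump every thread's stack. This is a debugging suggestion on my part, not something from the original report:

```python
import faulthandler
import tempfile

# Dump every thread's stack into a temp file. In a truly hung process you
# would arm this ahead of time with faulthandler.dump_traceback_later(),
# or trigger it from outside via faulthandler.register() and a signal.
with tempfile.TemporaryFile() as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    trace = f.read().decode()

print(trace)  # shows which file/line each thread is blocked on
```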
Steps to Reproduce
1. Run the Python script shown at the top of this issue.
2. Use a file (sample.txt) that contains data in an encoding (e.g., Big5, Shift-JIS) likely to trigger a UnicodeDecodeError. NOTE: Minor edits to this file result in the issue no longer being reproducible.
3. Observe that the script prints "Program exited successfully." but then hangs indefinitely during the Python shutdown process.
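The kind of byte sequence that triggers the error in step 2 can be shown in isolation (illustrative bytes, not the actual contents of sample.txt):

```python
# 0xFF is not a valid Big5 lead byte, so decoding fails the same way
# pyarrow's internal decoding transform does.
data = b"col1|col2\n\xff\xfe|x\n"
try:
    data.decode("big5")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```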
Expected Behavior
The script should exit cleanly after execution, even if a UnicodeDecodeError occurs.

Actual Behavior
The script hangs indefinitely during the logging shutdown process after encountering a UnicodeDecodeError. This behavior is consistent when use_threads=True is set.

Output
The output includes a traceback ending with a UnicodeDecodeError, followed by a hang during the logging shutdown process. Below is the detailed Pdb step trace after the program exits:

Environment
pyarrow versions tested: 9.0.0 to 17.0.0

Additional Information
UnicodeDecodeError is raised during CSV parsing with use_threads=True. Disabling threading (use_threads=False) resolves the issue.

Suggested Priority
High - The hang is significant as it prevents Python from exiting cleanly, which could impact various applications relying on pyarrow for multi-threaded CSV processing.

Please let me know if additional information is required.
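Until the hang is fixed, one defensive workaround (my suggestion, not from the report) is to validate the encoding in pure Python before handing the file to pyarrow, so the UnicodeDecodeError surfaces outside the thread pool:

```python
import tempfile

def decodes_cleanly(path, encoding):
    """Return True if the whole file decodes with the given encoding."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Quick demonstration with a throwaway file containing an invalid
# Big5 byte (0xFF is not a valid lead byte):
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"col1|col2\nabc|\xff\n")
    bad_path = tf.name

assert not decodes_cleanly(bad_path, "big5")
```

Only call pv.read_csv(...) when the check passes, or fall back to ReadOptions(use_threads=False) otherwise. The extra pass over the file costs I/O, but it keeps the failure on the main thread.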
Component(s)
Python