Open lukehsiao opened 1 year ago
I'm taking a look at this. I have one debugging question for you in the meantime however. When I look at _upload_to_uri_with_exclude it appears that it is a recursive copy. Is there a possibility that you've included the current directory "." or parent directory ".." and that's why the operation never finishes? One way of determining this would be to insert a log/print at https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L238 to output the paths that are going to be copied from.
I'm taking a look at this. I have one debugging question for you in the meantime however. When I look at _upload_to_uri_with_exclude it appears that it is a recursive copy. Is there a possibility that you've included the current directory "." or parent directory ".." and that's why the operation never finishes?
In our case, exclude
is None, so that code path to _upload_to_uri_with_exclude
is actually never hit (and you'll notice it doesn't show in the flamegraph). So, it's essentially just the call to pyarrow.fs.copy_files
here:
I think this issue is a duplicate of #32372. I've added more details in that issue, but in a nutshell, pyarrow.fs.copy_files hangs for s3 buckets and with use_threads=True
if more files are uploaded than CPU cores available:
# On a 8 core Macbook
mkdir -p /tmp/pa-s3
cd /tmp/pa-s3
for i in {1..7}; do touch $i.txt; done
# This works
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
for i in {1..8}; do touch $i.txt; done
# This hangs forever
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
I could reproduce this for pyarrow 6 to 11.
It also only seems to come up for S3, not for GS.
This sounds very similar to nested parallelism deadlocks we have had in the past.
Outermost call: fork-join on a bunch of items (in this case it looks like we are doing fork-join on files) Inner task: fork-join on something else (e.g. in parquet it would be parquet column decoding)
If the inner-task is blocking on a "join" then it is wasting a thread pool thread. If enough of these thread pool threads get wasted then all thread pool threads are blocked waiting for other thread pool tasks and no thread is free to actually do the tasks.
The solution we adopted was to migrate to an async model so that "join" step becomes "return a future" instead of "block until done". This yields roughly the following rules:
It seems like the copy_files/s3 combination is violating one of the above rules. There is an OptionalParallelFor in CopyFiles which blocks but I think that is called from the user thread and so that is ok. @EpsilonPrime if you can reproduce I would grab a thread dump from gdb and check and see what the thread tasks are blocking on. The fix will probably be to move copy files over to using async APIs (internally).
I have written a reproduction testcase that detects the thread contention issue (and is ready to check in once the fix is ready). What is happening is that when copying a file (filesystem.cc:613) the CopyStream happens as expected and then is passed to the close routine to complete. That delegates to CloseAsync which handles uploading parts (calling UploadPart). To do this UploadPart then adds its work to the threadpool which overloads the executor. For the case of an 8 thread pool with 8 tasks (each small enough to fit in a single part) this ends up being 16 busy threads in a size 8 executor.
The easy solution is to limit the number of tasks to the pool (merely leaving one extra thread appears to be enough for the pool to empty although this needs verification). The second is to modify the close routine to take over the work of the existing thread (not be asynchronous). This would require reworking of at least 5 functions and might require even more work for the case where there are multiple parts per file (which we do not have a test for yet).
Describe the bug, including details regarding any error messages, version, and platform.
We are working on a python package that calls pyarrow's copy_files function to copy a local directory to S3. We notice that this seems to hang indefinitely for a directory, even though it works for individual files.
A simple reproducer seems to be:
Where ray just wraps pyarrow.fs.copy_files: https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L195
This results in the following flamegraph.
And an
strace
of that process looks likeThis is using
On Python 3.10.9, on a machine running Ubuntu 20.04.5 LTS.
Given the flamegraph, it seemed to be an issue with pyarrow, not ray, which is why I opened the ticket here.
Component(s)
Python