apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.6k stars 3.54k forks source link

[Python] pyarrow.fs.copy_files hangs indefinitely #15233

Open lukehsiao opened 1 year ago

lukehsiao commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

We are working on a python package that calls pyarrow's copy_files function to copy a local directory to S3. We notice that this seems to hang indefinitely for a directory, even though it works for individual files.

A simple reproducer seems to be:

from ray.air._internal.remote_storage import upload_to_uri
dir = "/some/directory/local"
uri = "s3://some/s3/bucket"
upload_to_uri(dir, uri)

Where ray just wraps pyarrow.fs.copy_files: https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L195

This results in the following flamegraph.

profile-idle

And an strace of that process looks like

...
stat("/some/directory/local/result.json", {st_mode=S_IFREG|0664, st_size=2345, ...}) = 0
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
futex(0x5580a3b563e4, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1639901
getpid()                                = 1639901
getpid()                                = 1639901
futex(0x5580a3d374e8, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTSTP {si_signo=SIGTSTP, si_code=SI_KERNEL} ---
--- stopped by SIGTSTP ---
+++ killed by SIGKILL +++

This is using

.venv ❯ pip show pyarrow                                                                                                                                                                                                    [23/01/6|17:08]
Name: pyarrow
Version: 8.0.0

On Python 3.10.9, on a machine running Ubuntu 20.04.5 LTS.

Given the flamegraph, it seemed to be an issue with pyarrow, not ray, which is why I opened the ticket here.

Component(s)

Python

EpsilonPrime commented 1 year ago

I'm taking a look at this. I have one debugging question for you in the meantime however. When I look at _upload_to_uri_with_exclude it appears that it is a recursive copy. Is there a possibility that you've included the current directory "." or parent directory ".." and that's why the operation never finishes? One way of determining this would be to insert a log/print at https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L238 to output the paths that are going to be copied from.

lukehsiao commented 1 year ago

I'm taking a look at this. I have one debugging question for you in the meantime however. When I look at _upload_to_uri_with_exclude it appears that it is a recursive copy. Is there a possibility that you've included the current directory "." or parent directory ".." and that's why the operation never finishes?

In our case, exclude is None, so that code path to _upload_to_uri_with_exclude is actually never hit (and you'll notice it doesn't show in the flamegraph). So, it's essentially just the call to pyarrow.fs.copy_files here:

https://github.com/ray-project/ray/blob/d7b2b49a962bf33dae7a50376f159ab15d80800f/python/ray/air/_internal/remote_storage.py#L209

krfricke commented 1 year ago

I think this issue is a duplicate of #32372. I've added more details in that issue, but in a nutshell, pyarrow.fs.copy_files hangs for s3 buckets and with use_threads=True if more files are uploaded than CPU cores available:

# On a 8 core Macbook
mkdir -p /tmp/pa-s3
cd /tmp/pa-s3 
for i in {1..7}; do touch $i.txt; done
# This works
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"
for i in {1..8}; do touch $i.txt; done  
# This hangs forever
python -c "import pyarrow.fs; pyarrow.fs.copy_files('/tmp/pa-s3', 's3://bucket/folder')"

I could reproduce this for pyarrow 6 to 11.

It also only seems to come up for S3, not for GS.

westonpace commented 1 year ago

This sounds very similar to nested parallelism deadlocks we have had in the past.

Outermost call: fork-join on a bunch of items (in this case it looks like we are doing fork-join on files) Inner task: fork-join on something else (e.g. in parquet it would be parquet column decoding)

If the inner-task is blocking on a "join" then it is wasting a thread pool thread. If enough of these thread pool threads get wasted then all thread pool threads are blocked waiting for other thread pool tasks and no thread is free to actually do the tasks.

The solution we adopted was to migrate to an async model so that "join" step becomes "return a future" instead of "block until done". This yields roughly the following rules:

It seems like the copy_files/s3 combination is violating one of the above rules. There is an OptionalParallelFor in CopyFiles which blocks but I think that is called from the user thread and so that is ok. @EpsilonPrime if you can reproduce I would grab a thread dump from gdb and check and see what the thread tasks are blocking on. The fix will probably be to move copy files over to using async APIs (internally).

EpsilonPrime commented 1 year ago

I have written a reproduction testcase that detects the thread contention issue (and is ready to check in once the fix is ready). What is happening is that when copying a file (filesystem.cc:613) the CopyStream happens as expected and then is passed to the close routine to complete. That delegates to CloseAsync which handles uploading parts (calling UploadPart). To do this UploadPart then adds its work to the threadpool which overloads the executor. For the case of an 8 thread pool with 8 tasks (each small enough to fit in a single part) this ends up being 16 busy threads in a size 8 executor.

The easy solution is to limit the number of tasks to the pool (merely leaving one extra thread appears to be enough for the pool to empty although this needs verification). The second is to modify the close routine to take over the work of the existing thread (not be asynchronous). This would require reworking of at least 5 functions and might require even more work for the case where there are multiple parts per file (which we do not have a test for yet).