apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.44k stars 3.52k forks

copy_files does not work for S3 -> local #40300

Closed: clumsy closed this 7 months ago

clumsy commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The sample from the official docs does not seem to work in v15.0.0:

fs.copy_files("s3://registry.opendata.aws/roda/ndjson/index.ndjson",
              "file:///{}/index_copy.ndjson".format(local_path))

With local_path = "/tmp/fs_test" I get:

In [15]: local_path = "/tmp/fs_test"

In [16]: fs.copy_files("s3://registry.opendata.aws/roda/ndjson/index.ndjson",
    ...:               "file:///{}/index_copy.ndjson".format(local_path))
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[16], line 1
----> 1 fs.copy_files("s3://registry.opendata.aws/roda/ndjson/index.ndjson",
      2               "file:///{}/index_copy.ndjson".format(local_path))

File ~/.local/lib/python3.10/site-packages/pyarrow/fs.py:276, in copy_files(source, destination, source_filesystem, destination_filesystem, chunk_size, use_threads)
    272     _copy_files_selector(source_fs, source_sel,
    273                          destination_fs, destination_path,
    274                          chunk_size, use_threads)
    275 else:
--> 276     _copy_files(source_fs, source_path,
    277                 destination_fs, destination_path,
    278                 chunk_size, use_threads)

File ~/.local/lib/python3.10/site-packages/pyarrow/_fs.pyx:1614, in pyarrow._fs._copy_files()

File ~/.local/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

FileNotFoundError: [Errno 2] Failed to open local file '//tmp/fs_test/index_copy.ndjson'. Detail: [errno 2] No such file or directory

I'm not a C++ expert, but I traced the implementation up to the point where pyarrow calls OpenOutputStream with the destination path shown in the error. That path does not exist before the call, pyarrow did not create it, and pyarrow does not seem to tolerate it being missing.

This looks like a bug, but I wanted to double-check that users are not expected to create the destination paths in the target file system themselves.

Component(s)

Python

clumsy commented 7 months ago

OK, in this particular case the issue was that the local_path directory didn't exist. If I create it first, copying files and empty directories into it works. But what creates the nested directories when the copy runs recursively (as advertised in the documentation)? I currently get the same error, just for the nested path. I don't even know these nested paths in advance, for example when I iterate over a non-recursive file selector and pass the top-level paths to copy_files.
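For the single-file case, the workaround of creating the destination directory first can be sketched roughly as below. This is only an illustration under some assumptions: ensure_local_parent and copy_s3_file_to_local are hypothetical helper names (not part of pyarrow), the destination is a POSIX path or file:// URI, and the S3 object is readable with whatever credentials or region settings are already configured. It does not address the recursive case.

```python
import os
from urllib.parse import urlparse


def ensure_local_parent(destination: str) -> str:
    """Create the parent directory of a local destination and return a plain path.

    Hypothetical helper: accepts a plain POSIX path or a file:// URI.
    """
    parsed = urlparse(destination)
    path = parsed.path if parsed.scheme == "file" else destination
    # "file:///{}".format("/tmp/x") yields a path starting with "//";
    # collapse the extra leading slashes so the result is an ordinary absolute path.
    if path.startswith("//"):
        path = "/" + path.lstrip("/")
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path


def copy_s3_file_to_local(source_uri: str, destination: str) -> None:
    """Copy one object to a local file, creating the target directory first."""
    # Imported lazily so the helper above stays usable without pyarrow installed.
    from pyarrow import fs

    local_path = ensure_local_parent(destination)
    fs.copy_files(source_uri, local_path,
                  destination_filesystem=fs.LocalFileSystem())
```

With local_path = "/tmp/fs_test", calling copy_s3_file_to_local("s3://registry.opendata.aws/roda/ndjson/index.ndjson", "file:///{}/index_copy.ndjson".format(local_path)) would create /tmp/fs_test before pyarrow tries to open the output file.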