apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Listing files with S3FileSystem is slow #25019

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result so that it includes all objects but no "groupings" (directories). This dramatically lowers the number of HTTP calls, because a single paginated listing replaces one request per directory.


diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
     if (!prefix.empty()) {
       req.SetPrefix(ToAwsString(prefix) + kSep);
     }
-    req.SetDelimiter(Aws::String() + kSep);
+    // req.SetDelimiter(Aws::String() + kSep);
     req.SetMaxKeys(kListObjectsMaxKeys);

     while (true) {

The suggested change is to add an option to Selector, e.g. no_directory_result or something similar.

Reporter: Francois Saint-Jacques / @fsaintjacques

Note: This issue was originally created as ARROW-8884. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Related: ARROW-10788

westonpace commented 1 year ago

Mentioned in #34213

I have no idea what the implications are.

@pitrou

I attempted some investigation. In the AWS CLI it appears that the delimiter is only used when the listing is non-recursive.

In S3FS the delimiter is skipped when looking for a file recursively.

In the S3 docs it states:

If you issue a list request with a delimiter, you can browse your hierarchy at only one level, skipping over and summarizing the (possibly millions of) keys nested at deeper levels.

My conclusion is that the delimiter's purpose is to reduce the number of files returned when you do not need to retrieve all the files. If we are doing a recursive listing then I think it is consistent with other projects and S3's intentions that we do not specify the delimiter.
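
To make the request-count argument concrete, here is a minimal sketch using a toy in-memory model of a bucket. Everything in it (FakeBucket, List, ListRecursiveWithDelimiter, ListFlat, MakeSampleBucket) is hypothetical and written for illustration; it is not the Arrow or AWS SDK API, and it ignores pagination and empty-directory markers. It only shows how the delimiter rolls nested keys up into common prefixes, which forces one request per directory during a recursive walk.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy stand-in for one list response (pagination ignored).
struct ListResult {
  std::vector<std::string> keys;             // objects at this level
  std::vector<std::string> common_prefixes;  // "groupings" (directories)
};

// Hypothetical in-memory bucket that counts list requests.
struct FakeBucket {
  std::set<std::string> objects;
  int requests = 0;

  ListResult List(const std::string& prefix, const std::string& delimiter) {
    ++requests;
    ListResult out;
    std::set<std::string> seen;
    for (const auto& key : objects) {
      if (key.compare(0, prefix.size(), prefix) != 0) continue;
      if (!delimiter.empty()) {
        auto pos = key.find(delimiter, prefix.size());
        if (pos != std::string::npos) {
          // Keys below the next delimiter are summarized as one prefix.
          auto group = key.substr(0, pos + delimiter.size());
          if (seen.insert(group).second) out.common_prefixes.push_back(group);
          continue;
        }
      }
      out.keys.push_back(key);
    }
    return out;
  }
};

// Recursive walk with a delimiter: one request per directory.
std::vector<std::string> ListRecursiveWithDelimiter(FakeBucket& bucket,
                                                    const std::string& prefix) {
  ListResult res = bucket.List(prefix, "/");
  std::vector<std::string> all = res.keys;
  for (const auto& dir : res.common_prefixes) {
    auto sub = ListRecursiveWithDelimiter(bucket, dir);
    all.insert(all.end(), sub.begin(), sub.end());
  }
  return all;
}

// Recursive listing without a delimiter: a single request in this model.
std::vector<std::string> ListFlat(FakeBucket& bucket, const std::string& prefix) {
  return bucket.List(prefix, "").keys;
}

// Sample data: five objects spread over four directories plus the root.
FakeBucket MakeSampleBucket() {
  FakeBucket b;
  b.objects = {"5", "a/1", "a/b/2", "a/b/c/3", "d/4"};
  return b;
}
```

On the sample bucket, the delimiter-based walk issues one request for the root and one for each of a/, a/b/, a/b/c/ and d/, while the flat listing returns the same keys from a single request. Against real S3 the flat listing would still be paginated, so the saving is one request per directory rather than a guarantee of exactly one call.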