apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Writing to Cloudflare R2 fails for multipart upload #34363

Closed legout closed 1 year ago

legout commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

When I try to write a pyarrow.Table to the Cloudflare R2 object store, I get an error when files are larger than a certain size (I do not know the exact threshold) and pyarrow internally switches to multipart uploading.

I've used s3fs.S3FileSystem (fsspec) and also tried pyarrow.fs.S3FileSystem. Here is some example code:

import pyarrow.fs as pafs
import s3fs
import pyarrow.parquet as pq
import pyarrow as pa

table_small = pq.read_table("small_data.parquet")
table_large = pq.read_table("large_data.parquet")

fs1 = s3fs.S3FileSystem(
  key="some_key", 
  secret="some_secret", 
  client_kwargs=dict(endpoint_url="https://123456.r2.cloudflarestorage.com"), 
  s3_additional_kwargs=dict(ACL="private") # <- this is necessary for writing.
) 

fs2 = pafs.S3FileSystem(
  access_key="some_key", 
  secret_key="some_secret", 
  endpoint_override="https://123456.r2.cloudflarestorage.com"
)

pq.write_table(table_small, "test/test.parquet", filesystem=fs1) # <- works 
pq.write_table(table_small, "test/test.parquet", filesystem=fs2) # <- works 

#  failed with OSError: [Errno 22] There was a problem with the multipart upload. 
pq.write_table(table_large, "test/test.parquet", filesystem=fs1) 

# failed with OSError: When initiating multiple part upload for key 'test.parquet' in bucket 'test': AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: curlCode: 28, Timeout was reached
pq.write_table(table_large, "test/test.parquet", filesystem=fs2) 

Platform

Linux x86

Versions

pyarrow 11.0.0
s3fs 2023.1.0

Component(s)

Parquet, Python

westonpace commented 1 year ago

pyarrow.fs.S3FileSystem will always use multi-part upload. However, it seems that the multi-part upload is timing out. It seems odd that CreateMultipartUpload would fail: this request should be trivial, as it simply sets up the upload and doesn't involve any transmission of data.
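
A minimal sketch to isolate CreateMultipartUpload with boto3 against the same endpoint (endpoint, bucket, key, and credentials below are placeholders). If this call also times out, the problem is likely network- or endpoint-related rather than anything Arrow does with the upload parts.

import boto3

# Placeholders: use the real R2 endpoint, bucket, key, and credentials.
client = boto3.client(
    "s3",
    endpoint_url="https://123456.r2.cloudflarestorage.com",
    aws_access_key_id="some_key",
    aws_secret_access_key="some_secret",
)

# Only initiate the multipart upload; no data is transmitted here.
resp = client.create_multipart_upload(Bucket="test", Key="test.parquet")
print("UploadId:", resp["UploadId"])

# Abort so the unfinished upload does not linger in the bucket.
client.abort_multipart_upload(Bucket="test", Key="test.parquet", UploadId=resp["UploadId"])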

legout commented 1 year ago

Ok. Thanks for the information.

Maybe these are helpful:

https://github.com/duckdb/duckdb/pull/6439 https://github.com/duckdb/duckdb/issues/5685

https://developers.cloudflare.com/r2/data-access/s3-api/api/

westonpace commented 1 year ago

Thank you for the extra information, but I'm not sure it is sufficient. The DuckDB issues do not apply: DuckDB has (I think) implemented its own S3 client, while Arrow uses the AWS SDK. So if there is a bug in URL construction, it is an SDK problem and not an Arrow problem. I would be surprised if this were the cause.

The R2 page seems to suggest that CreateMultipartUpload is supported. Also, if this were an R2 compatibility problem, I would expect an "invalid request" error. The error you are getting is "timeout was reached", which seems more likely to be caused by a poor internet connection, a firewall, or a misconfiguration.
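
It might also be worth ruling out the client-side timeouts; pyarrow.fs.S3FileSystem accepts connect_timeout and request_timeout (in seconds). A minimal sketch with arbitrary values:

import pyarrow.fs as pafs

# Generous timeouts (in seconds) to rule out slow connection setup; values are arbitrary.
fs2 = pafs.S3FileSystem(
    access_key="some_key",
    secret_key="some_secret",
    endpoint_override="https://123456.r2.cloudflarestorage.com",
    connect_timeout=60,
    request_timeout=120,
)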

legout commented 1 year ago

I think it is somehow related to R2, because I am able to run this script (https://github.com/apache/arrow/issues/34363#issue-1601167816) against other S3 object stores (tested with Wasabi, Contabo S3, IDrive e2, Storj, AWS S3 and self-hosted MinIO) without any problems.

One more piece of information: I am even able to upload large_data.parquet using s3fs with fs1.put_file("large_data.parquet", "test/test.parquet"), and it also works with the AWS CLI (which also uses the SDK?).
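
A possible workaround sketch based on that observation (paths are illustrative): write the Parquet file locally first, then copy the finished file to R2 with s3fs, which works against this endpoint.

import pyarrow.parquet as pq

# Write the table to a local file, then copy the finished file with s3fs,
# bypassing Arrow's multipart output stream (fs1 and table_large as defined above).
pq.write_table(table_large, "large_data.parquet")
fs1.put_file("large_data.parquet", "test/test.parquet")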

I do understand that this issue cannot be solved within Arrow, so we can probably close it here. However, I'd like to find out what causes this error. Is it possible to run pyarrow commands in a "debugging mode" to get more details?

legout commented 1 year ago

I ran this script (https://github.com/apache/arrow/issues/34363#issue-1601167816) again and it failed with another OSError:

OSError: When completing multiple part upload for key 'test2.parquet' in bucket 'test': AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: InvalidPart Message: There was a problem with the multipart upload.

westonpace commented 1 year ago

I'd like to find out what causes this error. Is it possible to run pyarrow commands in a "debugging mode" to get more details?

Try running this before you do anything (before you import pyarrow.fs):

import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)

You can also try log levels Debug, Info, Warn. I think it logs to stdout or stderr.
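
The same switch is also available via the public pyarrow.fs namespace (assuming a recent enough pyarrow), which avoids touching the private _s3fs module:

from pyarrow.fs import S3LogLevel, initialize_s3

# Enable trace-level S3 logging before any S3 filesystem is created.
initialize_s3(S3LogLevel.Trace)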

legout commented 1 year ago

Here is some new information from a Cloudflare developer:

https://github.com/duckdb/duckdb/issues/5685

wjones127 commented 1 year ago

@elithrar in https://github.com/duckdb/duckdb/issues/5685#issuecomment-1456641554 you said

If the parts aren't equal (minus the last), then the multi-part upload will fail

Should we take that to mean that R2 requires all parts except the last one to be the same size in bytes? We treat the part size as a minimum and dynamically increase the size of the parts. This works well for S3.

https://github.com/apache/arrow/blob/28ca876dc41b696bda8159daa1c4e9be2b799c48/cpp/src/arrow/filesystem/s3fs.cc#L1410-L1423
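
A purely illustrative sketch (not Arrow's actual algorithm; the real logic is in the linked s3fs.cc): if the part size is only a floor and the buffer is flushed once it exceeds that floor, consecutive parts generally come out unequal, which is the case R2 appears to reject.

# Hypothetical illustration only, not Arrow's implementation.
MIN_PART = 5 * 1024 * 1024  # 5 MiB, the usual S3 minimum part size

def buffered_parts(write_sizes, min_part=MIN_PART):
    parts, buffered = [], 0
    for size in write_sizes:
        buffered += size
        if buffered >= min_part:
            parts.append(buffered)  # flush whatever is buffered
            buffered = 0
    if buffered:
        parts.append(buffered)      # trailing partial part
    return parts

print(buffered_parts([3_000_000, 4_000_000, 6_000_000]))  # [7000000, 6000000]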

wjones127 commented 1 year ago

FWIW I am able to reproduce the error by uploading one part of 7 MB and one of 6 MB, so it does seem like this is the issue.
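
For reference, a hedged boto3 sketch of that reproduction (endpoint, bucket, and credentials are placeholders): upload a 7 MB part and then a 6 MB part and complete the upload; against R2 the final call is expected to fail as described above, while against S3 it succeeds.

import boto3

client = boto3.client(
    "s3",
    endpoint_url="https://123456.r2.cloudflarestorage.com",
    aws_access_key_id="some_key",
    aws_secret_access_key="some_secret",
)

mpu = client.create_multipart_upload(Bucket="test", Key="unequal.bin")
parts = []
for number, size in enumerate([7 * 1024 * 1024, 6 * 1024 * 1024], start=1):
    resp = client.upload_part(
        Bucket="test",
        Key="unequal.bin",
        UploadId=mpu["UploadId"],
        PartNumber=number,
        Body=b"\0" * size,
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# Expected to raise an InvalidPart-style error on R2 because the parts differ in size.
client.complete_multipart_upload(
    Bucket="test",
    Key="unequal.bin",
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)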

westonpace commented 1 year ago

Good catch

legout commented 1 year ago

FWIW I am able to reproduce the error by uploading one part of 7 MB and one of 6 MB, so it does seem like this is the issue.

I am able to upload files below 10 MB. For example, I was able to upload a 9.64 MB file, but it failed for a 10.14 MB file.

wjones127 commented 1 year ago

It seems like we can either

  1. Wait for Cloudflare to update R2 to have the same behavior as S3 and Minio
  2. Add a configuration option to make multi-part uploads always use exactly equal part sizes.

It's possible (2) might be rather invasive, so I'd rather wait a bit to see if (1) happens. But if someone is really motivated to get R2 working now, I think we would accept a PR implementing (2).

pitrou commented 1 year ago

@elithrar Sorry for the ping, but would you have the time to answer @wjones127's question here?

elithrar commented 1 year ago

(For context: I work at Cloudflare)

Here's the latest:

This doesn't mean we're not open to relaxing this, but it's a non-trivial change.

westonpace commented 1 year ago

This seems like a legitimate request and pretty workable. We are pretty close already. The code in ObjectOutputStream is roughly...

# current behavior (simplified): the buffer is flushed whenever it exceeds
# part_limit, so the submitted parts are not all the same size
if request > part_limit:
  submit_request(request)
  return
buffer.append(request)
if buffer > part_limit:
  submit_request(buffer)
  buffer.reset()

Given we are already talking about cloud upload and I/O, I think we can just directly implement the equal-parts approach (instead of trying to maintain both) without too much of a performance hit (though there will be some, since this introduces a mandatory extra copy of the data in some cases). This would change the above logic to:

buffer.append(request)
# slice off as many whole part_limit-sized chunks as the buffer holds;
# any remainder stays buffered for the next write or the final flush
for chunk in slice_off_whole_chunks(buffer, part_limit):
  submit_request(chunk)
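
For concreteness, a minimal runnable sketch of that equal-parts buffering (the class and the submit_part callback are placeholders; Arrow's real implementation is the C++ ObjectOutputStream):

class EqualPartBuffer:
    """Buffers writes and emits parts of exactly part_limit bytes (except the last)."""

    def __init__(self, part_limit, submit_part):
        self.part_limit = part_limit
        self.submit_part = submit_part  # callable that uploads one part
        self.buffer = bytearray()

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        # Slice off as many whole, equally sized chunks as the buffer holds.
        while len(self.buffer) >= self.part_limit:
            self.submit_part(bytes(self.buffer[:self.part_limit]))
            del self.buffer[:self.part_limit]

    def close(self) -> None:
        # Only the final part may be smaller than part_limit.
        if self.buffer:
            self.submit_part(bytes(self.buffer))
            self.buffer.clear()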

Does anyone want to create a PR?

wjones127 commented 1 year ago

I'm looking into this now.

elithrar commented 1 year ago

Thank you, @wjones127!

BitPhinix commented 2 weeks ago

Still happens on latest with large files from time to time