legout closed this issue 1 year ago
`pyarrow.fs.S3FileSystem` will always use multi-part upload. However, it seems that the multi-part upload is timing out? It seems odd that `CreateMultipartUpload` would fail: this request should be trivial, as it simply sets up the upload and doesn't involve any transmission of data.
Ok. Thanks for the information.
Maybe this might be helpful.
https://github.com/duckdb/duckdb/pull/6439 https://github.com/duckdb/duckdb/issues/5685
https://developers.cloudflare.com/r2/data-access/s3-api/api/
Thank you for the extra information, but I'm not sure it is sufficient. The DuckDB issues do not apply: DuckDB has (I think) implemented its own S3 client, while Arrow uses the SDK. So if there is a bug in URL construction it is an SDK problem and not an Arrow problem. I would be surprised if this were the cause.
The R2 page seems to suggest that CreateMultipartUpload is supported. Also, if this were an R2 compatibility problem then I would expect an "invalid request" error. The error you are getting is "timeout was reached" which seems more likely to be caused by a poor internet connection, a firewall, or a misconfiguration.
I think it is somehow related to R2, because I am able to run this script (https://github.com/apache/arrow/issues/34363#issue-1601167816) using other s3 object stores (tested with wasabi, contabo s3, idrive e2, storj, aws s3 and self hosted minio) without any problems.
One more piece of information: I am even able to upload `large_data.parquet` using s3fs with `fs1.put_file("large_data.parquet", "test/test.parquet")`, and it is also possible with the AWS CLI (which also uses the SDK?).
I do understand that this issue cannot be solved within Arrow, so we can probably close it here. However, I'd like to find out what causes this error. Is it possible to run pyarrow commands in a "debugging mode" to get more details?
I ran this script (https://github.com/apache/arrow/issues/34363#issue-1601167816) again and it failed with a different OSError:
```
OSError: When completing multiple part upload for key 'test2.parquet' in bucket 'test': AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: InvalidPart Message: There was a problem with the multipart upload.
```
> I'd like to find out what causes this error. Is it possible to run pyarrow commands in a "debugging mode" to get more details?
Try running this before you do anything (before you import `pyarrow.fs`):

```python
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
```

You can also try the log levels `Debug`, `Info`, and `Warn`. I think it logs to stdout or stderr.
Here is some new information from a Cloudflare developer.
@elithrar, in https://github.com/duckdb/duckdb/issues/5685#issuecomment-1456641554 you said:

> If the parts aren't equal (minus the last), then the multi-part upload will fail
Should we take that to mean that R2 requires that the parts, except for the last one, must have an equal size in bytes? We treat the part size as a minimum and dynamically increase the size of the parts. This works well for S3.
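If that reading is right, the rule can be modeled in a few lines. This validator is purely illustrative (it is not R2's actual server-side check, and `r2_accepts_parts` is a made-up name), but it captures the difference between R2's apparent requirement and S3's minimum-size-only rule:

```python
MB = 1024 * 1024

def r2_accepts_parts(part_sizes, min_part=5 * MB):
    """Model of the rule described above: every part except the last
    must have the same size (and meet the usual 5 MiB S3 minimum).
    S3 proper only enforces the minimum, not equality."""
    head = part_sizes[:-1]
    return all(s == head[0] and s >= min_part for s in head)

# Equal 7 MB parts plus a smaller final part: accepted.
print(r2_accepts_parts([7 * MB, 7 * MB, 3 * MB]))   # True
# Unequal non-final parts: rejected, matching the InvalidPart error above.
print(r2_accepts_parts([7 * MB, 6 * MB, 3 * MB]))   # False
```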
FWIW I am able to reproduce the error by uploading one part of 7MB and one with 6MB, so it does seem like this is the issue.
Good catch
> FWIW I am able to reproduce the error by uploading one part of 7MB and one with 6MB, so it does seem like this is the issue.
I am able to upload files below 10 MB. For example, uploading a 9.64 MB file succeeded, but a 10.14 MB file failed.
It seems like we can either:

1. Wait to see if R2 relaxes the equal-part-size requirement, or
2. Change Arrow's upload logic to produce equal-size parts.
It's possible (2) might be rather invasive, so I'd rather wait a bit to see if (1) happens. But if someone is really motivated to get R2 working now, I think we would accept a PR implementing (2) now.
@elithrar Sorry for the ping, but would you have the time to answer @wjones127's question here?
(For context: I work at Cloudflare)
Here's the latest:
> This doesn't mean we're not open to relaxing this, but it's a non-trivial change.
This seems like a legitimate request and pretty workable. We are pretty close already. The code in `ObjectOutputStream` is roughly:

```
if request > part_limit:
    submit_request(request)
    return
buffer.append(request)
if buffer > part_limit:
    submit_request(buffer)
    buffer.reset()
```
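Fleshed out as a runnable Python sketch (the names here are illustrative; the real implementation is C++ inside Arrow's S3 filesystem), the current behavior looks roughly like this. Note how the resulting parts end up with different sizes:

```python
class VariablePartWriter:
    """Sketch of the buffering described above: an oversized write is
    sent as its own part, while smaller writes accumulate until the
    buffer crosses part_limit. Parts can therefore differ in size."""

    def __init__(self, part_limit):
        self.part_limit = part_limit
        self.buffer = bytearray()
        self.parts = []          # stands in for submit_request()

    def write(self, data):
        if len(data) > self.part_limit:
            self.parts.append(bytes(data))   # oversized write: own part
            return
        self.buffer.extend(data)
        if len(self.buffer) > self.part_limit:
            self.parts.append(bytes(self.buffer))
            self.buffer.clear()

    def close(self):
        if self.buffer:
            self.parts.append(bytes(self.buffer))
            self.buffer.clear()

w = VariablePartWriter(part_limit=5)
for chunk in [b"aaa", b"bbb", b"cccccccc"]:
    w.write(chunk)
w.close()
print([len(p) for p in w.parts])   # → [6, 8]: unequal parts, which R2 rejects
```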
Given that we are already talking about cloud uploads and I/O, I think we can just directly implement the equal-parts approach (instead of trying to maintain both) without too much of a performance hit (though there will be some, since this introduces a mandatory extra copy of the data in some cases). This would change the above logic to:

```
buffer.append(request)
for chunk in slice_off_whole_chunks(buffer, part_limit):
    submit_request(chunk)
```
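As a runnable sketch of that variant (again with illustrative names; `slice_off_whole_chunks` is hypothetical and inlined as a loop here), every submitted part except the final flush has exactly `part_limit` bytes:

```python
class EqualPartWriter:
    """Sketch of the proposed logic: all writes go through the buffer,
    and only whole part_limit-sized chunks are submitted, so every part
    except the final flush has exactly the same size."""

    def __init__(self, part_limit):
        self.part_limit = part_limit
        self.buffer = bytearray()
        self.parts = []          # stands in for submit_request()

    def write(self, data):
        self.buffer.extend(data)
        # slice_off_whole_chunks: emit full-size chunks, keep the remainder
        while len(self.buffer) >= self.part_limit:
            self.parts.append(bytes(self.buffer[:self.part_limit]))
            del self.buffer[:self.part_limit]

    def close(self):
        if self.buffer:
            self.parts.append(bytes(self.buffer))  # final, possibly smaller
            self.buffer.clear()

w = EqualPartWriter(part_limit=5)
for chunk in [b"aaa", b"bbb", b"cccccccc"]:
    w.write(chunk)
w.close()
print([len(p) for p in w.parts])   # → [5, 5, 4]: equal except the last
```

The extra copy mentioned above shows up in the slicing step: writes that straddle a chunk boundary must be copied into the buffer before being re-sliced.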
Does anyone want to create a PR?
I'm looking into this now.
Thank you, @wjones127!
This still happens on the latest version with large files from time to time.
Describe the bug, including details regarding any error messages, version, and platform.
When I try to write a pyarrow Table to a Cloudflare R2 object store, I get an error when files are larger than x MB (I do not know the exact threshold) and pyarrow internally switches to multipart uploading. I've used `s3fs.S3FileSystem` (fsspec) and also tried `pyarrow.fs.S3FileSystem`. Here is some example code:

Platform
Linux x86
Versions
pyarrow 11.0.0, s3fs 2023.1.0
Component(s)
Parquet, Python