fsspec / s3fs

S3 Filesystem
http://s3fs.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Checksum failing with Multipart Uploads #672

Open coolacid opened 1 year ago

coolacid commented 1 year ago

When sending multipart uploads with an S3 integrity checksum enabled, the upload fails with an error saying the checksum is missing for one of the parts.

I was able to enable ChecksumAlgorithm by passing s3_additional_kwargs to the S3FileSystem initialisation, e.g.:

s3 = s3fs.S3FileSystem(key=access_key, secret=secret_key, s3_additional_kwargs={"ChecksumAlgorithm": "SHA256"})

Sending a larger file, which goes through a multipart upload, yields the following traceback:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 112, in _error_wrapper
    return await func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/aiobotocore/client.py", line 358, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidRequest) when calling the CompleteMultipartUpload operation: The upload was created using a sha256 checksum. The complete request must include the checksum for each part. It was missing for part 1 in the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/....", line 97, in execute
    fobj_construct2parquet(incoming, content_id, filesystem=s3fs)
  File "/opt/.....", line 67, in fobj_construct2parquet
    pq.write_table(table, fobj, filesystem=filesystem)
  File "/home/airflow/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2941, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/airflow/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 956, in __exit__
    self.close()
  File "/home/airflow/.local/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 1029, in close
    self.file_handle.close()
  File "pyarrow/io.pxi", line 180, in pyarrow.lib.NativeFile.close
  File "/home/airflow/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1741, in close
    self.flush(force=True)
  File "/home/airflow/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1612, in flush
    if self._upload_chunk(final=force) is not False:
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 2173, in _upload_chunk
    self.commit()
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 2201, in commit
    write_result = self._call_s3(
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 2040, in _call_s3
    return self.fs.call_s3(method, self.s3_additional_kwargs, *kwarglist, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/fsspec/asyn.py", line 113, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/fsspec/asyn.py", line 98, in sync
    raise return_result
  File "/home/airflow/.local/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 339, in _call_s3
    return await _error_wrapper(
  File "/home/airflow/.local/lib/python3.9/site-packages/s3fs/core.py", line 139, in _error_wrapper
    raise err
OSError: [Errno 22] The upload was created using a sha256 checksum. The complete request must include the checksum for each part. It was missing for part 1 in the request.

coolacid commented 1 year ago

Digging.

When the parts are uploaded to S3, S3 returns the corresponding checksum for each one. For example (from a debugging print statement):

Line 2174: out={'ResponseMetadata': {'RequestId': '...', 'HostId': '...', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '...', 'x-amz-request-id': '...', 'date': 'Sun, 11 Dec 2022 18:16:14 GMT', 'etag': '"..."', 'x-amz-checksum-sha256': 'JYVhopFfH6xHs34sIajC/UdtzVojFMP1zktFGsGw8h0=', 'x-amz-server-side-encryption': 'AES256', 'server': 'AmazonS3', 'content-length': '0', 'connection': 'close'}, 'RetryAttempts': 0}, 'ServerSideEncryption': 'AES256', 'ETag': '"..."', 'ChecksumSHA256': 'JYVhopFfH6xHs34sIajC/UdtzVojFMP1zktFGsGw8h0='}
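For anyone wanting to sanity-check the value S3 returns: the x-amz-checksum-sha256 / ChecksumSHA256 field is the base64-encoded SHA-256 digest of the part body. A minimal sketch of computing it locally follows; the function name s3_sha256_checksum is made up for illustration and is not part of s3fs or botocore.

```python
import base64
import hashlib


def s3_sha256_checksum(part_body: bytes) -> str:
    """Return the base64-encoded SHA-256 digest of one part's bytes.

    Hypothetical helper for illustration only; this is the same encoding
    S3 uses for the ChecksumSHA256 field of an UploadPart response.
    """
    digest = hashlib.sha256(part_body).digest()  # 32 raw bytes
    return base64.b64encode(digest).decode("ascii")  # 44-character string


if __name__ == "__main__":
    # Compare this against the ChecksumSHA256 returned for the same bytes.
    print(s3_sha256_checksum(b"example part body"))
```

If the locally computed string matches the one in the response, the part arrived intact.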

This checksum should be sent back in the CompleteMultipartUpload call. See https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html

A quick and dirty test, adding the ChecksumSHA256 here https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2171, gets me a working system.
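To illustrate the idea (this is a rough sketch, not the actual change made in s3fs): each entry in the Parts list passed to CompleteMultipartUpload can carry the checksum field that the matching UploadPart response returned. The helper name part_entry and the CHECKSUM_KEYS tuple below are hypothetical; the key names themselves are the ones the S3 API uses for part checksums.

```python
# Checksum fields that an UploadPart response may contain and that
# CompleteMultipartUpload accepts per part.
CHECKSUM_KEYS = ("ChecksumCRC32", "ChecksumCRC32C", "ChecksumSHA1", "ChecksumSHA256")


def part_entry(part_number: int, upload_part_response: dict) -> dict:
    """Build one element of MultipartUpload["Parts"].

    Hypothetical helper: copies the ETag plus whichever checksum field
    S3 returned for this part, and ignores everything else.
    """
    entry = {"PartNumber": part_number, "ETag": upload_part_response["ETag"]}
    for key in CHECKSUM_KEYS:
        if key in upload_part_response:
            entry[key] = upload_part_response[key]
    return entry


if __name__ == "__main__":
    # Trimmed-down response in the shape shown in the debug output above;
    # the ETag value here is a placeholder.
    out = {
        "ETag": '"placeholder-etag"',
        "ChecksumSHA256": "JYVhopFfH6xHs34sIajC/UdtzVojFMP1zktFGsGw8h0=",
        "ServerSideEncryption": "AES256",
    }
    print(part_entry(1, out))
```

Copying only the checksum keys that are present means uploads created without ChecksumAlgorithm would produce the same part entries as before.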

martindurant commented 1 year ago

Thanks for looking into this. It sounds like you have a solution - should this be put into a PR?

coolacid commented 1 year ago

As much as I'd like to PR this, my method wouldn't be as clean as the rest of the code.

martindurant commented 1 year ago

It is worthwhile having something that works! Perhaps I can help make it fit with the style of the rest of the code, or else it can serve as a public workaround for those who need it.