Thanks for raising awareness here @ryanovas. It looks like the lock is at 1.36.1.
This is interesting since our test suite runs against MinIO and it didn't show up there. Do you know how to replicate this?
In my case I was using a DigitalOcean Spaces bucket and Lakekeeper, then tried to follow the getting-started guide from the docs with the taxi data. When I got to the append step I started getting the errors.
This occurs for me as well against OCI Object Storage using the S3-compatible API. With a plain Boto3 request I can set the client config to disable the checksum validation they added: https://github.com/boto/boto3/issues/4398#issuecomment-2619946229
```python
from botocore.config import Config

config = Config(
    request_checksum_calculation='WHEN_REQUIRED',   # required for the Content-Length error
    response_checksum_validation='WHEN_REQUIRED',   # required for the Content-Length error
)
```
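For context, here is a minimal sketch of how a config like that would be handed to a plain boto3 client against an S3-compatible endpoint; the endpoint URL, credentials, bucket, and key below are placeholders, not from the original report:

```python
import boto3
from botocore.config import Config

# Placeholders: substitute your own endpoint, credentials, bucket, and key.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<your-s3-compatible-endpoint>",
    config=Config(
        request_checksum_calculation="when_required",   # lowercase matches the documented values
        response_checksum_validation="when_required",
    ),
)
s3.put_object(Bucket="example-bucket", Key="example.txt", Body=b"hello")
```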
The issue on botocore was closed because there are apparently params you can pass to turn off the new signing stuff, but I'm not sure how we can apply those flags here given botocore is so many layers deep. Perhaps we just need to wait until S3-compatible providers update their headers to match? That doesn't seem ideal though; we can't stay on 1.35 forever.
> The issue on botocore was closed because there are apparently params you can pass to turn off the new signing stuff, but I'm not sure how we can apply those flags here given botocore is so many layers deep.
We pass configs down to s3fs; I think it would be useful to allow passing arbitrary configs to s3fs so that we don't need to explicitly define all of s3fs's S3FileSystem args.
@ryanovas do you know if that option is available in s3fs yet? I couldn't find a match.
I think that would be the move as well: let those of us who are on S3-compatible storage pass in anything extra we need.
Sadly I don't know much about s3fs. A quick search of their issues brings up this: https://github.com/fsspec/s3fs/issues/855 but I don't think that's the same error (or that just turning off SSL like they suggest is ever a good idea).
It might be necessary to open an issue with them if the option isn't present in their docs. I can try to help out with this more next week.
It is discussed briefly here https://github.com/fsspec/s3fs/issues/931, though it might be best to track it as a separate issue if necessary. It looks from the docs like it should be passed into config_kwargs in s3fs, and if arbitrary args are allowed in pyiceberg they can be left up to the user.
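If that pans out, a rough sketch of what passing it through s3fs directly might look like; the endpoint and credentials are placeholders, and this assumes a botocore new enough (≥ 1.36) to understand these options:

```python
import s3fs

# config_kwargs is forwarded by s3fs to the underlying (aio)botocore client Config.
fs = s3fs.S3FileSystem(
    key="<access-key>",
    secret="<secret-key>",
    client_kwargs={"endpoint_url": "https://<your-s3-compatible-endpoint>"},
    config_kwargs={
        "request_checksum_calculation": "when_required",
        "response_checksum_validation": "when_required",
    },
)
fs.ls("example-bucket")  # placeholder bucket
```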
I'm open to locking it to <1.36, since a lot of people are experiencing issues here.
IMO locking the version seems like an ok temp fix, but the better fix is likely to figure out how to get the extra params down to botocore. I commented in the s3fs issue with a link to the issue in botocore that mentions the configs to disable this on the botocore side.
Edit: According to the s3fs folks there is an env var we can try, and if it fixes the issue they will figure out a way to expose it programmatically. Then on the pyiceberg side it should be possible to resolve this without pinning the version. I will attempt to try this in the next day or two unless someone beats me to it.
Just an update: setting the environment variable as follows fixes the issue: `export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'`
I updated the s3fs issue as well. I guess now we wait for next steps. In the meantime this will fix it for people without having to downgrade (a Python version of the same workaround is sketched below).
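For anyone who prefers to set it from Python rather than the shell, a small sketch of the same workaround; the catalog name is a placeholder and the env var just needs to be set before any S3 client is created:

```python
import os

# Set before any S3 client is created so botocore picks it up.
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "when_required"

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # placeholder; catalog config comes from .pyiceberg.yaml or env vars
```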
In the s3fs issue it looks like there's a config we can pass? I believe it was suggested already by @maarek but confirmed in this comment on the issue: https://github.com/fsspec/s3fs/issues/931#issuecomment-2637565349
Thanks for keeping track of this @ryanovas. I think we can do an if-else based on the botocore version that's installed: if it's ≥ 1.36, then we should pass in this parameter.
The relevant code can be found here: https://github.com/apache/iceberg-python/blob/dfbee4b0023ff8442c7d27194dc2edc66b15142d/pyiceberg/io/fsspec.py#L157
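This is not the actual pyiceberg code at that link, but a rough, illustrative sketch of the kind of version gate being discussed; the function name and structure are made up, and it assumes the options are only passed when the installed botocore understands them:

```python
from importlib.metadata import version

from packaging.version import Version
import s3fs


def _make_s3_filesystem(client_kwargs: dict) -> s3fs.S3FileSystem:
    """Illustrative only: gate the new checksum options on the installed botocore version."""
    config_kwargs = {}
    if Version(version("botocore")) >= Version("1.36.0"):
        # Only botocore >= 1.36 knows these options (and only it needs them).
        config_kwargs["request_checksum_calculation"] = "when_required"
        config_kwargs["response_checksum_validation"] = "when_required"
    return s3fs.S3FileSystem(client_kwargs=client_kwargs, config_kwargs=config_kwargs)
```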
I might suggest we allow for a way to set a flag, maybe when we init a catalog or something similar (a rough sketch of what that could look like is below), since the issue is only present for S3-compatible storage and AWS folks probably want to keep the extra signing headers. Also, it's botocore that has the version incompatibility, which is a couple of layers deep (s3fs -> aiobotocore -> botocore), so it feels a little fragile IMO to tie this to the botocore version.
If you agree and can point me to where we could add this extra flag, I could try to make a PR to address it. Also, is there a contributing guide anywhere?
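Purely to illustrate the idea, here is what a catalog-level opt-out could look like; the `s3.request-checksum-calculation` property is made up for this sketch and does not exist in pyiceberg today, and the URI/endpoint are placeholders:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",                          # placeholder REST catalog
        "s3.endpoint": "https://<your-s3-compatible-endpoint>",  # placeholder endpoint
        "s3.request-checksum-calculation": "when_required",      # hypothetical flag, not a real option
    },
)
```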
Apache Iceberg version
0.8.1 (latest release)
Please describe the bug 🐞
Here is a link to the relevant boto issue: https://github.com/boto/boto3/issues/4398
Attempting to use `table.append` or `table.overwrite` with the botocore 1.36.x version installed by default alongside s3fs causes a very confusing error: `botocore.exceptions.ClientError: An error occurred (MissingContentLength) when calling the PutObject operation: None`
The workaround is to downgrade back to botocore 1.35.99 manually.
I'm unsure if there's work to be done on the pyiceberg side to resolve this, but I am posting this issue for others like me who are googling this error and struggling to find answers.
I might also suggest pyiceberg consider a library that doesn't rely on the AWS-controlled boto and instead uses something more S3-agnostic, like s2cmd, for those of us using things like DigitalOcean, MinIO, Backblaze, etc.