kevinjqliu opened 2 weeks ago
There may be a similar issue for GCS/Azure, since we only cache one instance of each FileSystem.
Hey @kevinjqliu, I would be happy to work on this task. Thanks
@danhphan assigned to you! LMK if you have any questions
Thanks @kevinjqliu, I'm reading the code base.
Can you please give me an example of the expected unit tests for this feature, if possible? For instance, suppose we create the following s3_fileio with "s3.region": "us-east-1" in the session_properties, and then create an input_file on the s3 bucket warehouse, which is actually located in the "eu-central-1" region. What should the expected results be?
import uuid

import pyarrow.fs
from pyiceberg.io.pyarrow import PyArrowFileIO
from pyiceberg.typedef import Properties

session_properties: Properties = {
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.region": "us-east-1",
    "s3.session-token": "s3.session-token",
    **UNIFIED_AWS_SESSION_PROPERTIES,  # from the existing test fixtures
}
s3_fileio = PyArrowFileIO(properties=session_properties)
print(s3_fileio.properties['s3.region']) #--> us-east-1
filename = str(uuid.uuid4())
input_file = s3_fileio.new_input(location=f"s3://warehouse/{filename}")
print(pyarrow.fs.resolve_s3_region('warehouse')) #--> eu-central-1
output_file = s3_fileio.new_output(location=f"s3://foo/{filename}")
print(pyarrow.fs.resolve_s3_region('foo')) #--> us-east-1
I'm thinking that maybe in the def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem from your comment above, we could assign the value of "region" in client_kwargs based on the value of netloc (the s3 bucket), but I'm not sure if that is the right direction. Something like: "region": pyarrow.fs.resolve_s3_region(netloc), (see the rough sketch after the snippet below).
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }
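For illustration only, here is a rough sketch of that idea (the helper name, exception handling, and fallback behavior are my guesses, not existing pyiceberg code): resolve the region from the bucket, and fall back to the configured property when resolution fails, e.g. for minio or other custom endpoints.

from typing import Optional
from pyarrow.fs import resolve_s3_region

def _resolve_region(netloc: Optional[str], configured_region: Optional[str]) -> Optional[str]:
    # Hypothetical helper: ask S3 which region the bucket lives in.
    if netloc:
        try:
            return resolve_s3_region(netloc)
        except (OSError, ValueError):
            # Resolution can fail for custom endpoints (e.g. minio) or missing buckets;
            # fall back to the region configured in the FileIO properties.
            pass
    return configured_region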
Thank you.
what should be the expected results?
Given 2 files in different regions, I want to read them transparently without knowing which region they belong to. Currently, we create a single PyArrow FS for S3 that is region-specific. We could either create multiple region-specific PyArrow FS instances or create them on the fly.
pyarrow.fs.resolve_s3_region can help determine which region to use. However, we cannot currently override the endpoint for minio in tests; see apache/arrow#43713.
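For example (the bucket name is just illustrative), resolving a bucket's region is a single call against real AWS S3, but per apache/arrow#43713 there is no way to point it at a custom endpoint such as minio:

import pyarrow.fs

# Queries S3 for the bucket's region and returns it as a string.
region = pyarrow.fs.resolve_s3_region("warehouse")  # e.g. "eu-central-1"

# Caveat (apache/arrow#43713): this call has no endpoint override today,
# so it only works against real AWS S3, not against minio in the integration tests.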
"s3.region": "us-east-1",
Perhaps we also need to think about configuration. Setting the region property here assumes that the FileIO will be specific to a single region.
Apache Iceberg version
None
Please describe the bug 🐞
Problem
I want to read files from multiple s3 regions. For example, my metadata files are in us-west-2 but my data files are in us-east-1. This is not possible currently.
Context
Reading a file in pyarrow requires a location and a file system implementation, fs. For example, location="s3://blah/foo.parquet" and fs=S3FileSystem. https://github.com/apache/iceberg-python/blob/0cebec48833f75eeca02b1a965112615b1cbc1c8/pyiceberg/io/pyarrow.py#L404-L419
The fs is used to access the files in s3, and it is initialized with the given S3_REGION according to the S3 configuration. https://github.com/apache/iceberg-python/blob/0cebec48833f75eeca02b1a965112615b1cbc1c8/pyiceberg/io/pyarrow.py#L347-L365
This means only 1 S3 region is allowed.
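For illustration (the bucket and file names are made up), the pattern looks roughly like this, with the filesystem pinned to a single region:

import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

# One S3FileSystem, pinned to a single region by the S3 configuration.
fs = S3FileSystem(region="us-west-2")

# Reads work for buckets in us-west-2, but fail for buckets in other regions.
table = pq.read_table("blah/foo.parquet", filesystem=fs)  # note: no "s3://" scheme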
Possible Solution
Create multiple instances of S3FileSystem, one for each region, and fetch the corresponding instance based on the location. pyarrow.fs.resolve_s3_region(bucket) can determine the correct region.
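A minimal sketch of what that could look like (the function and cache names are hypothetical): parse the bucket out of the location, resolve its region, and keep one S3FileSystem per region.

from urllib.parse import urlparse
from pyarrow.fs import S3FileSystem, resolve_s3_region

# Hypothetical cache: one S3FileSystem per region, created lazily.
_fs_by_region: dict[str, S3FileSystem] = {}

def fs_for_location(location: str) -> S3FileSystem:
    bucket = urlparse(location).netloc   # "s3://warehouse/file" -> "warehouse"
    region = resolve_s3_region(bucket)   # e.g. "us-west-2" or "us-east-1"
    if region not in _fs_by_region:
        _fs_by_region[region] = S3FileSystem(region=region)
    return _fs_by_region[region]

# Metadata files in us-west-2 and data files in us-east-1 can then each be read
# through the filesystem returned for their own location, instead of one fixed region.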