[bug] read from multiple s3 regions

kevinjqliu commented 2 weeks ago

Apache Iceberg version

None

Please describe the bug 🐞

Problem

I want to read files from multiple s3 regions. For example, my metadata files are in us-west-2 but my data files are in us-east-1. This is not possible currently.

Context

Reading a file in pyarrow requires a location and a file system implementation, fs. For example, location="s3://blah/foo.parquet" and fs=S3FileSystem. https://github.com/apache/iceberg-python/blob/0cebec48833f75eeca02b1a965112615b1cbc1c8/pyiceberg/io/pyarrow.py#L404-L419

The fs is used to access the files in S3 and is initialized with the given S3_REGION according to the S3 configuration. https://github.com/apache/iceberg-python/blob/0cebec48833f75eeca02b1a965112615b1cbc1c8/pyiceberg/io/pyarrow.py#L347-L365

This means a single FileIO can only access one S3 region.

Possible Solution

Create multiple instances of S3FileSystem, one for each region, and fetch the corresponding instance based on the location. pyarrow.fs.resolve_s3_region(bucket) can determine the correct region.
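A minimal sketch of this idea, not the pyiceberg implementation: `RegionAwareS3Cache`, `fs_for`, and the two default helpers are hypothetical names introduced only for illustration. The region lookup and the filesystem factory are injectable so the caching logic can be exercised without network access; the defaults assume `pyarrow.fs.resolve_s3_region` and `pyarrow.fs.S3FileSystem(region=...)`.

```python
from typing import Any, Callable, Dict
from urllib.parse import urlparse


def _default_resolve_region(bucket: str) -> str:
    # Assumption: pyarrow is installed; this call contacts S3 to find the bucket's region.
    from pyarrow.fs import resolve_s3_region

    return resolve_s3_region(bucket)


def _default_make_fs(region: str) -> Any:
    # Assumption: credentials come from the default AWS credential chain.
    from pyarrow.fs import S3FileSystem

    return S3FileSystem(region=region)


class RegionAwareS3Cache:
    """Hypothetical helper that keeps one filesystem object per S3 region."""

    def __init__(
        self,
        resolve_region: Callable[[str], str] = _default_resolve_region,
        make_fs: Callable[[str], Any] = _default_make_fs,
    ) -> None:
        self._resolve_region = resolve_region
        self._make_fs = make_fs
        self._bucket_region: Dict[str, str] = {}
        self._fs_by_region: Dict[str, Any] = {}

    def fs_for(self, location: str) -> Any:
        # For "s3://blah/foo.parquet" the bucket is the netloc, "blah".
        bucket = urlparse(location).netloc
        if bucket not in self._bucket_region:
            # Resolve each bucket once, since the lookup may be a network call.
            self._bucket_region[bucket] = self._resolve_region(bucket)
        region = self._bucket_region[bucket]
        if region not in self._fs_by_region:
            self._fs_by_region[region] = self._make_fs(region)
        return self._fs_by_region[region]
```

With this shape, two buckets in the same region share one filesystem instance, while a metadata bucket in us-west-2 and a data bucket in us-east-1 each get their own.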

kevinjqliu commented 2 weeks ago

There may be a similar issue for GCS/Azure, since we only cache one instance of each FileSystem.

danhphan commented 6 days ago

Hey @kevinjqliu, I will be happy to work on this task. Thanks

kevinjqliu commented 6 days ago

@danhphan assigned to you! LMK if you have any questions

danhphan commented 5 days ago

Thanks @kevinjqliu , I'm reading the code base.

Can you please give me an example of the expected unit tests for this feature, if possible? For instance, suppose we create the following s3_fileio with "s3.region": "us-east-1" in session_properties, and then create an input_file in the warehouse S3 bucket, which is actually located in the "eu-central-1" region. What should the expected result be?

session_properties: Properties = {
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.region": "us-east-1",
    "s3.session-token": "s3.session-token",
    **UNIFIED_AWS_SESSION_PROPERTIES,
}

s3_fileio = PyArrowFileIO(properties=session_properties)
print(s3_fileio.properties['s3.region']) #--> us-east-1

filename = str(uuid.uuid4())
input_file = s3_fileio.new_input(location=f"s3://warehouse/{filename}")
print(pyarrow.fs.resolve_s3_region('warehouse')) #--> eu-central-1

output_file = s3_fileio.new_output(location=f"s3://foo/{filename}")
print(pyarrow.fs.resolve_s3_region('foo')) #--> us-east-1

I'm thinking that in def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem, linked in your comment above, we could set "region" in client_kwargs based on netloc (the S3 bucket), but I'm not sure this is the right direction.

For example: "region": pyarrow.fs.resolve_s3_region(netloc),

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }
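To make the suggestion concrete, here is a rough standalone sketch of the region choice. `s3_client_kwargs` is a hypothetical helper, not part of pyiceberg; it reads the "s3.*" property keys directly instead of using get_first_property_value with the AWS_* fallbacks, and it takes the region lookup as a parameter (in practice this could be pyarrow.fs.resolve_s3_region) so the logic can be checked without network access.

```python
from typing import Any, Callable, Dict, Optional


def s3_client_kwargs(
    properties: Dict[str, str],
    netloc: Optional[str],
    resolve_region: Callable[[str], str],
) -> Dict[str, Any]:
    """Hypothetical helper: prefer the bucket's own region over the configured one."""
    configured_region = properties.get("s3.region")
    # Only look up the region when a bucket name (netloc) is actually available.
    region = resolve_region(netloc) if netloc else configured_region
    return {
        "endpoint_override": properties.get("s3.endpoint"),
        "access_key": properties.get("s3.access-key-id"),
        "secret_key": properties.get("s3.secret-access-key"),
        "session_token": properties.get("s3.session-token"),
        "region": region,
    }
```

Under this sketch, a FileIO configured with "s3.region": "us-east-1" would still get "eu-central-1" for a bucket that lives there, and the configured value is used only when no bucket is known.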

Thank you.

kevinjqliu commented 5 days ago

what should be the expected results?

Given 2 files in different regions, I want to read them transparently without knowing which region they belong to. Currently, we create a single PyArrow FS for S3 that is region-specific. We could either create multiple region-specific PyArrow FS instances up front or create them on the fly.

pyarrow.fs.resolve_s3_region can help determine which region to use. However, we cannot currently override the endpoint for minio in tests. See apache/arrow#43713
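One possible way to live with that limitation is to skip the lookup whenever a custom endpoint is configured and fall back to the configured region. The sketch below is only an illustration: `region_for_bucket` is a hypothetical name, the lookup is passed in as a callable (for example pyarrow.fs.resolve_s3_region), and it assumes a failed lookup surfaces as an OSError.

```python
from typing import Callable, Optional


def region_for_bucket(
    bucket: str,
    configured_region: Optional[str],
    endpoint: Optional[str],
    resolve_region: Callable[[str], str],
) -> Optional[str]:
    """Hypothetical helper: pick a region, falling back to the configured one."""
    # With a custom endpoint (e.g. MinIO in tests) the lookup cannot be
    # pointed at that endpoint, so trust the configured region instead.
    if endpoint is not None:
        return configured_region
    try:
        return resolve_region(bucket)
    except OSError:
        # Assumption: lookup failures are raised as OSError.
        return configured_region
```

This keeps MinIO-based tests deterministic, at the cost of not exercising the multi-region path against a real endpoint.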

"s3.region": "us-east-1",

Perhaps we also need to think about configuration. Setting the region property here assumes that the FileIO is specific to one region.
