apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Bucket name getting appended to minIO service name #908

Open ArijitSinghEDA opened 1 month ago

ArijitSinghEDA commented 1 month ago

Question

I am running Iceberg in a Dockerized environment with a REST catalog, storing table data as Parquet files via PyArrow on a local MinIO server under the bucket "iceberg-bucket".

When I use IP addresses, everything works fine:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "iceberg_rest_catalog",
    **{
        "uri": "http://0.0.0.0:8228",
        "s3.endpoint": "http://0.0.0.0:9033",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "user",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1"
    }
)

But when I use the service names (as defined in the docker-compose.yaml files for both Iceberg and MinIO), I get this error:

pyiceberg.exceptions.ServerError: SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.

This is how I initialize the catalog with service names:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "iceberg_rest_catalog",
    **{
        "uri": "http://rest-catalog:8252",
        "s3.endpoint": "http://iceberg-minio:9044",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "user",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1"
    }
)

On further investigation, this happens because it tries to access the MinIO server with the bucket name prefixed to the hostname, i.e., iceberg-bucket.iceberg-minio, which should not be the case.
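The prefixed hostname described above is what virtual-hosted-style S3 addressing produces. A minimal sketch (plain Python, not PyIceberg internals; the endpoint and bucket names are taken from the configuration above) of how the two addressing styles build a request URL:

```python
# Sketch: how the two S3 addressing styles build a request URL for the
# same object. "iceberg-minio" and "iceberg-bucket" come from the config
# above; the helper itself is illustrative only.
from urllib.parse import urlsplit, urlunsplit


def s3_url(endpoint: str, bucket: str, key: str, path_style: bool) -> str:
    parts = urlsplit(endpoint)
    if path_style:
        # Path-style: the bucket goes in the path; the host stays resolvable.
        return urlunsplit((parts.scheme, parts.netloc, f"/{bucket}/{key}", "", ""))
    # Virtual-hosted-style: the bucket becomes a host prefix, which only
    # works if DNS can resolve "<bucket>.<host>".
    return urlunsplit((parts.scheme, f"{bucket}.{parts.netloc}", f"/{key}", "", ""))


endpoint = "http://iceberg-minio:9044"
print(s3_url(endpoint, "iceberg-bucket", "data/file.parquet", path_style=True))
# http://iceberg-minio:9044/iceberg-bucket/data/file.parquet
print(s3_url(endpoint, "iceberg-bucket", "data/file.parquet", path_style=False))
# http://iceberg-bucket.iceberg-minio:9044/data/file.parquet
```

Inside a Docker network, `iceberg-minio` resolves but `iceberg-bucket.iceberg-minio` does not, which matches the UnknownHostException above.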

kevinjqliu commented 1 month ago

What does your docker-compose.yaml look like? It's likely a configuration issue. I'd suggest starting with a known working Docker configuration (such as https://github.com/apache/iceberg-python/blob/main/dev/docker-compose-integration.yml#L57) and working from there.

Based on the docker-compose-integration.yml above, there are several MinIO-specific settings needed.
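For reference, a simplified sketch of the MinIO service from the linked compose file (details may differ slightly from the current file): MINIO_DOMAIN enables virtual-hosted-style requests, and the network alias makes the bucket-prefixed hostname resolvable by other containers.

```yaml
# Simplified from dev/docker-compose-integration.yml (check the linked
# file for the authoritative version).
services:
  minio:
    image: minio/minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio   # enables virtual-hosted-style addressing
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio  # makes "<bucket>.minio" resolvable
```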

ArijitSinghEDA commented 1 month ago

Hi @kevinjqliu, the shared example is very insightful, but my issue is that I already have a MinIO service serving other tasks, and I want to use that existing service rather than create a new one. All my containers have separate compose files, but they all run on the same network, including the one for PyIceberg. I just cannot find a reason why it is prefixing the bucket name to the service name here.

As for the docker-compose.yaml file I am using:

version: "2"
services:
  local-pyiceberg:
    build: .
    container_name: local-pyiceberg
    ports:
      - "8046:80"
    volumes:
      - /opt/local/:/opt/local
networks:
  default:
    external:
      name: local-zone_default

kevinjqliu commented 1 month ago

@ArijitSinghEDA Something I noticed about the error message

pyiceberg.exceptions.ServerError: SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.

Specifically, ServerError suggests that this is an issue with the REST server: https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20ServerError&type=code

The load_catalog code above is considered the "client" code. The real issue might be with the REST server.

ArijitSinghEDA commented 1 month ago

@kevinjqliu yes, I agree. As I said before, it is the REST server that prefixes the bucket name to the MinIO service name, which is why it cannot make any connection to the MinIO server.

kevinjqliu commented 1 month ago

This is likely an issue with path-style vs. virtual-hosted-style S3 access: https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#virtual-hosted-style-access

Maybe an S3 config on the server side: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#addressing-style
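A hedged sketch of a server-side fix, assuming the REST catalog runs the tabulario/iceberg-rest image (which maps CATALOG_* environment variables to catalog properties, with double underscores translating to dashes): force path-style access so the bucket is never prefixed to the host.

```yaml
# Assumption: rest-catalog uses the tabulario/iceberg-rest image, where
# CATALOG_S3_PATH__STYLE__ACCESS maps to the s3.path-style-access property.
# Service and port names are taken from the thread above.
services:
  rest-catalog:
    image: tabulario/iceberg-rest
    environment:
      - CATALOG_S3_ENDPOINT=http://iceberg-minio:9044
      - CATALOG_S3_PATH__STYLE__ACCESS=true  # bucket in path, not hostname
```

Alternatively, keeping virtual-hosted style would require MinIO to be reachable as iceberg-bucket.iceberg-minio (e.g., via MINIO_DOMAIN plus a Docker network alias, as in the integration compose file).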