apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

PyIceberg - MetaException(message='java.lang.IllegalArgumentException: bucket is null/empty') #1165

Open malopezh opened 2 months ago

malopezh commented 2 months ago

Apache Iceberg version

0.7.1 (latest release)

Please describe the bug 🐞

Problem: Trying to create a table in OCI Object Storage. The metadata is created successfully, but the data is not.

Expected: The full Iceberg table structure is created. Actual: only the metadata is created.

StackTrace:

Traceback (most recent call last):
  File "/home/marcolo/development/reorgParquets/.venv/lib/python3.10/site-packages/pyiceberg/catalog/__init__.py", line 418, in create_table_if_not_exists
    return self.create_table(identifier, schema, location, partition_spec, sort_order, properties)
  File "/home/marcolo/development/reorgParquets/.venv/lib/python3.10/site-packages/pyiceberg/catalog/hive.py", line 376, in create_table
    self._create_hive_table(open_client, tbl)
  File "/home/marcolo/development/reorgParquets/.venv/lib/python3.10/site-packages/pyiceberg/catalog/hive.py", line 325, in _create_hive_table
    open_client.create_table(hive_table)
  File "/home/marcolo/development/reorgParquets/.venv/lib/python3.10/site-packages/hive_metastore/ThriftHiveMetastore.py", line 3431, in create_table
    self.recv_create_table()
  File "/home/marcolo/development/reorgParquets/.venv/lib/python3.10/site-packages/hive_metastore/ThriftHiveMetastore.py", line 3457, in recv_create_table
    raise result.o3
hive_metastore.ttypes.MetaException: MetaException(message='java.lang.IllegalArgumentException: bucket is null/empty')

Iceberg Catalog:

`local_catalog = load_catalog(
    name="s3",
    uri="thrift://localhost:9083",
    warehouse="s3a://my_bucket",
    catalog_type="hadoop",
    **{
        "s3.endpoint": "https://my-endpoint.com",
        "s3.access-key-id": "myAccess-Key",
        "s3.secret-access-key": "mySecret-Key",
        "s3.session.token": "myToken",
        "bucket_name": "s3a://my_bucket",
        "hive.hive2-compatible": "true",
    },
)`

kevinjqliu commented 2 months ago

Thanks for reporting this. Can you add example code showing how you created the table?

malopezh commented 2 months ago

> Thanks for reporting this. Can you add example code showing how you created the table?

Hello!

Sure, here is the code:

`schema = Schema(
    NestedField(field_id=1, name="datetime", field_type=StringType(), required=False, current_schema=1),
    NestedField(field_id=2, name="symbol", field_type=StringType(), required=False, current_schema=1),
    NestedField(field_id=3, name="bid", field_type=FloatType(), required=False, current_schema=1),
    NestedField(field_id=4, name="ask", field_type=DoubleType(), required=False, current_schema=1),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="datetime_day")
)

from pyiceberg.table.sorting import SortOrder, SortField
from pyiceberg.transforms import IdentityTransform

sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))

identifier = ("iceberg", "default")

tbl = local_catalog.create_table_if_not_exists(
    identifier=identifier,
    schema=schema,
    location="s3a://my_oci_bucket/my_folder",
    partition_spec=partition_spec,
    sort_order=sort_order,
    properties={},
)

tbl.overwrite(df)`

NOTE: metadata is being created successfully

Thanks!!

kevinjqliu commented 2 months ago

> line 3457, in recv_create_table raise result.o3 hive_metastore.ttypes.MetaException: MetaException(message='java.lang.IllegalArgumentException: bucket is null/empty')

This error does not come from PyIceberg; it possibly comes from your underlying (Hadoop) filesystem: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java#L1102

> catalog_type= "hadoop",

This is also not a valid catalog type: https://github.com/apache/iceberg-python/blob/40357476ad0a79f7486b96a6e29b404bc699b70d/pyiceberg/catalog/__init__.py#L177-L183

> "bucket_name": "s3a://my_bucket",

bucket_name is not a valid catalog property: https://py.iceberg.apache.org/configuration/#catalogs
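
The points above can be checked mechanically before calling load_catalog. The following is a minimal sketch, not part of PyIceberg: check_catalog_properties is a hypothetical helper, and KNOWN_CATALOG_TYPES and KNOWN_KEYS are hand-copied, deliberately partial lists based on the PyIceberg configuration page, so treat them as assumptions. Under those assumptions, the documented session token key is s3.session-token, which means s3.session.token from the report is flagged as well.

```python
from __future__ import annotations

# Hypothetical pre-flight check; not part of PyIceberg.
# Both sets are assumptions copied by hand from the PyIceberg
# configuration docs and are deliberately incomplete.
KNOWN_CATALOG_TYPES = {"rest", "hive", "glue", "dynamodb", "sql"}
KNOWN_KEYS = {
    "type",
    "uri",
    "warehouse",
    "s3.endpoint",
    "s3.region",
    "s3.access-key-id",
    "s3.secret-access-key",
    "s3.session-token",
    "hive.hive2-compatible",
}


def check_catalog_properties(props: dict[str, str]) -> list[str]:
    """Return human-readable problems found in a catalog property mapping."""
    problems = []
    catalog_type = props.get("type")
    if catalog_type is not None and catalog_type.lower() not in KNOWN_CATALOG_TYPES:
        problems.append(f"unknown catalog type: {catalog_type!r}")
    for key in sorted(props):
        if key not in KNOWN_KEYS:
            problems.append(f"unrecognized property: {key!r}")
    return problems


# The properties from the original report (placeholder values as posted).
reported = {
    "uri": "thrift://localhost:9083",
    "warehouse": "s3a://my_bucket",
    "catalog_type": "hadoop",
    "s3.endpoint": "https://my-endpoint.com",
    "s3.access-key-id": "myAccess-Key",
    "s3.secret-access-key": "mySecret-Key",
    "s3.session.token": "myToken",
    "bucket_name": "s3a://my_bucket",
    "hive.hive2-compatible": "true",
}

for problem in check_catalog_properties(reported):
    print(problem)
```

A check like this only catches misspelled or unsupported keys on the Python side; it says nothing about how the Hive Metastore itself is configured.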

kevinjqliu commented 2 months ago

> uri="thrift://localhost:9083",

Is this an HMS? I think the error comes from the HMS setup.

malopezh commented 2 months ago

> uri="thrift://localhost:9083",
>
> Is this an HMS? I think the error comes from the HMS setup.

Yes, it's HMS. I configured Hadoop, Hive, and the Hive Metastore service, and I also configured MySQL. I was able to create a new namespace with local_catalog.create_namespace("myNS"), but I have obviously missed something.

My goal is to create Iceberg tables in OCI Object Storage. Is there any documentation I can check to achieve this?

Thanks for your responses.

kevinjqliu commented 2 months ago

> My goal is to create Iceberg tables in OCI Object Storage. Is there any documentation I can check to achieve this?

I don't know of any OCI-related documentation. However, here is a guide to setting up a catalog and writing to it: https://py.iceberg.apache.org/#connecting-to-a-catalog

I suggest getting that working and then replacing the catalog with your own.

Since you are using HMS, you should be using the Hive catalog (https://py.iceberg.apache.org/configuration/#hive-catalog), or similarly:

load_catalog(..., catalog_type= "hive")
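
As a minimal sketch of that suggestion, the properties from the original report could be rearranged as below. Several things here are assumptions rather than confirmed fixes: the sketch uses type as the property key for the catalog type, since that is the key shown in the PyIceberg configuration docs (with a thrift:// URI the type is otherwise inferred); it uses s3.session-token for the token; and the endpoint, bucket, and credentials are the placeholders from the report. hive_catalog_properties and connect are hypothetical helpers, and any S3A settings needed on the HMS side for OCI's S3-compatible endpoint are not covered.

```python
from __future__ import annotations


def hive_catalog_properties(
    uri: str,
    warehouse: str,
    endpoint: str,
    access_key: str,
    secret_key: str,
    session_token: str | None = None,
) -> dict[str, str]:
    """Build keyword properties for a Hive (HMS) catalog on S3-compatible storage."""
    props = {
        "type": "hive",  # assumed key for the catalog type
        "uri": uri,
        "warehouse": warehouse,
        "s3.endpoint": endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }
    if session_token:
        props["s3.session-token"] = session_token
    return props


# Placeholder values taken from the report; replace with real ones.
props = hive_catalog_properties(
    uri="thrift://localhost:9083",
    warehouse="s3a://my_bucket",
    endpoint="https://my-endpoint.com",
    access_key="myAccess-Key",
    secret_key="mySecret-Key",
    session_token="myToken",
)


def connect(name: str = "oci"):
    """Open the catalog; needs pyiceberg installed and a reachable HMS."""
    # Imported lazily so the property builder above works without pyiceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **props)
```

Calling connect() only opens the Thrift connection to the metastore. The "bucket is null/empty" exception is raised on the HMS side, so it may persist until the metastore's own filesystem configuration can resolve the bucket.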