StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

Add multi-level namespace support in Iceberg REST catalog to allow direct access to Snowflake's Iceberg tables #52451

Open marc-marketparts opened 3 weeks ago

marc-marketparts commented 3 weeks ago

Feature request

Is your feature request related to a problem? Please describe. Snowflake's internal Iceberg catalog can be synced automatically to their Snowflake Open Catalog service (a Polaris REST catalog), which could enable direct access to the tables from StarRocks and therefore remove any need to transfer data from Snowflake to StarRocks.

However, unlike MySQL (and StarRocks?), where database and schema are the same concept, Snowflake tables are stored in a two-level database structure: each table lives in a schema object that belongs to a database object (https://docs.snowflake.com/en/sql-reference/ddl-database). Snowflake replicates this structure in its Iceberg catalog through namespaces, so a table is always available in the namespace "database.schema".
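For illustration (using the database and schema names that appear later in this report), a minimal Snowflake-side sketch of that hierarchy:

```sql
-- Hypothetical setup: a table always lives in a schema, which lives in a database.
CREATE DATABASE PROD_APP;
CREATE SCHEMA PROD_APP.PRSANA;
-- Any Iceberg table created under PROD_APP.PRSANA is then exposed by the
-- REST catalog under the two-part namespace ["PROD_APP", "PRSANA"].
```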

After successfully connecting StarRocks to the Snowflake Open Catalog, I tried to access some tables, but my attempts to select the right namespace or table through the commands USE database.schema or SELECT * FROM database.schema.table always fail.

So StarRocks does not seem to handle multi-level namespaces.

Describe the solution you'd like Allow fully qualified object identifiers in SQL commands to match the namespace structure, as Spark does: https://iceberg.apache.org/docs/1.6.0/spark-configuration/#using-catalogs.

At a minimum, enable users to set the right namespace through the USE command.
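As a sketch of the expected usage (the catalog name below is a placeholder; the database, schema, and table names are the ones from the Postman response further down):

```sql
-- Desired behavior, not what happens today: address the table through its full
-- multi-level namespace path, similar to Spark's Iceberg integration.
SET CATALOG snowflake_open_catalog;                    -- placeholder external catalog name
USE PROD_APP.PRSANA;                                   -- select the two-level namespace
SELECT * FROM PROD_APP.PRSANA.MY_TEST_TABLE_ICEBERG;   -- or fully qualified access
```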

Describe alternatives you've considered Using an alternative ETL process to transfer the data, which would require extra effort and cost, and would be error-prone and inefficient for millions of rows.

DorianZheng commented 3 weeks ago

@marc-marketparts Hi, thanks for filing this issue. Could you try

use `database.schema`

or

select * from `database.schema`.table

We have also encountered this issue with Unity Catalog; we typically treat the entire multi-level namespace as a single database object, so the quoted forms above should work.

marc-marketparts commented 3 weeks ago

Hi @DorianZheng

It does not work, unfortunately; I get the error "Unknown database" for both.

I also tried with the official namespace separator %1F (use database%1Fschema), but it fails too. https://github.com/apache/iceberg/blob/6319712b612b724fedbc5bed41942ac3426ffe48/open-api/rest-catalog-open-api.yaml#L225

DorianZheng commented 3 weeks ago

Could you send me the fe.log?

marc-marketparts commented 3 weeks ago

Here is the FE log: fe.log

Queries are executed through a mysql client after launching StarRocks in a local environment with "docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd starrocks/allin1-ubuntu:latest".

FYI, here is the Postman result of GET {{catalog_uri}}/v1/SNOWFLAKE_PROD/namespaces/PROD_APP%1FPRSANA/tables:

{
    "identifiers": [
        {
            "namespace": [
                "PROD_APP",
                "PRSANA"
            ],
            "name": "MY_TEST_TABLE_ICEBERG"
        }
    ],
    "next-page-token": null
}

The SQL "SHOW DATABASES" displays the first level of namespace only. I think it should also display recursively all levels, so we know what sub namespace are available, for example, "database1", "database2", "database1.schema1", "database1.schema2", etc., and then call USE command.

Smith-Cruise commented 1 week ago

Looks like we need to support it, but it should be easy enough.

DorianZheng commented 6 days ago

@marc-marketparts We have acknowledged that this is an issue. We will fix it ASAP.

marc-marketparts commented 6 days ago

@DorianZheng Thanks, this will avoid extra ETL processes.