delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Support for Apache Hive Data Catalog #1006

Open chitralverma opened 1 year ago

chitralverma commented 1 year ago

Description

It would be great if there were support for reading Delta tables registered in a Hive catalog. Apache Spark also supports this. Currently, AFAIK, only the AWS Glue catalog is supported.

Use Case

Instead of providing a path to the Delta table, users should be able to provide Hive as a data catalog to read a Delta table.

For example:

from deltalake import DeltaTable
from deltalake import DataCatalog

database_name = "simple_database"
table_name = "simple_table"

data_catalog_opts = {"hive_user": "USER_NAME" ... }
data_catalog = DataCatalog.Hive(data_catalog_opts)

dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)

...

Related Issue(s)

chitralverma commented 1 year ago

Hi @houqp, I see that you have added a rust label to this.

I think that for this to be implemented in Rust, a Thrift client will have to be written specifically for this task, because no existing crate can connect to Hive from Rust, unlike rusoto_glue, which is available for Glue.

If we do this on the Python side instead, PyHive is available with DB-API 2.0 support.
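A minimal sketch of what that Python-side workaround could look like, written against a generic DB-API 2.0 cursor so that it is not tied to a live Hive server. The helper name `hive_table_location` is hypothetical, and the sketch assumes that `DESCRIBE FORMATTED` returns a row whose first column is `Location:` and whose second column is the table's storage path. It also assumes that this path is the Delta table root, which may not hold for every way a table can be registered.

```python
import re

# Only plain identifiers are accepted, because the names are interpolated into SQL.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def hive_table_location(cursor, database_name, table_name):
    """Return the storage location Hive reports for a table, or None if absent.

    `cursor` is any DB-API 2.0 cursor (for example one obtained from PyHive).
    Hypothetical helper: not part of deltalake or PyHive.
    """
    for name in (database_name, table_name):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")

    cursor.execute(f"DESCRIBE FORMATTED `{database_name}`.`{table_name}`")
    for row in cursor.fetchall():
        # Assumed row shape: (col_name, data_type, comment), e.g.
        # ("Location:           ", "s3://bucket/path", None)
        key = (row[0] or "").strip().rstrip(":")
        if key == "Location" and len(row) > 1 and row[1]:
            return row[1].strip()
    return None


# Usage sketch (needs a reachable HiveServer2; host and user are placeholders):
#
#   from pyhive import hive
#   from deltalake import DeltaTable
#
#   cursor = hive.connect(host="HIVE_HOST", username="USER_NAME").cursor()
#   location = hive_table_location(cursor, "simple_database", "simple_table")
#   dt = DeltaTable(location)
```

Taking the cursor as a parameter keeps the lookup independent of how the connection is made, so the same function would work with any other DB-API driver for Hive.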

houqp commented 1 year ago

We could do it in Python as a temporary workaround, but the right thing to do is to implement it in Rust. A Thrift Hive client shouldn't be too much work to implement.

chitralverma commented 1 year ago

> We could do it in Python as a temporary workaround, but the right thing to do is to implement it in Rust. A Thrift Hive client shouldn't be too much work to implement.

Alright, let me check it out. My Rust is not very good yet. :D

houqp commented 1 year ago

I think HiveServer2 also supports ODBC? If so, that might be an easier route for us.

chitralverma commented 1 year ago

> I think HiveServer2 also supports ODBC? If so, that might be an easier route for us.

No, actually. ODBC was part of Hortonworks (HDP) for the very old HiveServer1. It was removed in HiveServer2 in favour of Thrift.

See https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=27362101#content/view/27362099