delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Can't read a Delta table from Azure Unity Catalog #1628

Open · MigQ2 opened this issue 1 year ago

MigQ2 commented 1 year ago

Environment

Bug

What happened:

I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:

from deltalake import DataCatalog, DeltaTable
catalog_name = 'main'
schema_name = 'db_schema'
table_name = 'db_table'
data_catalog = DataCatalog.UNITY
dt = DeltaTable.from_data_catalog(
    data_catalog=data_catalog,
    data_catalog_id=catalog_name,
    database_name=schema_name,
    table_name=table_name,
)

but I get the following error:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://<SOME-IP-ADDRESS>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Connection refused (os error 111)

Stacktrace:

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:285 in from_data_catalog

  282         database_name=database_name,
  283         table_name=table_name,
  284     )
❱ 285     return cls(
  286         table_uri=table_uri, version=version, log_buffer_size=log_buffer_size
  287     )
  288

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:246 in __init__

  243
  244     """
  245     self._storage_options = storage_options
❱ 246     self._table = RawDeltaTable(
  247         str(table_uri),
  248         version=version,
  249         storage_options=storage_options,

What you expected to happen:

I wish I could read the Delta Table

rtyler commented 1 year ago

I wish I could read the Delta Table

:laughing: me too

The Unity support in delta-rs is young, I would say. I have access to a Unity environment, but not an Azure-specific Databricks+Unity environment. I'm honestly not sure where to start here; I assume the URL that was spit out to you is a legitimate hostname that would otherwise respond to connections from wherever you are running this Python code?

r3stl355 commented 1 year ago

It looks like the current implementation works for storage location retrieval, but data access will require additional creds. (In addition to Azure, I also tried AWS - similar story.)

I suspect this could work if the application is running on a cloud VM with certain rights, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP usually used on cloud VMs to retrieve instance metadata, which gives a clue that credentials with sufficient rights are not available when the code tries to access the data, so it tries to obtain some via the instance metadata service.)

In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).

A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account which can read from that S3 location, it works on AWS. (Well, it resolves that error, but then I get "The table's minimum reader version is 2 but deltalake only supports up to version 1" when I call to_pyarrow_table - that's a different story.)

I guess this workaround may also work in Azure with the right secret/key/token/... See the sketch below.
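
To illustrate, a minimal sketch of that workaround (the credential values are placeholders; it assumes the principal behind them can read the table's storage location, and for Azure the equivalents would be e.g. AZURE_STORAGE_ACCOUNT_NAME/AZURE_STORAGE_ACCOUNT_KEY, which are also picked up from the environment):

import os
from deltalake import DataCatalog, DeltaTable

# Placeholder credentials for a principal that can read the table's S3 location;
# delta-rs (via object_store) picks these up from the environment.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_SESSION_TOKEN"] = "<session-token>"

# Resolve the table location through Unity Catalog, then read it using the env creds
dt = DeltaTable.from_data_catalog(
    data_catalog=DataCatalog.UNITY,
    data_catalog_id="main",
    database_name="db_schema",
    table_name="db_table",
)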

r3stl355 commented 1 year ago

Actually, this looks like expected behavior, mentioned in https://github.com/delta-io/delta-rs/pull/1331#issuecomment-1581557227

rtyler commented 1 year ago

@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" :smile:

r3stl355 commented 1 year ago

@rtyler maybe you could include me in those future conversations, given I work for Databricks atm :grin:

ion-elgreco commented 1 year ago

> It looks like the current implementation works for storage location retrieval, but data access will require additional creds. […] I guess this workaround may also work in Azure with the right secret/key/token/...

Unity Catalog in my org is becoming a huge roadblock to using delta-rs in any broad scope outside of internal team use. No one wants to hand out read credentials to the storage anymore, which obliterates the use of delta-rs in this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks less than ideal; currently, for any data reads we revert back to the databricks-sql connector.

davidvesp commented 1 year ago

I have the same problem:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: A socket operation was attempted to an unreachable network. (os error 10051)

The address 169.254.169.254 is used to retrieve the authentication token: https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
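
For reference, this is roughly the IMDS call being attempted under the hood (a sketch; the Metadata header is required, and the link-local address is only reachable from inside Azure, which is why it fails anywhere else):

import requests

# Azure Instance Metadata Service (IMDS) token request; 169.254.169.254 is
# link-local and only routable from inside an Azure VM/container.
resp = requests.get(
    "http://169.254.169.254/metadata/identity/oauth2/token",
    params={"api-version": "2019-08-01", "resource": "https://storage.azure.com"},
    headers={"Metadata": "true"},  # required by IMDS
    timeout=5,
)
access_token = resp.json()["access_token"]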

But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL: [screenshot from the Databricks docs]

ion-elgreco commented 1 year ago

> The address 169.254.169.254 is used to retrieve the authentication token […] But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL.

Interesting, so UC by design vends a token to read the data from storage. Then that token should just be returned when you query the Databricks REST API's get-table endpoint.
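
A rough sketch of that flow against the Databricks REST API (endpoint paths and field names follow the public Unity Catalog API docs, but treat them as assumptions; host, token, and table name are placeholders):

import requests

host = "https://<workspace-url>"
headers = {"Authorization": "Bearer <databricks-token>"}

# Look up the table, which returns its id and storage_location among other metadata
table = requests.get(
    f"{host}/api/2.1/unity-catalog/tables/main.db_schema.db_table",
    headers=headers,
).json()

# Ask Unity Catalog to vend a short-lived, table-scoped storage credential
creds = requests.post(
    f"{host}/api/2.1/unity-catalog/temporary-table-credentials",
    headers=headers,
    json={"table_id": table["table_id"], "operation": "READ"},
).json()

# `creds` would carry e.g. an Azure SAS token or temporary AWS keys that could
# then be passed to DeltaTable via storage_options.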

tunayokumus commented 2 weeks ago

Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?

https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0
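
If that release works the way the notes suggest, the vending call against a local OSS server might look something like this (a guess; the base URL and unauthenticated access are assumptions about a default dev setup):

import requests

# Default local address of the Unity Catalog OSS dev server (an assumption)
base = "http://localhost:8080/api/2.1/unity-catalog"

table = requests.get(f"{base}/tables/main.db_schema.db_table").json()
creds = requests.post(
    f"{base}/temporary-table-credentials",
    json={"table_id": table["table_id"], "operation": "READ"},
).json()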

ion-elgreco commented 2 weeks ago

> Is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support?

Sure, feel free to take a jab at it