delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Can't read a Delta table from Azure Unity Catalog #1628

Open MigQ2 opened 1 year ago

MigQ2 commented 1 year ago

Environment:

Bug

What happened:

I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:

from deltalake import DataCatalog, DeltaTable

catalog_name = 'main'
schema_name = 'db_schema'
table_name = 'db_table'
data_catalog = DataCatalog.UNITY

dt = DeltaTable.from_data_catalog(
    data_catalog=data_catalog,
    data_catalog_id=catalog_name,
    database_name=schema_name,
    table_name=table_name,
)

but I get the following error:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://<SOME-IP-ADDRESS>/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Connection refused (os error 111)

Stacktrace:

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:285 in from_data_catalog

    282             database_name=database_name,
    283             table_name=table_name,
    284         )
  ❱ 285         return cls(
    286             table_uri=table_uri, version=version, log_buffer_size=log_buffer_size
    287         )
    288

/home/vscode/.local/lib/python3.10/site-packages/deltalake/table.py:246 in __init__

    243
    244         """
    245         self._storage_options = storage_options
  ❱ 246         self._table = RawDeltaTable(
    247             str(table_uri),
    248             version=version,
    249             storage_options=storage_options,

What you expected to happen:

I wish I could read the Delta Table

rtyler commented 1 year ago

> I wish I could read the Delta Table

:laughing: me too

The Unity support in delta-rs is young, I would say. I have access to a Unity environment, but not an Azure-specific Databricks + Unity environment. I'm honestly not sure where to start here. I assume the URL that was spit out to you points at a legitimate hostname that would otherwise respond to connections from wherever you are running this Python code?

r3stl355 commented 12 months ago

It looks like the current implementation works for storage location retrieval, but data access will require additional credentials (in addition to Azure, I also tried AWS; similar story).

I suspect this could work if the application were running on a cloud VM with the right permissions, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP usually used on cloud VMs to retrieve instance metadata, which is a clue that credentials with sufficient rights are not available when the code tries to access the data, so it falls back to obtaining some via the instance metadata service. A quick probe sketch follows.)
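For anyone debugging this, a minimal sketch of checking that fallback yourself: probe the Azure Instance Metadata Service endpoint straight from the error message. The URL and the `Metadata: true` header come from the error output and Microsoft's IMDS docs; this only succeeds on an Azure VM with a managed identity, and anywhere else fails just like the error above.

```python
# Hedged sketch: probe the Azure Instance Metadata Service (IMDS) token
# endpoint that the Azure object store falls back to when no explicit
# credentials are configured. The "Metadata: true" header is required
# per Microsoft's docs; off an Azure VM this connection simply fails.
import json
import urllib.request

IMDS_URL = (
    "http://169.254.169.254/metadata/identity/oauth2/token"
    "?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com"
)

request = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
try:
    with urllib.request.urlopen(request, timeout=5) as response:
        token = json.load(response)
        print("Managed identity available; token expires at:", token.get("expires_on"))
except OSError as exc:  # URLError subclasses OSError
    print("IMDS unreachable (expected outside an Azure VM):", exc)
```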

In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).

A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but then I get `The table's minimum reader version is 2 but deltalake only supports up to version 1` when I try `to_pyarrow_table`, but that's a different story.)

I guess this workaround may also work in Azure with the right secret/key/token; a sketch of that follows.
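A rough sketch of that Azure workaround, assuming you can obtain direct storage credentials. The key names follow the `storage_options` configuration keys deltalake accepts for Azure; the account name, key, and table path below are placeholders, not values from this issue:

```python
# Hedged sketch: sidestep the IMDS fallback by passing explicit storage
# credentials through storage_options. A SAS token or service principal
# works too, with the corresponding keys. All values are placeholders.
from deltalake import DeltaTable

storage_options = {
    "azure_storage_account_name": "mystorageaccount",  # placeholder
    "azure_storage_account_key": "<account-key>",      # placeholder
}

dt = DeltaTable(
    "abfss://container@mystorageaccount.dfs.core.windows.net/path/to/table",
    storage_options=storage_options,
)
print(dt.version())
```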

r3stl355 commented 12 months ago

Actually, this looks like expected behavior, mentioned in https://github.com/delta-io/delta-rs/pull/1331#issuecomment-1581557227

rtyler commented 12 months ago

@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" :smile:

r3stl355 commented 12 months ago

@rtyler maybe you could include me in those future conversations, given I work for Databricks atm :grin:

ion-elgreco commented 12 months ago

> It looks like the current implementation works for storage location retrieval, but data access will require additional credentials (in addition to Azure, I also tried AWS; similar story).
>
> I suspect this could work if the application were running on a cloud VM with the right permissions, but I didn't test that. (That <SOME-IP-ADDRESS> is 169.254.169.254, right? That's a special IP usually used on cloud VMs to retrieve instance metadata, which is a clue that credentials with sufficient rights are not available when the code tries to access the data, so it falls back to obtaining some via the instance metadata service.)
>
> In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).
>
> A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN in my environment variables for an AWS account that can read from that S3 location, it works on AWS. (Well, it resolves that error, but then I get `The table's minimum reader version is 2 but deltalake only supports up to version 1` when I try `to_pyarrow_table`, but that's a different story.)
>
> I guess this workaround may also work in Azure with the right secret/key/token.

The Unity Catalog in my org is becoming a huge roadblock to using delta-rs in any broad scope outside of internal team use. No one wants to hand out read credentials to the storage anymore, which obliterates the use of delta-rs in this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks less than ideal; currently, for any data reads we revert to the databricks-sql connector (sketch below).
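For completeness, a minimal sketch of that fallback using the databricks-sql-connector package (`pip install databricks-sql-connector`); the hostname, HTTP path, and token below are placeholders for a workspace SQL warehouse:

```python
# Hedged sketch: read the same UC table through a Databricks SQL warehouse
# instead of hitting the storage directly. Connection values are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
    access_token="<personal-access-token>",                        # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM main.db_schema.db_table LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```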

davidvesp commented 11 months ago

I have the same problem:

OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Se ha intentado una operación de socket en una red no accesible. (os error 10051)

[The Spanish OS error translates to: "A socket operation was attempted on an unreachable network."]

The 169.254.169.254 address is used to retrieve the authentication token: https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http

But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL: [screenshot of the Databricks documentation on short-lived credentials]
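That documented flow looks roughly like the following. This is a hedged sketch based on my reading of the Unity Catalog temporary-table-credentials REST endpoint; the endpoint path and response shape should be verified against your workspace's API docs, and the workspace URL, token, and table id are placeholders:

```python
# Hedged sketch of the documented short-lived-credential flow: ask Unity
# Catalog for temporary table credentials over REST. Endpoint path and
# response fields are my reading of the UC API docs; verify against your
# workspace. All identifiers below are placeholders.
import json
import urllib.request

WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder
TABLE_ID = "<uc-table-uuid>"       # placeholder; comes from the get-table response

request = urllib.request.Request(
    f"{WORKSPACE}/api/2.1/unity-catalog/temporary-table-credentials",
    data=json.dumps({"table_id": TABLE_ID, "operation": "READ"}).encode(),
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    creds = json.load(response)
    # On Azure this should carry a short-lived SAS; on AWS, temporary keys.
    # Either could in principle be fed into deltalake's storage_options.
    print(creds)
```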

ion-elgreco commented 11 months ago

> I have the same problem:
>
> OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Se ha intentado una operación de socket en una red no accesible. (os error 10051)
>
> The 169.254.169.254 address is used to retrieve the authentication token: https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
>
> But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL: [screenshot of the Databricks documentation on short-lived credentials]

Interesting, so UC by design hands out a token to read the data from storage. Then that token should just be returned when you query the Databricks REST API's get-table endpoint.
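For contrast, here is roughly what resolving a table through the get-table endpoint returns today; per the thread above, this storage-location lookup is the part that already works, and the response carries no storage token. A hedged sketch; the workspace URL and token are placeholders, and the field names follow the Unity Catalog REST docs as I understand them:

```python
# Hedged sketch: resolve the table via the UC get-table endpoint, which is
# essentially the storage-location retrieval described earlier in this
# thread. The response carries storage_location and table_id, but notably
# no storage credentials. Workspace URL and token are placeholders.
import json
import urllib.request

WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

request = urllib.request.Request(
    f"{WORKSPACE}/api/2.1/unity-catalog/tables/main.db_schema.db_table",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(request) as response:
    table = json.load(response)
    print(table["storage_location"])  # e.g. an abfss:// or s3:// path
    print(table["table_id"])
```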