Open MigQ2 opened 1 year ago
I wish I could read the Delta Table
:laughing: me too
The Unity support in delta-rs is young, I would say. I have access to a Unity environment but not an Azure-specific Databricks+Unity environment. I'm honestly not sure how to start here; I assume the URL that was spit out to you is at a legitimate hostname that might otherwise respond to connections from wherever you are running this Python code?
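For context, the pattern the reporter is following presumably looks roughly like this sketch (hedged: `from_data_catalog` and `DataCatalog.UNITY` are from the deltalake Python API of that era; the environment variable names, schema, and table names here are assumptions/placeholders, not taken from the issue):

```python
from deltalake import DataCatalog, DeltaTable

# The catalog lookup goes through the workspace API, so the Unity client
# needs workspace URL and access token environment variables set
# (variable names assumed, e.g. DATABRICKS_WORKSPACE_URL / DATABRICKS_ACCESS_TOKEN).
dt = DeltaTable.from_data_catalog(
    data_catalog=DataCatalog.UNITY,
    database_name="my_schema",  # placeholder
    table_name="my_table",      # placeholder
)
```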
It looks like the current implementation works for storage location retrieval but will require additional creds for data access (in addition to Azure, I also tried AWS - similar story).
I suspect this could work if the application is running on a cloud VM with certain rights, but I didn't test that. (That `<SOME-IP-ADDRESS>` is `169.254.169.254`, right? That's a special IP usually used on cloud VMs to retrieve instance metadata, which gives a clue that credentials with sufficient rights are not available when the code tries to access data, so it tries to obtain some via instance metadata.)
In addition to being a metadata provider, Unity on Databricks also acts as an access token provider so it can enforce ACLs, etc. Using the same pattern on local/non-Databricks compute would provide a similar experience, but I don't know if that's achievable at the moment (or ever will be).
A possible quick fix could be providing additional credentials that allow access to the storage managed by UC. For example, when I specify `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` in my environment variables for an AWS account which can read from that S3 location, it works on AWS, as sketched below. (Well, it resolves that error, but then I get `The table's minimum reader version is 2 but deltalake only supports up to version 1` when I try `to_pyarrow_table`, but that's a different story.)
I guess this workaround may also work in Azure with the right secret/key/token/...
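A minimal sketch of that workaround (assuming the table's storage location was already resolved through UC; the bucket path and credential values are placeholders):

```python
import os

from deltalake import DeltaTable

# Placeholder values; use credentials for a principal that can read the
# S3 location backing the UC-managed table.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_SESSION_TOKEN"] = "<session-token>"

# The resolved storage location of the UC-managed table (placeholder path).
dt = DeltaTable("s3://my-bucket/path/to/table")
tbl = dt.to_pyarrow_table()  # may still fail on reader-version checks, as noted above
```

Equivalently, the same keys can be passed via the `storage_options` argument of `DeltaTable` instead of process-wide environment variables.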
Actually, this looks like expected behavior, as mentioned in https://github.com/delta-io/delta-rs/pull/1331#issuecomment-1581557227
@r3stl355 This is a topic I have recently discussed with @MrPowers and some of the Databricks team. I don't have a great solution to offer at the moment other than "we're working on figuring this out" :smile:
@rtyler maybe you could include me in those future conversations, given I work for Databricks atm :grin:
The Unity Catalog in my org is becoming a huge roadblock to using Delta-RS in a broad scope outside of internal team use. No one wants to provide read credentials to the storage anymore, which obliterates the use of Delta-RS within this context. Besides the possible vendor lock-in 😄, it makes interoperability with Databricks not ideal; currently, for any data reads we revert back to the databricks-sql connector.
I have the same problem:
OSError: Generic MicrosoftAzure error: Error performing token request: response error "request error", after 10 retries: error sending request for url (http://169.254.169.254/metadata/identity/oauth2/token?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com): error trying to connect: tcp connect error: Se ha intentado una operación de socket en una red no accesible. (os error 10051)
(The Spanish OS error 10051 translates to "A socket operation was attempted to an unreachable network.")
The 169.254.169.254 address is used to retrieve the authentication token: https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#get-a-token-using-http
But I don't understand why this is needed, as the Databricks documentation says we need to get a short-lived token and a signed URL:
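For the curious, the request the client keeps retrying is the Azure Instance Metadata Service call visible in the error above; it can be reproduced outside delta-rs like this (sketch only; the URL and parameters are copied verbatim from the error message):

```python
import urllib.request

# Azure IMDS endpoint, exactly as it appears in the error message. It is a
# link-local address, so this only succeeds from inside an Azure VM with a
# managed identity attached; anywhere else it fails with a network error,
# which is what the deltalake error surfaces after its retries.
url = (
    "http://169.254.169.254/metadata/identity/oauth2/token"
    "?api-version=2019-08-01&resource=https%3A%2F%2Fstorage.azure.com"
)
req = urllib.request.Request(url, headers={"Metadata": "true"})
try:
    with urllib.request.urlopen(req, timeout=2) as resp:
        print(resp.read())  # JSON containing access_token on a real Azure VM
except OSError as exc:
    print(f"not on an Azure VM: {exc}")
```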
Interesting, so UC by design gives out a token to read the data from storage. Then this token should just be returned when you query the Databricks REST API to get the table.
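A rough sketch of what that flow could look like against a Databricks workspace (hedged: this is not something delta-rs does today; the `temporary-table-credentials` endpoint, the response field names, and the `storage_options` key are assumptions based on UC credential vending, and the host/token/table names are placeholders):

```python
import requests

from deltalake import DeltaTable

HOST = "https://<workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<pat-or-oauth-token>"                    # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Look up the table to get its id and storage location.
table = requests.get(
    f"{HOST}/api/2.1/unity-catalog/tables/my_catalog.my_schema.my_table",
    headers=headers,
).json()

# 2. Ask UC to vend short-lived credentials for reading that table
#    (endpoint and payload fields assumed from the credential-vending design).
creds = requests.post(
    f"{HOST}/api/2.1/unity-catalog/temporary-table-credentials",
    headers=headers,
    json={"table_id": table["table_id"], "operation": "READ"},
).json()

# 3. Hand the vended credential to deltalake via storage_options
#    (response shape and option key assumed; shown here for the Azure/SAS case).
dt = DeltaTable(
    table["storage_location"],
    storage_options={
        "azure_storage_sas_token": creds["azure_user_delegation_sas"]["sas_token"],
    },
)
```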
Hi @ion-elgreco, is this a good time to address this again, now that Unity Catalog OSS version 0.2.0 is released with credential vending support? Does this make it easier/clearer to implement?
https://github.com/unitycatalog/unitycatalog/releases/tag/v0.2.0
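Against the OSS server, the same idea would presumably look something like this (hedged: the localhost port, endpoint path, sample table name, and payload fields are assumptions based on the 0.2.0 release notes mirroring the Databricks API, not verified against the OSS server):

```python
import requests

UC = "http://localhost:8080/api/2.1/unity-catalog"  # default OSS server URL, assumed

# Resolve table metadata, then ask the server to vend temporary credentials
# for a READ on that table (field names assumed to mirror the Databricks API).
table = requests.get(f"{UC}/tables/unity.default.marksheet").json()
creds = requests.post(
    f"{UC}/temporary-table-credentials",
    json={"table_id": table["table_id"], "operation": "READ"},
).json()
print(creds)  # cloud-specific keys a client could feed into DeltaTable storage_options
```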
Sure, feel free to take a jab at it
Environment:
Bug
What happened:
I am trying to replicate this example from the documentation to read a Delta Table from Databricks Unity Catalog:
but I get the following error:
Stacktrace:
What you expected to happen:
I wish I could read the Delta Table
More details: