Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Document how to pass OneLake credentials #2427

Closed: djouallah closed this 3 months ago

djouallah commented 4 months ago

It is not clear from the documentation how to pass OneLake credentials. Can you have a look, please?

https://colab.research.google.com/drive/1c8nlVtqeG9upnSHOPx7TYjCWGJUk96Kp#scrollTo=73d43e23

kevinzwang commented 4 months ago

Hi Mim! We currently don't support Microsoft Fabric or passing storage options as an IO config, but we are looking into adding that support, as well as exposing a public API for converting between storage options and daft.io.IOConfig. Will get back to you in the next few days!

kevinzwang commented 3 months ago

A quick update: the PR in #2436 should unblock your use case. Just make sure to store your bearer token in an IOConfig object and pass that in instead of a storage options dict.

Also, I created issue #2435 to track our progress on supporting storage options similar to Polars, which would make this process even easier!

kevinzwang commented 3 months ago

@djouallah we just cut v0.2.29, which adds the bearer_token parameter to daft.io.AzureConfig! Let me know if there are any issues with using it, or if not, feel free to close this issue.
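
For reference, a minimal sketch of the intended usage; the token value and the OneLake path are placeholders you'd substitute with your own:

```python
import daft
from daft.io import IOConfig, AzureConfig

# Store the bearer token in an AzureConfig (new in v0.2.29) and wrap it
# in an IOConfig, instead of passing a storage options dict.
io_config = IOConfig(azure=AzureConfig(bearer_token="<your-bearer-token>"))

# Placeholder OneLake path; substitute your workspace, lakehouse, and table.
df = daft.read_parquet(
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/<table>/",
    io_config=io_config,
)
```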

djouallah commented 3 months ago

I will check it when I get home. Let's hope it figures out the OneLake endpoint URL, as it is a little different from standard ADLS Gen2.

djouallah commented 3 months ago

Still errors:

```
DaftCoreException: DaftError::External Unable to open file abfss://mim_test@onelake.dfs.fabric.microsoft.com/data.Lakehouse/Tables/scada/year=2024/part-00000-09ab865f-2062-43ef-8351-42036699fdef.c000.snappy.parquet: Error { context: Full(Custom { kind: Io, error: Error { context: Full(Custom { kind: Io, error: reqwest::Error { kind: Request, url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("onelake.blob.core.windows.net")), port: None, path: "/mim_test/data.Lakehouse/Tables/scada/year=2024/part-00000-09ab865f-2062-43ef-8351-42036699fdef.c000.snappy.parquet", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Uncategorized, error: "failed to lookup address information: Name or service not known" })) } }, "failed to execute `reqwest` request") } }, "retry policy expired and the request will no longer be retried") }
```

kevinzwang commented 3 months ago

Ah, sorry to hear that. Will take a look soon.

kevinzwang commented 3 months ago

@djouallah could you try setting endpoint_url in daft.io.AzureConfig to "https://onelake.blob.fabric.microsoft.com"? I'm not sure if that will work, but I think at least one of the issues is that some incorrect assumptions are being made about the cloud location, and those can currently be overridden via endpoint_url.
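
A sketch of that suggested workaround, using the path from the error above; the token value is a placeholder:

```python
import daft
from daft.io import IOConfig, AzureConfig

# Override the endpoint so Daft stops assuming the default
# *.blob.core.windows.net location for the storage account.
io_config = IOConfig(
    azure=AzureConfig(
        bearer_token="<your-bearer-token>",
        endpoint_url="https://onelake.blob.fabric.microsoft.com",
    )
)

df = daft.read_parquet(
    "abfss://mim_test@onelake.dfs.fabric.microsoft.com/data.Lakehouse/Tables/scada/",
    io_config=io_config,
)
```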

djouallah commented 3 months ago

@kevinzwang you are a genius!!! It works

[screenshot attached]

kevinzwang commented 3 months ago

Great to hear! We'll probably add a parameter, something like use_fabric_endpoint, to the storage config to set this automatically for the user.
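
If that lands, usage might look something like the sketch below. Note that use_fabric_endpoint is only the name floated above, not a released API at the time of this comment:

```python
from daft.io import IOConfig, AzureConfig

# Hypothetical flag: switch to the OneLake/Fabric endpoint automatically,
# so users would not need to set endpoint_url by hand.
io_config = IOConfig(
    azure=AzureConfig(
        bearer_token="<your-bearer-token>",
        use_fabric_endpoint=True,
    )
)
```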

jaychia commented 3 months ago

Omg nice 🔥 🔥 🔥

Should we make a little guide for Azure? Here's the current one: https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/microsoft-azure.html

@djouallah I feel like we're probably not representing the credentials story for Azure well in that document. Would love it if you made some suggestions (or maybe even a contribution? https://github.com/Eventual-Inc/Daft/blob/main/docs/source/user_guide/integrations/microsoft-azure.rst)