databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

com.databricks.spark.xml Could not find ADLS Gen2 Token #591

Closed · betizad closed this issue 1 year ago

betizad commented 2 years ago

I access ADLS Gen2 files from Databricks using the following cluster configuration, through a service principal, as recommended by the Databricks documentation. The idea is to run the notebook as a service principal with AAD passthrough.

spark.databricks.delta.preview.enabled true
spark.databricks.passthrough.enabled true
spark.databricks.repl.allowedLanguages python,sql
spark.databricks.cluster.profile serverless
spark.databricks.pyspark.enableProcessIsolation true
fs.azure.account.oauth2.client.id.<StorageAccountName>.dfs.core.windows.net {{secrets/<SecretScope>/<ClientSecretName>}}
fs.azure.account.oauth2.client.endpoint.<StorageAccountName>.dfs.core.windows.net https://login.microsoftonline.com/<TenantId>/oauth2/token
fs.azure.account.oauth2.client.secret.<StorageAccountName>.dfs.core.windows.net {{secrets/<SecretScope>/<SecretName>}}
fs.azure.account.auth.type.<StorageAccountName>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<StorageAccountName>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider

Everything works well, for example reading Parquet files or listing files with dbutils.fs.ls(), but when I try to read an XML file I get the following error: com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token
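For reference, the read I'm attempting looks roughly like this (the rowTag value and the path are placeholders):

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("abfss://<Container>@<StorageAccountName>.dfs.core.windows.net/path/to/file.xml")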

I even tried setting the config in the notebook as below (based on issue #453: https://github.com/databricks/spark-xml/issues/453):

spark.conf.set("spark.hadoop.fs.azure.account.oauth2.client.id.<StorageAccountName>.dfs.core.windows.net", spClientId)
spark.conf.set("spark.hadoop.fs.azure.account.oauth2.client.endpoint.<StorageAccountName>.dfs.core.windows.net", "https://login.microsoftonline.com/<TenantId>/oauth2/token")
spark.conf.set("spark.hadoop.fs.azure.account.oauth2.client.secret.<StorageAccountName>.dfs.core.windows.net", spSecret)
spark.conf.set("spark.hadoop.fs.azure.account.auth.type.<StorageAccountName>.dfs.core.windows.net", "OAuth")
spark.conf.set("spark.hadoop.fs.azure.account.oauth.provider.type.<StorageAccountName>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") 

but the result is the same: the Spark session does not seem to find the token to authenticate with ADLS Gen2.

I'm a bit puzzled, since if I start the same cluster and read the XML file through my own account, it works fine and I don't get any "token not found" error.

srowen commented 2 years ago

I'm not sure about this one. Under the hood, the library uses the same mechanisms that read text files via the Hadoop APIs, so I don't know why setting the configs in one place works and in another doesn't. This is a bit outside the library's scope.

betizad commented 2 years ago

I could not find any way around the issue; any suggestions are welcome. As a temporary solution, I copy the file to a temp location in the workspace, do the processing there, save the results, and move them back to the storage account with dbutils.fs.mv, which manages to find the token.
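Roughly, the workaround looks like this (all paths and the rowTag value are placeholders):

# copy the file into the workspace, where no ADLS token is needed for the read
dbutils.fs.cp("abfss://<Container>@<StorageAccountName>.dfs.core.windows.net/in/file.xml", "dbfs:/tmp/file.xml")
df = spark.read.format("com.databricks.spark.xml").option("rowTag", "record").load("dbfs:/tmp/file.xml")
# ... transformations ...
df.write.parquet("dbfs:/tmp/result")
# dbutils.fs.mv does find the token, so move the results back to the storage account
dbutils.fs.mv("dbfs:/tmp/result", "abfss://<Container>@<StorageAccountName>.dfs.core.windows.net/out/result", recurse=True)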

aspandjev commented 1 year ago

@betizad Did you find a solution for this one? I'm trying the same thing, reading an XML file from ADLS Gen2, and it doesn't work. I've tried all the authentication methods and it always returns 'Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key', even when the key is set. I'm calling spark.conf.set("fs.azure.account.key.storageaccountname.dfs.core.windows.net", "secret") and it still doesn't find it. I am using the abfss scheme, and dbutils.fs.ls works. I've also tested reading a CSV file, and that works using a SAS token and abfss.

betizad commented 1 year ago

No, I did not. The issue only appears when using this XML driver. I use the trick mentioned above to get around it. It's not ideal, but at the moment it's the only solution I could find.

Brontomerus commented 1 year ago

It seems user Deepak on Stack Overflow was able to solve this. I tried it in an environment and it appears to work for spark-xml as well; check out the question here. I believe the issue is that you need to attach the credentials to the underlying Hadoop configuration, and then it works.


# Set the OAuth credentials directly on the underlying Hadoop configuration;
# spark._jsc is the session's internal Java SparkContext handle.
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
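With those set, a read along these lines then worked in my test (the rowTag value and the path are placeholders):

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "record") \
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/file.xml")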
menwillbemen commented 1 year ago

We can simply add the spark.hadoop prefix to these properties and add them to the Spark configuration, and it will work; setting spark._jsc.hadoopConfiguration() is just an internal implementation detail.
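For example, the cluster Spark config from the original post becomes the following (same placeholders, with the auth.type line kept from the original config):

spark.hadoop.fs.azure.account.auth.type.<StorageAccountName>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<StorageAccountName>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<StorageAccountName>.dfs.core.windows.net {{secrets/<SecretScope>/<ClientSecretName>}}
spark.hadoop.fs.azure.account.oauth2.client.secret.<StorageAccountName>.dfs.core.windows.net {{secrets/<SecretScope>/<SecretName>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<StorageAccountName>.dfs.core.windows.net https://login.microsoftonline.com/<TenantId>/oauth2/token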