Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

Kusto Spark connector cannot connect to Fabric Kusto databases #325

Closed divyavanmahajan closed 1 year ago

divyavanmahajan commented 1 year ago

Describe the bug: Fabric Kusto databases have URLs of the form "https://trd3bep6cbtfa821kx6hfa.z5.kusto.fabric.microsoft.com/". Using the Spark Connector for Azure Data Explorer with a Fabric KQL database fails, whether the cluster is given as a short name or as a full URL.

If cluster="trd3bep6cbtfa821kx6hfa.z5", we get the error: DataServiceException: IOError when trying to retrieve CloudInfo. Caused by: UnknownHostException: trd3bep6cbtfa821kx6hfa.z5.kusto.windows.net: Name or service not known

The Spark connector assumes the suffix ".kusto.windows.net" and therefore does not find the cluster.

If cluster="https://trd3bep6cbtfa821kx6hfa.z5.kusto.fabric.microsoft.com/", we get the error: Can't communicate with 'trd3bep6cbtfa821kx6hfa.z5.kusto.fabric.microsoft.com' as this hostname is currently not trusted; please see https://aka.ms/kustotrustedendpoints
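The two failure modes above can be pictured with a minimal Python sketch. It assumes, based only on the error messages quoted here, that a bare cluster name is expanded by appending ".kusto.windows.net" while a full URL is used as given. `resolve_cluster_url` and `hostname` are hypothetical helpers for illustration, not the connector's actual code.

```python
from urllib.parse import urlparse

# Assumed default suffix, inferred from the UnknownHostException in the report.
DEFAULT_SUFFIX = ".kusto.windows.net"


def resolve_cluster_url(cluster: str) -> str:
    """Hypothetical illustration of how a cluster option might be expanded."""
    if cluster.startswith("https://"):
        # A full URL is taken as given, minus any trailing slash.
        return cluster.rstrip("/")
    # A bare name gets the default suffix appended.
    return f"https://{cluster}{DEFAULT_SUFFIX}"


def hostname(url: str) -> str:
    """Return the host part of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""
```

Under this assumption, the short name resolves to a host that does not exist for a Fabric database, while the full URL resolves to the right host but then has to pass the trusted-endpoint check.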

To Reproduce

  1. Create a Fabric KQL database.
  2. Get the Cluster ID for the KQL database in Fabric. Create a PowerBI report. View the report in Lineage View. The Dataset's source is AzureDataExplorer with the URL of the Fabric KQL database.
  3. Use the cluster prefix in the spark options
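    // KustoSourceOptions comes from the connector; conf, query, database, and cluster are defined elsewhere.
    import com.microsoft.kusto.spark.datasource.KustoSourceOptions

    val df = spark.read.format("com.microsoft.kusto.spark.datasource").
    options(conf).
    option(KustoSourceOptions.KUSTO_QUERY, query).
    option(KustoSourceOptions.KUSTO_DATABASE, database).
    option(KustoSourceOptions.KUSTO_CLUSTER, cluster).
    load()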

Expected behavior: The query should run and return data.


ag-ramachandran commented 1 year ago

Hello @divyavanmahajan

You will have to use version 4.0.x of the Spark connector. You can refer to our comments on this GitHub issue:

Maven coordinates not found in Databricks Install libraries - com.microsoft.azure.kusto:kusto-spark_3.0_2.12:4.0.2 · Issue #322 · Azure/azure-kusto-spark · GitHub

A related issue was opened as well, and the user marked it as fixed after testing with that version: Unable to use connector if uri ends with kusto.fabric.microsoft.com · Issue #323 · Azure/azure-kusto-spark · GitHub

Why: The connector relies on a set of "WellKnown" (trusted) Kusto endpoints. Through the 3.x series of the connector, *.fabric. was not one of them; it was added in versions 4 and up.
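The trusted-endpoint check described here can be sketched as a suffix match against an allow-list. The two lists below are hypothetical and deliberately tiny; the real "WellKnown" endpoint list ships with the Kusto SDK and contains more entries. `is_trusted` is an illustrative name, not an SDK function.

```python
# Hypothetical allow-lists, for illustration only.
TRUSTED_SUFFIXES_3X = (".kusto.windows.net",)
TRUSTED_SUFFIXES_4X = TRUSTED_SUFFIXES_3X + (".kusto.fabric.microsoft.com",)


def is_trusted(host: str, suffixes: tuple) -> bool:
    """Return True if the host ends with any of the trusted suffixes."""
    # str.endswith accepts a tuple and matches if any element matches.
    return host.lower().endswith(suffixes)
```

With lists like these, a Fabric hostname is rejected by the 3.x-style list and accepted by the 4.x-style list, which matches the behavior reported in this thread.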

The current problem with 4.x of the connector is that it must run on JDK 11. We are working on making it JDK 8 compatible as well, since many customers told us they are not ready to migrate. That release is one week out at most.

Note: You have to set Databricks to use JDK 11 by adding the following environment variable to the cluster configuration:

JNAME=zulu11-ca-amd64
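To confirm which JDK a cluster actually runs after setting the variable, one option is to read the JVM's `java.version` property and extract the major version. The helper below is a small sketch; `java_major_version` is a hypothetical name, and it only assumes the standard version string formats ("1.8.0_362" for Java 8, "11.0.19" for Java 11).

```python
import re


def java_major_version(version: str) -> int:
    """Extract the major Java version from a java.version string."""
    parts = version.split(".")
    # Java 8 and earlier report "1.x..."; Java 9 and later report "x.y.z".
    major = parts[1] if parts[0] == "1" and len(parts) > 1 else parts[0]
    # Keep only the leading digits, dropping suffixes such as "-ea".
    match = re.match(r"\d+", major)
    if match is None:
        raise ValueError(f"Unrecognized Java version string: {version!r}")
    return int(match.group())
```

In a PySpark notebook the version string can usually be obtained with `spark.sparkContext._jvm.java.lang.System.getProperty("java.version")`; note that `_jvm` is an internal PySpark attribute, so treat that call as an assumption about the environment rather than a supported API.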

ravikiransharvirala commented 1 year ago

Hi @ag-ramachandran, I am still seeing this issue on all-purpose compute, even after making the above changes.

With jobs compute clusters, however, it works as expected.

Config: [screenshot]

Library: Tried both with Maven coordinates and by uploading the JAR. [screenshot]

Cluster: [screenshot]

ag-ramachandran commented 1 year ago

Hello @ravikiransharvirala, if you have a full stack trace, that would be great. Please note that we are also planning to release a Java 8 compatible version sometime in the next week. That will make things simpler.

A few other questions, to eliminate other possibilities: Was the compute restarted after installing the JARs? Do you have any outbound VNET/firewall rules?

ravikiransharvirala commented 1 year ago

@ag-ramachandran please find the stack trace

Failed to execute query. Error : Can't communicate with '<cluster-name>.z2.kusto.fabric.microsoft.com' as this hostname is currently not trusted; please see https://aka.ms/kustotrustedendpoints

Thank you. Yes, I am waiting for that update.

I tried a restart, a new cluster, and running after installing the JAR. None of it worked; I got the same error message.

Re: firewall, none that I know of. I don't think so, because it works fine with Databricks jobs.

ag-ramachandran commented 1 year ago

Hi @ravikiransharvirala and @divyavanmahajan

https://mvnrepository.com/artifact/com.microsoft.azure.kusto/kusto-spark_3.0_2.12/5.0.0 is a newly released version that is JDK 8 compatible and works with Fabric as well. Please give it a try and let us know.
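The version constraints mentioned so far in this thread can be summarized in a short sketch: 3.x does not trust Fabric endpoints, 4.x trusts them but needs JDK 11, and 5.0.0 trusts them and runs on JDK 8. The function below encodes only what is stated in this thread; `connector_compat` is a hypothetical helper, the assumption that 3.x runs on JDK 8 and that versions after 5.0.0 behave like 5.0.0 is mine, and this is not an official compatibility matrix.

```python
def connector_compat(coordinate: str) -> dict:
    """Summarize Fabric and JDK requirements for a Maven coordinate.

    Expects the form "group:artifact:version", for example
    "com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.0".
    """
    _group, _artifact, version = coordinate.split(":")
    major = int(version.split(".")[0])
    return {
        # Fabric hostnames are trusted from the 4.x series onward.
        "fabric": major >= 4,
        # Only the 4.x series is described as requiring JDK 11 (assumption for others).
        "min_jdk": 11 if major == 4 else 8,
    }
```

For example, `connector_compat("com.microsoft.azure.kusto:kusto-spark_3.0_2.12:4.0.2")` reports Fabric support with a JDK 11 requirement, which is the situation that led to the `JNAME` workaround above.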

cc: @asaharn

ravikiransharvirala commented 1 year ago

Hi @ag-ramachandran,

I appreciate you following up on this.

I updated my cluster to the latest version, but now I see a different error when trying to query the ADX Fabric cluster.

java.lang.NoClassDefFoundError: com/microsoft/aad/msal4j/ClientCredentialFactory

    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    File :3
          2 kusto_config = get_kusto_config()
    ----> 3 df = spark.read.format("com.microsoft.kusto.spark.datasource") \
          4     .option("kustoCluster", kusto_config['cluster_uri']) \
          5     .option("kustoDatabase", ) \
          6     .option("kustoQuery", '') \
          7     .option("kustoAadAppId", kusto_config['client_id']) \
          8     .option("kustoAadAppSecret", kusto_config['client_secret']) \
          9     .option("kustoAadAuthorityID", kusto_config['tenant_id']) \
         10     .load()

    File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
         46 start = time.perf_counter()
         47 try:
    ---> 48     res = func(*args, **kwargs)
         49     logger.log_success(
         50         module_name, class_name, function_name, time.perf_counter() - start, signature
         51     )
         52     return res

    File /databricks/spark/python/pyspark/sql/readwriter.py:309, in DataFrameReader.load(self, path, format, schema, **options)
        307     return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
        308 else:
    --> 309     return self._df(self._jreader.load())

    File /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
       1315 command = proto.CALL_COMMAND_NAME +\
       1316     self.command_header +\
       1317     args_command +\
       1318     proto.END_COMMAND_PART
       1320 answer = self.gateway_client.send_command(command)
    -> 1321 return_value = get_return_value(
       1322     answer, self.gateway_client, self.target_id, self.name)
       1324 for temp_arg in temp_args:
       1325     temp_arg._detach()

    File /databricks/spark/python/pyspark/errors/exceptions.py:228, in capture_sql_exception.<locals>.deco(*a, **kw)
        226 def deco(*a: Any, **kw: Any) -> Any:
        227     try:
    --> 228         return f(*a, **kw)
        229     except Py4JJavaError as e:
        230         converted = convert_exception(e.java_exception)

    File /databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
        324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
        325 if answer[1] == REFERENCE_TYPE:
    --> 326     raise Py4JJavaError(
        327         "An error occurred while calling {0}{1}{2}.\n".
        328         format(target_id, ".", name), value)
        329 else:
        330     raise Py4JError(
        331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
        332         format(target_id, ".", name, value))

    Py4JJavaError: An error occurred while calling o423.load.
    : java.lang.NoClassDefFoundError: com/microsoft/aad/msal4j/ClientCredentialFactory
        at com.microsoft.azure.kusto.data.auth.TokenProviderFactory.createTokenProvider(TokenProviderFactory.java:28)
        at com.microsoft.azure.kusto.data.ClientImpl.<init>(ClientImpl.java:95)
        at com.microsoft.azure.kusto.data.ClientImpl.<init>(ClientImpl.java:67)
        at com.microsoft.azure.kusto.data.ClientFactory.createClient(ClientFactory.java:38)
        at com.microsoft.azure.kusto.data.ClientFactory.createClient(ClientFactory.java:25)
        at com.microsoft.kusto.spark.utils.ExtendedKustoClient.engineClient$lzycompute(ExtendedKustoClient.scala:32)
        at com.microsoft.kusto.spark.utils.ExtendedKustoClient.engineClient(ExtendedKustoClient.scala:32)
        at com.microsoft.kusto.spark.utils.ExtendedKustoClient.$anonfun$executeEngine$1(ExtendedKustoClient.scala:394)
        at com.microsoft.kusto.spark.utils.KustoDataSourceUtils$$anon$2.apply(KustoDataSourceUtils.scala:398)
        at io.github.resilience4j.retry.Retry.lambda$decorateCheckedSupplier$3f69f149$1(Retry.java:137)
        at io.github.resilience4j.retry.Retry.executeCheckedSupplier(Retry.java:419)
        at com.microsoft.kusto.spark.utils.KustoDataSourceUtils$.retryApplyFunction(KustoDataSourceUtils.scala:401)
        at com.microsoft.kusto.spark.utils.ExtendedKustoClient.executeEngine(ExtendedKustoClient.scala:395)
        at com.microsoft.kusto.spark.utils.KustoDataSourceUtils$.getSchema(KustoDataSourceUtils.scala:175)
        at com.microsoft.kusto.spark.datasource.KustoRelation.getSchema(KustoRelation.scala:145)
        at com.microsoft.kusto.spark.datasource.KustoRelation.schema(KustoRelation.scala:43)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:491)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:378)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:334)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:334)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
        at py4j.Gateway.invoke(Gateway.java:306)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
        at java.lang.Thread.run(Thread.java:750)

asaharn commented 1 year ago

Hi @ravikiransharvirala,

This is strange. Somehow the runtime cannot find a class that is used by one of the Kusto Spark connector's dependencies.

To work around this, you can try one of the following two options:

  1. Import the Azure Identity library explicitly.
  2. Try installing the JAR from the official GitHub release.

Please let us know if this works for you.

ravikiransharvirala commented 1 year ago

@asaharn Sorry for the delay here. It didn't work. I uploaded the latest JAR to the cluster and tested it.

Failed to execute query. Error : Can't communicate with 'z2.kusto.fabric.microsoft.com' as this hostname is currently not trusted; please see https://aka.ms/kustotrustedendpoints

ag-ramachandran commented 1 year ago

@ravikiransharvirala please set up a working session by sending an email to ramacg at ms.