Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

UnknownHostException for created storage account #391

Closed. ChingChuan-Chen closed this issue 1 month ago.

ChingChuan-Chen commented 2 months ago

Describe the bug As I understand it, when the data is large, the Kusto Spark Connector creates transient storage and writes the result there as CSV files. In my case, however, it somehow cannot reach the created storage account and raises the following exception:

reactor.core.Exceptions$ReactiveException: java.net.UnknownHostException: *****.blob.core.windows.net: Name or service not known

To Reproduce I am not sure why it happens. The Kusto cluster is in another subscription that allows connections only over VPN, and our Synapse workspace is in a private virtual network.

Expected behavior It should be possible to read the data through the Kusto Spark Connector.


Additional context Spark 3.4 with "com.microsoft.azure.kusto" %% "kusto-spark_3.0" % "5.0.8".
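
For reference, the read is roughly the following (cluster, database, and AAD app values are placeholders, and the option names are the ones I believe the connector exposes for source reads):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder values for the cluster, database, and AAD app.
val df = spark.read
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "mycluster.westeurope")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoQuery", "MyTable | where ingestion_time() > ago(1d)")
  .option("kustoAadAppId", "<app-id>")
  .option("kustoAadAppSecret", "<app-secret>")
  .option("kustoAadAuthorityID", "<tenant-id>")
  .load()

// For large results the connector first exports the data to the cluster's
// transient blob storage; that export is where the UnknownHostException shows
// up for us, because the storage hostname is not resolvable from our VNet.
df.show(10)
```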

ag-ramachandran commented 2 months ago

Please refer to https://github.com/Azure/azure-kusto-spark/issues/385 and https://github.com/Azure/azure-kusto-spark/issues/390 and see if they provide pointers.

ChingChuan-Chen commented 1 month ago

Thank you. It seems there is nothing I can do if I want to continue using this library. Because the Kusto cluster is not under our management, the storage account is blocked by network rules. Also, SAS tokens are forbidden by our compliance policy, so transient storage does not work for us either.

I think I will need to reinvent the wheel: partition the data by hash in the query myself and read each partition into Spark.
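
Roughly what I have in mind is the sketch below (the hash column, partition count, and the forced single read mode are assumptions on my side):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val numPartitions = 16

// Read one hash bucket at a time so each individual query result stays small,
// then union the pieces into a single DataFrame.
val parts: Seq[DataFrame] = (0 until numPartitions).map { i =>
  spark.read
    .format("com.microsoft.kusto.spark.datasource")
    .option("kustoCluster", "mycluster.westeurope")
    .option("kustoDatabase", "MyDatabase")
    .option("kustoQuery", s"MyTable | where hash(MyKeyColumn, $numPartitions) == $i")
    .option("kustoAadAppId", "<app-id>")
    .option("kustoAadAppSecret", "<app-secret>")
    .option("kustoAadAuthorityID", "<tenant-id>")
    // Assumption: forcing the non-exporting read path avoids the transient blob storage.
    .option("readMode", "ForceSingleMode")
    .load()
}

val df = parts.reduce(_ union _)
```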

ag-ramachandran commented 1 month ago

Hello @ChingChuan-Chen, thanks for the comment. If I understand it right, you want to bring your own storage for ingestion? Is that a fair understanding?

There is a specific set of storage accounts used per Kusto cluster (they do not change over the lifetime of the cluster); I can provide a walkthrough of this if needed.
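
As a starting point, a sketch of how those storage endpoints might be listed so they can be added to a network allow list; the ".get ingestion resources" command is issued against the ingest (DM) endpoint, and the exact column names here are from memory, so treat the details as assumptions:

```scala
import com.microsoft.azure.kusto.data.{ClientFactory, ConnectionStringBuilder}

// Connect to the cluster's ingest (DM) endpoint; credentials are placeholders.
val csb = ConnectionStringBuilder.createWithAadApplicationCredentials(
  "https://ingest-mycluster.westeurope.kusto.windows.net",
  "<app-id>",
  "<app-secret>",
  "<tenant-id>")
val client = ClientFactory.createClient(csb)

// The temporary containers and queues used for queued ingestion are fixed per
// cluster; the storage accounts behind them are what need to be reachable
// (DNS and firewall) from the Spark-side virtual network.
val resources = client.execute("NetDefaultDB", ".get ingestion resources").getPrimaryResults()
while (resources.next()) {
  println(resources.getString("ResourceTypeName") + " -> " + resources.getString("StorageRoot"))
}
```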

"Thank you. It seems there is nothing I can do if I want to continue using this library." Unfortunately, this is the way the connector was designed: many customers who use the library do not want separate storage (except in the read case, where they want to reuse the data), and ingestion should work as-is. The library just uses queued ingestion, which uses blobs to ingest the data.
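
For completeness, a write through the connector looks roughly like this (table name and credentials are placeholders, and the option names are to the best of my recollection); the staged blobs go to those same cluster-owned storage accounts:

```scala
// Placeholders for the target table and credentials; the connector stages the
// rows as blobs and submits them to queued ingestion behind the scenes.
df.write
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "mycluster.westeurope")
  .option("kustoDatabase", "MyDatabase")
  .option("kustoTable", "MyTargetTable")
  .option("kustoAadAppId", "<app-id>")
  .option("kustoAadAppSecret", "<app-secret>")
  .option("kustoAadAuthorityID", "<tenant-id>")
  .mode("Append")
  .save()
```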

If you have a question, I am happy to take it forward. You can reach me at ramacg at ms dot com.