Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

Kusto library not working when a private endpoint is enabled on Databricks: <>.blob.core.windows.net Name or service not known #385

Closed RaghuVarmaS closed 4 months ago

RaghuVarmaS commented 4 months ago

We have a Databricks instance with a private endpoint enabled. We are using the com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.0 library on a Databricks cluster to connect to Kusto and read data. The issues are:

  1. Queries do not run when we use ForceDistributedMode.
  2. Small-data queries run without ForceDistributedMode.
  3. Large-data queries do not run without ForceDistributedMode. The logs show that the connector automatically switches to ForceDistributedMode for parallelism.

The error we get is: <>.blob.core.windows.net Name or service not known

image

The above query works without ForceDistributedMode.

image

When we use ForceDistributedMode, it fails with the <>.blob.core.windows.net error.

image

Even when we do not force distributed mode, we still hit the <>.blob.core.windows.net error, because the result set is large and the connector automatically switches to a parallel mode.

I am not sure how the Spark connector works across the different modes.

RaghuVarmaS commented 4 months ago

Do we know how the Spark connector uses the intermediary blob storage?

ag-ramachandran commented 4 months ago

Hello @RaghuVarmaS

Reads from the connector happen in 2 modes:

SingleMode: Queries Kusto directly and gets the results.

ForceDistributedMode: When you hit Kusto limits (memory / number of records), data is exported to blob storage (export containers), and the exported data is then read and processed into a DataFrame. This requires that writes to the blob be permitted, which was the case earlier. The connector switches modes automatically: it checks the approximate row count or the time the query takes and switches based on that.

Ref: [Refer ReadMode](https://github.com/Azure/azure-kusto-spark/blob/master/docs/KustoSource.md).

Query Limits : https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/query-limits

Now, to solutions

In the OSS connector you can pass your own storage for this as well. This is done on the connector using TransientStorageParameters (https://github.com/Azure/azure-kusto-spark/blob/master/docs/KustoSource.md#transient-storage-parameters).

If you are on a subnet with a PE, you will have to manage the DNS and whitelist the storages. (You can see the storages used in the Azure portal.)

Screenshot from 2024-07-06 07-14-29 (1)

If you are sure that you do not want the connector to use ForceDistributedMode as the readMode, you will have to pass the ForceSingleMode option explicitly. Performance and load on the Kusto cluster may be affected by this (not recommended).
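The read-mode choice above can be sketched as connector options. This is a minimal PySpark-oriented sketch, not reference usage of the connector: `build_read_options` is a hypothetical helper, the cluster, database and query values are placeholders, and the option keys (`kustoCluster`, `kustoDatabase`, `kustoQuery`, `readMode`) are assumed to match the ones listed in KustoSource.md. Authentication options are omitted.

```python
from typing import Dict, Optional

# Read modes named in this thread; leaving readMode unset lets the connector choose.
VALID_READ_MODES = ("ForceSingleMode", "ForceDistributedMode")


def build_read_options(cluster: str, database: str, query: str,
                       read_mode: Optional[str] = None) -> Dict[str, str]:
    """Build the option map for a Kusto read (hypothetical helper).

    The option keys are assumptions taken from KustoSource.md; verify them
    against the connector version you actually run.
    """
    options = {
        "kustoCluster": cluster,
        "kustoDatabase": database,
        "kustoQuery": query,
    }
    if read_mode is not None:
        if read_mode not in VALID_READ_MODES:
            raise ValueError(f"unknown read mode: {read_mode!r}")
        options["readMode"] = read_mode
    return options


# Placeholder values. Single mode avoids the blob export but puts the load on Kusto.
opts = build_read_options("mycluster.westus", "MyDatabase", "MyTable | take 10",
                          read_mode="ForceSingleMode")
print(opts)
```

On a cluster with the connector installed, the map would typically be passed as `spark.read.format("com.microsoft.kusto.spark.datasource").options(**opts).load()`, together with whatever authentication options your setup needs.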

RaghuVarmaS commented 4 months ago

Thank you for the prompt reply on this. I have a few follow-up questions:

  1. Due to security restrictions, we are not allowed to use storage keys.
  2. Does the Kusto connector create the storage on the fly? If so, where does it create it? I tried searching for the storage in my subscriptions and couldn't find it.
  3. How do I configure the DNS to ensure the blob works?


ag-ramachandran commented 4 months ago
  1. Due to security restrictions, we are not allowed to use storage keys: OK, that is an org policy.

  2. Does the Kusto connector create the storage on the fly? If so, where does it create it? I tried searching for the storage in my subscriptions and couldn't find it.

Not sure whether you had a chance to look at the screenshot in the answer above. You can see the storage in the DNS configuration in that screenshot. Read up here. These are managed storages that you will not find in your subscriptions.

  3. How do I configure the DNS to ensure the blob works?

You will have to ask a network administrator to work with you on this, someone who can help with the networking of ADB and Kusto with PE. We have other customers who use this setup, so the right networking will work (but it is not in the scope of the connector).
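One way to verify the DNS setup from the Databricks side is to check whether the blob hostnames resolve from a notebook on the cluster. Below is a minimal Python sketch using only the standard library. `resolve_hosts` and `is_private` are hypothetical helpers, and the `lookup` parameter exists only so the resolver can be swapped out for testing.

```python
import ipaddress
import socket
from typing import Callable, Dict, Iterable, List


def resolve_hosts(hosts: Iterable[str],
                  lookup: Callable = socket.getaddrinfo) -> Dict[str, List[str]]:
    """Map each hostname to its resolved IPs; an empty list means it did not resolve."""
    results: Dict[str, List[str]] = {}
    for host in hosts:
        try:
            infos = lookup(host, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            # Same failure class as "Name or service not known".
            results[host] = []
            continue
        # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        results[host] = sorted({info[4][0] for info in infos})
    return results


def is_private(ip: str) -> bool:
    """True when the address is in a private range, as a private endpoint IP would be."""
    return ipaddress.ip_address(ip).is_private
```

Calling `resolve_hosts(["<storage-name>.blob.core.windows.net"])` with the storage names shown for your Kusto cluster returns an empty list for any host that does not resolve. For hosts that do resolve, `is_private` indicates whether the address is a private one rather than a public one. Running the check again later is a simple way to notice if a resolved address has changed.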

RaghuVarmaS commented 4 months ago

I understand that these are managed storages. My question is: who creates this managed storage? Is it the connector, or is it Kusto? It is very important that, once I configure the DNS as shown in the image you pasted, my storage location doesn't keep changing. Today I am getting the error "Unknown host x.blob.core.windows.net"; tomorrow I might get the same error for y.blob.core.windows.net.

RaghuVarmaS commented 4 months ago

I have even removed the private endpoints and tried again, and we got the same error. What we are not able to troubleshoot is how to establish a connection between the intermediary storage and Databricks.

RaghuVarmaS commented 4 months ago

@ag-ramachandran - could you check my comments above and respond? We added an A record to our private link and it worked. But my worry is that the IP addresses might change tomorrow and it might stop working.