Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

Write to Kusto in Synapse with option "sparkIngestionPropertiesJson" always failed in spark 3.3 #342

Closed: xiaoshiyi123 closed this issue 10 months ago

xiaoshiyi123 commented 10 months ago

Describe the bug

Hi team. We want to upgrade our Spark pool from 3.2 to 3.3. However, when we write to Kusto with the "sparkIngestionPropertiesJson" option, the Spark job neither completes nor fails; it just keeps running for a long time.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Spark pool with Spark 3.3
  2. Create a notebook and attach it to this Spark pool
  3. Read a DataFrame and write it to Kusto with the option "sparkIngestionPropertiesJson"

Expected behavior

The write to Kusto completes successfully.

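Screenshots

With "sparkIngestionPropertiesJson" (sp): [two screenshots, not preserved]

Without sp, the write succeeds: [screenshot, not preserved]

Desktop (please complete the following information): [screenshot, not preserved]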


ag-ramachandran commented 10 months ago

Hi @xiaoshiyi123

a) There are a couple of challenges here. By using SparkIngestionPropertiesJson, we are setting the FlushImmediately flag to true, which we do not recommend.

Here is how the internals work:

DataFrame ----> Write to blob ------> Ingest this blob

To optimize ingestion throughput, the size of the blob is critical. Kusto is optimized for a few large blobs, as opposed to many small blobs. Please set the right batching policy in Kusto, and you can get rid of SparkIngestionProperties altogether.

(See: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy)
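As a rough illustration of what "setting a batching policy" could look like, here is a minimal Python sketch that only builds the text of a Kusto `.alter table ... policy ingestionbatching` management command. The helper name, the table name, and the three values are assumptions chosen for illustration, not recommendations from this thread; check the batching policy documentation linked above for suitable limits before running anything like it against a real cluster.

```python
import json


def ingestion_batching_command(table, max_time_span, max_items, max_raw_size_mb):
    """Build the text of an ingestion batching policy command (hypothetical helper).

    Nothing is sent to Kusto here; the function only formats a command string
    that could be run from a Kusto query window or client.
    """
    policy = {
        "MaximumBatchingTimeSpan": max_time_span,  # timespan literal, e.g. "00:05:00"
        "MaximumNumberOfItems": max_items,         # max blobs per batch
        "MaximumRawDataSizeMB": max_raw_size_mb,   # max uncompressed size per batch
    }
    # The policy object is passed as a verbatim string literal containing JSON.
    return f".alter table {table} policy ingestionbatching @'{json.dumps(policy)}'"


# Illustrative values only (assumed, not taken from the thread).
command = ingestion_batching_command("MyTable", "00:05:00", 500, 1024)
print(command)
```

Larger limits let Kusto aggregate more data per ingestion, which is the "few large blobs" behavior described above, at the cost of higher ingestion latency.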

b) You can try the Queued writeMode. IIRC, the version in Synapse had an issue where the shards that were to be merged were queried incorrectly, so that could be a cause too. If you still want to use the FlushImmediately flag (not recommended, as it results in no aggregation and many smaller ingestions), please use the queued write option:

.option("writeMode","Queued")

See the writeMode option in https://github.com/Azure/azure-kusto-spark/blob/master/docs/KustoSink.md
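For context, a minimal PySpark-style sketch of a write that uses the Queued mode and leaves out "sparkIngestionPropertiesJson" might look like the following. The helper names and the cluster, database, and table values are hypothetical, and authentication options are deliberately omitted; the option keys and format name are the ones I understand the connector to use, so verify them against KustoSink.md. In Synapse the format name and linked-service option differ from the standalone connector.

```python
def kusto_write_options(cluster, database, table, write_mode="Queued"):
    """Collect connector options for a write (hypothetical helper).

    "sparkIngestionPropertiesJson" is intentionally not set, so batching is
    left to the table's ingestion batching policy.
    """
    return {
        "kustoCluster": cluster,
        "kustoDatabase": database,
        "kustoTable": table,
        "writeMode": write_mode,
    }


def write_to_kusto(df, options, fmt="com.microsoft.kusto.spark.datasource"):
    """Apply the options to df.write and save in Append mode."""
    writer = df.write.format(fmt)
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.mode("Append").save()


# Usage, assuming `df` is an existing Spark DataFrame and auth options are added:
# opts = kusto_write_options("https://mycluster.kusto.windows.net", "MyDb", "MyTable")
# write_to_kusto(df, opts)
```

Keeping the options in a plain dictionary makes it easy to switch the write mode or add authentication settings in one place without touching the write call.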

xiaoshiyi123 commented 10 months ago

Thanks @ag-ramachandran, .option("writeMode","Queued") solved my problem. Thanks for your kind answer and suggestions.