@lgo-solytic, please provide the write options you are using and the cluster name. Please also provide the approximate times of these runs.
@ag-ramachandran,
Options (a hedged usage sketch follows at the end of this comment):
KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS -> "CreateIfNotExist"
KustoSinkOptions.KUSTO_STAGING_RESOURCE_AUTO_CLEANUP_TIMEOUT -> "5"
KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC -> "true"
KustoSinkOptions.KUSTO_WRITE_MODE -> "Queued"
Nothing additional in SparkIngestionProperties
Both jobs were running (streaming) for longer than 48h uninterrupted.
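For reference, here is a minimal sketch of how these options might be wired into a structured streaming write with the Kusto Spark connector. This is not the actual job: the cluster, database, table, authentication values, checkpoint path and trigger are placeholders, and the sink format name is assumed from the connector's usual streaming usage.

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

def startKustoIngestion(df: DataFrame): StreamingQuery = {
  df.writeStream
    .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
    // Connection and authentication (placeholders):
    .option(KustoSinkOptions.KUSTO_CLUSTER, "<cluster>")
    .option(KustoSinkOptions.KUSTO_DATABASE, "<database>")
    .option(KustoSinkOptions.KUSTO_TABLE, "<table>")
    .option(KustoSinkOptions.KUSTO_AAD_APP_ID, "<appId>")
    .option(KustoSinkOptions.KUSTO_AAD_APP_SECRET, "<appSecret>")
    .option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID, "<tenantId>")
    // Options reported in this issue:
    .option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS, "CreateIfNotExist")
    .option(KustoSinkOptions.KUSTO_STAGING_RESOURCE_AUTO_CLEANUP_TIMEOUT, "5")
    .option(KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC, "true")
    .option(KustoSinkOptions.KUSTO_WRITE_MODE, "Queued")
    // Streaming plumbing (placeholders):
    .option("checkpointLocation", "/checkpoints/<job-name>")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .start()
}
```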
@lgo-solytic: BlobAlreadyReceived_BlobAlreadyFoundInBatch is a Kusto warning, see https://learn.microsoft.com/en-us/azure/data-explorer/error-codes#category-blobalreadyreceived (it correlates to a similar case of two blobs getting ingested as well).
Are there any other connector logs you see from this failure? The Spark connector's log messages start with KustoConnector. If you find anything there that correlates, that would be useful.
There is no correlation we can draw between the drop in ingestion volume and this error. Also, you need not use the option KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC -> "true"; when async writes are enabled, exceptions in tasks are not propagated to the driver.
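To illustrate that point, a hedged sketch (placeholder names, connection and authentication options elided): with KUSTO_WRITE_ENABLE_ASYNC set to "false" (synchronous writes), a failed micro-batch surfaces through the streaming query on the driver, where it can be handled or allowed to crash the job rather than being silently swallowed.

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQueryException

def runSynchronousIngestion(df: DataFrame): Unit = {
  val query = df.writeStream
    .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
    // ... KUSTO_CLUSTER / KUSTO_DATABASE / KUSTO_TABLE / auth options as in the sketch above ...
    .option(KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC, "false") // synchronous writes
    .option("checkpointLocation", "/checkpoints/<job-name>")
    .start()

  try {
    query.awaitTermination()
  } catch {
    case e: StreamingQueryException =>
      // With synchronous writes, a failing batch reaches the driver here instead
      // of being swallowed inside an async task; fail loudly (or alert) so the
      // job does not keep checkpointing batches that were never ingested.
      throw e
  }
}
```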
@ag-ramachandran thanks for your answer. No, unfortunately there are no logs that could help. Is it possible to configure the KustoConnector to log to Application Insights?
The same thing happened again last night. But this time only one job was running so at least we can exclude the case of two jobs colliding.
Describe the bug
We are running two Delta streaming jobs that use the Spark connector, both writing to the same Azure Data Explorer table. By design these jobs ingest different timestamp ranges, so writing the same data from both of them is extremely unlikely. On two occasions we had the BlobAlreadyReceived_BlobAlreadyFoundInBatch ingestion error, close in time to one another. After this ingestion error, the Spark job kept running but no data was effectively committed to the Azure Data Explorer table (Delta batches completed with checkpoints) until it was restarted manually. One can see this in the ingestion-volume chart, with the drop in volume starting directly after the error (the ingestion volume is non-zero because of other jobs writing to other tables).
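To make the setup concrete, a hedged illustration (not the actual job code): the two jobs filter the shared source on disjoint timestamp ranges before writing to the same table. The column name eventTimestamp and the cutoff value are hypothetical.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical boundary between the two jobs' ranges.
val cutoff = Timestamp.valueOf("2023-01-01 00:00:00")

// Job A ingests only rows strictly before the cutoff ...
def jobAInput(source: DataFrame): DataFrame =
  source.filter(col("eventTimestamp") < cutoff)

// ... job B ingests only rows at or after it, so the same row should never be
// written to the table by both jobs.
def jobBInput(source: DataFrame): DataFrame =
  source.filter(col("eventTimestamp") >= cutoff)
```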
To Reproduce
We are not sure whether the mentioned errors were caused by one job or by both pushing the same dataset simultaneously. It was not possible to reproduce the issue simply by replaying this Delta stream with both jobs running in parallel.
Expected behavior
In this case we would expect the job either to crash or to continue working normally as before, but not to silently skip data from the subsequent batches. Or is my assumption incorrect, and is there a way to handle this error?
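Not an answer to that question, but a hedged monitoring sketch that could at least make the failure visible: a StreamingQueryListener that logs per-batch progress on the Spark side, so a period where batches keep committing input rows while the ADX ingestion-volume chart stays flat can be spotted and alerted on. The class name and log messages are illustrative only.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}
import org.slf4j.LoggerFactory

// Logs batch id, timestamp and input-row count for every completed micro-batch.
// Comparing these numbers with the ADX ingestion-volume chart makes the
// "checkpoints advance but nothing is ingested" state detectable.
class IngestionProgressLogger extends StreamingQueryListener {
  private val log = LoggerFactory.getLogger(getClass)

  override def onQueryStarted(event: QueryStartedEvent): Unit =
    log.info(s"Query ${event.name} (${event.id}) started")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    log.info(s"Query ${p.name}: batch ${p.batchId} at ${p.timestamp} read ${p.numInputRows} input rows")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    log.warn(s"Query ${event.id} terminated, exception=${event.exception}")
}

// Register once per SparkSession, e.g. at job start-up:
// spark.streams.addListener(new IngestionProgressLogger)
```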
Screenshots
Desktop (please complete the following information):
Connector version: "kusto-spark_3.0" % "5.0.6"
Additional context