Azure / azure-kusto-spark

Apache Spark Connector for Azure Kusto
Apache License 2.0

Data from subsequent batches is skipped after a BlobAlreadyReceived_BlobAlreadyFoundInBatch error #380

Closed lgo-solytic closed 5 days ago

lgo-solytic commented 3 months ago

Describe the bug
We are running two Delta streaming jobs using the Spark connector, both writing to the same Azure Data Explorer table. By design these jobs ingest different timestamp ranges, so writing the same data from both of them is extremely unlikely. On two occasions we got the BlobAlreadyReceived_BlobAlreadyFoundInBatch ingestion error, close in time to one another. After this ingestion error the Spark job kept running, but no data was effectively committed to the Azure Data Explorer table (Delta batches completed with checkpoints) until the job was restarted manually. One can see this in the ingestion volume chart: the drop in volume starts directly after the error (the volume stays non-zero because other jobs write to other tables).
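For context, a minimal sketch of this kind of setup (not the reporter's actual code; the cluster, database, table, paths, and service-principal auth values are all placeholders):

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().getOrCreate()

// Each of the two jobs reads a different timestamp range of the Delta
// source but writes to the same Kusto table (all names are placeholders).
val events = spark.readStream.format("delta").load("/mnt/delta/events")

events.writeStream
  .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
  .option(KustoSinkOptions.KUSTO_CLUSTER, "mycluster.westeurope")
  .option(KustoSinkOptions.KUSTO_DATABASE, "MyDatabase")
  .option(KustoSinkOptions.KUSTO_TABLE, "SharedTable")
  .option(KustoSinkOptions.KUSTO_AAD_APP_ID, sys.env("AAD_APP_ID"))
  .option(KustoSinkOptions.KUSTO_AAD_APP_SECRET, sys.env("AAD_APP_SECRET"))
  .option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID, sys.env("AAD_TENANT_ID"))
  .option("checkpointLocation", "/mnt/checkpoints/job-a")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
```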

To Reproduce
We are not sure whether the mentioned errors were caused by one job or by both pushing the same dataset simultaneously. We could not reproduce the problem by simply replaying this Delta stream with both jobs running in parallel.

Expected behavior
In this case we would expect the job to crash, or to continue working normally as before, but not to skip data from subsequent batches. Or is my assumption incorrect and there is a way to handle this error?
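Not the connector's documented recovery path, just an illustration: one way to approximate "crash instead of skip" is to write each micro-batch in batch mode from foreachBatch, so a failed write throws on the driver and fails the query. A sketch under that assumption (names are placeholders):

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val events = spark.readStream.format("delta").load("/mnt/delta/events")

// If the batch write throws (e.g. on an ingestion failure), the exception
// reaches the driver and the streaming query fails instead of
// checkpointing past the lost data.
events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
      .option(KustoSinkOptions.KUSTO_CLUSTER, "mycluster.westeurope")
      .option(KustoSinkOptions.KUSTO_DATABASE, "MyDatabase")
      .option(KustoSinkOptions.KUSTO_TABLE, "SharedTable")
      // plus KUSTO_AAD_* auth options, as in a real job
      .mode("Append")
      .save()
  }
  .option("checkpointLocation", "/mnt/checkpoints/job-a")
  .start()
```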

Screenshots

(Screenshots: ingestion volume charts showing the drop in volume starting directly after the error.)

Desktop (please complete the following information):
Connector version: "kusto-spark_3.0" % "5.0.6"
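That version string is sbt shorthand; assuming the connector's published Maven group ID, the dependency would look roughly like this (verify the coordinates before use):

```scala
// build.sbt sketch: the connector version reported above
// (group ID assumed from the project's Maven coordinates)
libraryDependencies += "com.microsoft.azure.kusto" %% "kusto-spark_3.0" % "5.0.6"
```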


ag-ramachandran commented 3 months ago

@lgo-solytic, please provide the write options you are using and the cluster name. Please also provide the approximate time of these runs.

lgo-solytic commented 3 months ago

@ag-ramachandran,

Please provide the write options you are using and the cluster name. Please provide the approx time of these runs as well

Options (see the sketch at the end of this comment):

KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS -> "CreateIfNotExist"
KustoSinkOptions.KUSTO_STAGING_RESOURCE_AUTO_CLEANUP_TIMEOUT -> "5"
KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC -> "true"
KustoSinkOptions.KUSTO_WRITE_MODE -> "Queued"

Nothing additional in SparkIngestionProperties

Both jobs were running (streaming) for longer than 48h uninterrupted.
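Collected in one place, a sketch of the options quoted above as they would be passed to the sink (only the four reported options are real; the surrounding code is illustrative):

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions

// The write options reported in this comment, as a map that would be
// passed to the streaming sink via .options(writeOptions), alongside the
// usual cluster/database/table/auth options.
val writeOptions: Map[String, String] = Map(
  KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS -> "CreateIfNotExist",
  KustoSinkOptions.KUSTO_STAGING_RESOURCE_AUTO_CLEANUP_TIMEOUT -> "5",
  KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC -> "true",
  KustoSinkOptions.KUSTO_WRITE_MODE -> "Queued"
)
```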

ag-ramachandran commented 3 months ago

@lgo-solytic: BlobAlreadyReceived_BlobAlreadyFoundInBatch is a Kusto warning, see https://learn.microsoft.com/en-us/azure/data-explorer/error-codes#category-blobalreadyreceived (the category also covers the similar case of the same blob being ingested twice).

Are there any other connector logs you see from this failure? The Spark connector's logs start with KustoConnector. If you find anything in them that correlates, that would be useful.


There is no correlation we can infer between the drop in volumes and this error. You should not use the option KustoSinkOptions.KUSTO_WRITE_ENABLE_ASYNC -> "true": when it is set, exceptions in tasks are not propagated to the driver.
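For example, a sketch of that suggestion applied to the options reported earlier (the async flag is simply dropped so that it falls back to synchronous writes):

```scala
import com.microsoft.kusto.spark.datasink.KustoSinkOptions

// Same options as reported, but with KUSTO_WRITE_ENABLE_ASYNC omitted,
// so ingestion exceptions propagate to the driver and fail the batch
// visibly instead of being silently swallowed.
val writeOptions: Map[String, String] = Map(
  KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS -> "CreateIfNotExist",
  KustoSinkOptions.KUSTO_STAGING_RESOURCE_AUTO_CLEANUP_TIMEOUT -> "5",
  KustoSinkOptions.KUSTO_WRITE_MODE -> "Queued"
)
```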

lgo-solytic commented 3 months ago

@ag-ramachandran thanks for your answer. No, unfortunately there are no logs that could help. Is it possible to configure the KustoConnector to log to Application Insights?

The same thing happened again last night, but this time only one job was running, so at least we can exclude the case of two jobs colliding.