Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB
MIT License

Add logic to handle EOF exception in Streaming checkpoint reads caused by the transient Write flush exception #435

Closed revinjchalil closed 3 years ago

revinjchalil commented 3 years ago

When WASB is used to store streaming checkpoint files, an exception occasionally occurs during flush(), after a valid token has been written and while the checkpoint file is being closed. In this case the token value has actually been written, but it is not readable until the block list is successfully flushed, so reads during this window fail with an EOF exception.

This PR increases the retry count for checkpoint file reads and introduces a short 100-millisecond sleep between retries when this issue occurs during a checkpoint read. In most cases the flush issue recovers within a few retries, so this should take care of the problem. If the read is still not recoverable, it falls back to the backup tokens location when reading from the next tokens location, and vice versa.
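
For readers who want the shape of the retry-then-fallback logic, here is a minimal sketch in Python. It is not the connector's code, which is written in Scala. The names `read_token_with_retry`, `read_primary`, `read_backup` and `FlakyReader` are hypothetical, and the default of 5 retries is an assumption, since the retry count is not stated above. Only the 100-millisecond sleep comes from the description.

```python
import time
from typing import Callable, Optional


def read_token_with_retry(
    read_primary: Callable[[], str],
    read_backup: Callable[[], str],
    max_retries: int = 5,        # assumed value; not given in the PR description
    sleep_seconds: float = 0.1,  # the 100 ms pause between retries
) -> Optional[str]:
    """Read a checkpoint token, retrying on EOFError, then try the other location."""
    for attempt in range(max_retries):
        try:
            return read_primary()
        except EOFError:
            # The token may already be written but unreadable until the flush completes.
            if attempt < max_retries - 1:
                time.sleep(sleep_seconds)
    # The first location never became readable: fall back to the other one once.
    try:
        return read_backup()
    except EOFError:
        return None


class FlakyReader:
    """Hypothetical stand-in for a checkpoint read that fails a fixed number of times."""

    def __init__(self, failures: int, token: str) -> None:
        self.failures = failures
        self.token = token
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise EOFError("checkpoint file not yet readable")
        return self.token
```

Calling `read_token_with_retry(FlakyReader(2, "token-42"), lambda: "backup")` returns the primary token on the third attempt. A reader that never recovers exhausts its retries and the backup reader's value is returned instead. Swapping the two arguments gives the "vice versa" direction.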