Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB
MIT License

Adding BulkSink for streaming writes #441

Open FabianMeiswinkel opened 3 years ago

FabianMeiswinkel commented 3 years ago

A customer is seeing latency issues because their streaming workload is small at steady state (about 4 documents/second) but can spike to tens of thousands of documents during some periods of the day, at which point the AsyncConnection-based write stream implementation is neither fast nor robust enough. An initial attempt to use readStream.forEachBatch and then write each micro-batch to Cosmos DB via batch write works, but shows higher latency for the small steady-state workload. In my own tests, the BulkSink improves that latency by 200-300 ms (still about 200 ms slower than point writes via AsyncConnection, but that is about the expected ballpark).
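For context, the micro-batch workaround described above might look roughly like the PySpark sketch below. It is only an illustration, not the customer's code or the proposed BulkSink. The format name and option keys are the ones I believe this connector uses for batch writes and should be checked against the connector version in use. The helper names (`cosmos_write_config`, `make_batch_writer`, `start_stream`) are hypothetical. Note that in PySpark the hook is spelled `foreachBatch` and lives on `writeStream`.

```python
# Assumption: format name and option keys of the azure-cosmosdb-spark batch
# writer; verify them against the connector version you run.
COSMOS_FORMAT = "com.microsoft.azure.cosmosdb.spark"


def cosmos_write_config(endpoint, master_key, database, collection):
    """Build the option map passed to the batch writer (hypothetical helper)."""
    return {
        "Endpoint": endpoint,
        "Masterkey": master_key,
        "Database": database,
        "Collection": collection,
        "Upsert": "true",
    }


def make_batch_writer(write_config):
    """Return a function suitable for DataStreamWriter.foreachBatch."""

    def write_micro_batch(batch_df, batch_id):
        # Each micro-batch arrives as a regular DataFrame, so the ordinary
        # batch write path is used here instead of the streaming sink.
        (batch_df.write
            .format(COSMOS_FORMAT)
            .mode("append")
            .options(**write_config)
            .save())

    return write_micro_batch


def start_stream(stream_df, write_config):
    """Attach the batch writer to a streaming DataFrame and start the query."""
    return (stream_df.writeStream
            .foreachBatch(make_batch_writer(write_config))
            .start())
```

With this shape every micro-batch goes through the batch write path, however few documents it holds. That matches the behavior reported above: it copes with the bursts but adds latency at about 4 documents/second.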