Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

spark.cosmos.changeFeed.itemCountPerTriggerHint not honoured #483

Closed. roadruuner closed this issue 1 year ago.

roadruuner commented 1 year ago

I am using Spark Structured Streaming with azure-cosmos-spark_3-1_2-12 % "4.16.0".

I need to cap how much data is read per micro-batch. I tried setting the following:

```
spark.cosmos.changeFeed.itemCountPerTriggerHint = 10
spark.cosmos.read.maxItemCount = 10
spark.cosmos.changeFeed.startFrom = "Beginning"
spark.cosmos.changeFeed.mode = "Incremental"
```

and reading with:

```scala
val changeFeedDF = spark.readStream
  .schema(customSchema)
  .format("cosmos.oltp.changeFeed")
  .options(readConfig)
  .load()
```
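For reference, a minimal runnable sketch of that setup, assuming the standard account settings from the connector's configuration reference; the endpoint, key, database, container, and schema below are all placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("cosmos-changefeed-demo").getOrCreate()

// Placeholder schema; replace with the real document schema.
val customSchema = StructType(Seq(
  StructField("id", StringType, nullable = false)
))

// Placeholder connection details; substitute your own account values.
val readConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://<account>.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<account-key>",
  "spark.cosmos.database" -> "<database>",
  "spark.cosmos.container" -> "<container>",
  "spark.cosmos.changeFeed.startFrom" -> "Beginning",
  "spark.cosmos.changeFeed.mode" -> "Incremental",
  "spark.cosmos.changeFeed.itemCountPerTriggerHint" -> "10"
)

val changeFeedDF = spark.readStream
  .schema(customSchema)
  .format("cosmos.oltp.changeFeed")
  .options(readConfig)
  .load()
```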

FabianMeiswinkel commented 1 year ago

Hi, itemCountPerTriggerHint lets you bound the max. memory/resource consumption per micro-batch. It is only a hint because the change feed in Cosmos DB will always include at least all documents of a single atomic transaction (they all share the same LSN, the log sequence number, because they were modified in the same atomic transaction). So you will always get at least the documents of one atomic transaction per physical partition. From a memory-footprint/resource-consumption perspective that should still be more than sufficient, because the number of document updates in a transaction is also capped: in the worst case a single bulk/batch might update around 1,000-5,000 documents, and only if they are really very small.
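One way to observe this behavior (a sketch, reusing the changeFeedDF stream from the question; the checkpoint path is a placeholder) is to log the actual size of each micro-batch with foreachBatch; per-batch counts can land above the hint whenever documents share an LSN:

```scala
import org.apache.spark.sql.DataFrame

// Log how many rows each micro-batch actually delivers. With
// itemCountPerTriggerHint = 10, a batch can still exceed 10 rows per
// physical partition when documents were changed in one atomic transaction.
val query = changeFeedDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    println(s"micro-batch $batchId delivered ${batch.count()} rows")
  }
  .option("checkpointLocation", "/tmp/cosmos-changefeed-checkpoint") // placeholder
  .start()

query.awaitTermination()
```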

The right repository for the Cosmos Spark connector for Spark 3.* is https://github.com/Azure/azure-sdk-for-java.

Config details can be found here: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md