Closed roadruuner closed 1 year ago
Hi, `itemCountPerTriggerHint` lets you cap the maximum memory/resource consumption per micro-batch. It is only a hint because the change feed in Cosmos DB will always include at least all documents of a single atomic transaction (they all share the same LSN, or log sequence number, because they were modified in the same atomic transaction). So you will always get at least the documents of one atomic transaction per physical partition. From a memory-footprint/resource-consumption perspective that should still be more than sufficient, because the number of document updates in a transaction is also capped: in the worst case a single bulk/batch might update around one to five thousand documents, and only if they are really very small.
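For illustration, here is a minimal sketch of how the hint can be passed to a structured-streaming read. The endpoint, key, database, and container values are hypothetical placeholders, not taken from the question:

```scala
// Sketch: cap the per-micro-batch read volume via itemCountPerTriggerHint.
// Endpoint/key/database/container values are hypothetical placeholders.
val readConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://myaccount.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<accountKey>",
  "spark.cosmos.database" -> "myDatabase",
  "spark.cosmos.container" -> "myContainer",
  "spark.cosmos.changeFeed.startFrom" -> "Beginning",
  "spark.cosmos.changeFeed.mode" -> "Incremental",
  // Hint only: a micro batch can still exceed this value when a single
  // atomic transaction (all documents sharing one LSN) contains more items.
  "spark.cosmos.changeFeed.itemCountPerTriggerHint" -> "10"
)
```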
The right repository for the Cosmos Spark Connector for Spark 3.* is https://github.com/Azure/azure-sdk-for-java. The configuration reference is available at https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md
I am using Spark Structured Streaming with `azure-cosmos-spark_3-1_2-12` % "4.16.0".
I need to cap how much data is read per micro-batch. I tried setting the following:
- `spark.cosmos.changeFeed.itemCountPerTriggerHint = 10`
- `spark.cosmos.read.maxItemCount = 10`
- `spark.cosmos.changeFeed.startFrom = "Beginning"`
- `spark.cosmos.changeFeed.mode = "Incremental"`
Using:

```scala
val changeFeedDF = spark.readStream
  .schema(customSchema)
  .format("cosmos.oltp.changeFeed")
  .options(readConfig)
  .load
```
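In case it helps, a hedged sketch of how that stream might be drained, assuming `changeFeedDF` is the streaming DataFrame from the `readStream` call in the question; the sink and checkpoint path are placeholders, not part of the original setup:

```scala
// Sketch: attach a sink so each micro batch (bounded, as a hint, by
// itemCountPerTriggerHint) is processed; sink and path are placeholders.
val query = changeFeedDF.writeStream
  .format("console")                                      // placeholder sink
  .option("checkpointLocation", "/tmp/cosmos-checkpoint") // placeholder path
  .outputMode("append")
  .start()

query.awaitTermination()
```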