ChangeFeedStartFrom confusion

festivus commented 2 years ago

I'm confused by the change feed. I was hoping to be able to cycle through all the changes to a single document in a container from the beginning of the container, but it seems only the latest change is ever returned. So my question is what is the point of setting the ChangeFeedStartFrom when you only ever get the latest change?

bartelink commented 2 years ago

The change feed presents all the documents/items in the container. Within a given logical partition, you're guaranteed to see the oldest updates first. If you see a given document along the journey, but it's updated after you saw it, you'll see it again. That lets you stay in sync with the latest state of everything.

The start point defines how far back in history you want to start from.

For instance, if you want to only sync stuff that's been inserted/updated in the last year, you can filter out older stuff using this mechanism.

(There is a Full Fidelity ChangeFeed feature in private beta that can show every update, but that's only within a constrained timespan i.e. 24h)
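To make the "latest version only" behaviour and the role of the start point concrete, here is a minimal toy model in Python. It is not the Cosmos DB SDK: `ToyContainer`, `upsert`, `read_changes` and the integer `ts` clock are all hypothetical names invented for this sketch, and it ignores partitions entirely.

```python
from dataclasses import dataclass


@dataclass
class Item:
    id: str
    value: str
    ts: int  # last-modified time (a hypothetical stand-in for a timestamp)


class ToyContainer:
    """Toy model: holds exactly one (latest) version per item id."""

    def __init__(self):
        self._items = {}
        self._clock = 0

    def upsert(self, item_id, value):
        self._clock += 1
        # Overwrites the previous version; no history is kept anywhere.
        self._items[item_id] = Item(item_id, value, self._clock)

    def read_changes(self, start_from=0):
        """Latest version of every item last modified after start_from, oldest first."""
        changed = [item for item in self._items.values() if item.ts > start_from]
        return sorted(changed, key=lambda item: item.ts)


container = ToyContainer()
container.upsert("a", "v1")  # ts=1
container.upsert("b", "v1")  # ts=2
container.upsert("a", "v2")  # ts=3, replaces a/v1
container.upsert("a", "v3")  # ts=4, replaces a/v2

from_beginning = [(i.id, i.value) for i in container.read_changes(0)]
from_ts_2 = [(i.id, i.value) for i in container.read_changes(2)]
print(from_beginning)  # [('b', 'v1'), ('a', 'v3')]
print(from_ts_2)       # [('a', 'v3')]
```

Reading from the beginning yields one entry per item, not one per update. The start point only decides which items are old enough to be skipped.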

festivus commented 2 years ago

So, for me, when I see this comment "For instance, if you want to only sync stuff that's been inserted/updated in the last year, you can filter out older stuff using this mechanism.", I read that as saying I will see all updates. Should that really read "inserted/last updated"?

The documentation is confusing to me. I see things that lead me to believe that I can go back through a container item's history, but that is not in fact the case unless I use the change processor push model (from the very beginning) or build a pull model. But again, either has to have been running from the get go in order to get the full history.

bartelink commented 2 years ago

> i read that as i will see all updates. Should that really read "inserted/last updated"??

If you obtained it from the changefeed 2s ago, and you update it in reaction to seeing it, you will see it again a second later.

Another observer may see only the final state. I was seeking to convey that you are guaranteed to observe the final state.

> The documentation is confusing to me. I see things that lead me to believe that I can go back through a container item's history, but that is not in fact the case

I'd suggest raising issues on the doc site regarding anything that confuses you. It's definitely counterintuitive - the name does not help convey what it does.

If the doc site doesn't permit PRs or comments, log it here and someone will route the issue to the right person.

There is nothing stashed to be fed to anyone. There's a cursor per physical partition. The Change Feed Processor (CFP) happens to maintain those per-partition cursors as documents/items in an 'aux' container (which double up as markers as to which instance of a cluster of consumers has leased that partition at the moment). It's not entirely unlike Kafka in terms of how the offsets are held per physical partition. Where the analogy breaks down is that there are no messages/copies in brokers being routed around; it's just continuous queries running against the container's nodes.
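As a rough illustration of "a cursor per physical partition, kept as a lease document", here is a toy Python sketch. The lease field names (`owner`, `cursor`), the helper functions and the integer `ts` values are hypothetical and are not the real lease schema; each poll is modelled as a query over the partition's current items rather than a read from a stored message log.

```python
def make_leases(partition_ids):
    # One lease document per physical partition (field names are hypothetical).
    return {pid: {"owner": None, "cursor": 0} for pid in partition_ids}


def acquire(leases, pid, instance):
    """Claim a partition for one consumer instance; refuse if another holds it."""
    lease = leases[pid]
    if lease["owner"] in (None, instance):
        lease["owner"] = instance
        return True
    return False


def poll(partition_items, lease):
    """Query one partition for items modified after its cursor, oldest first."""
    batch = sorted(
        (item for item in partition_items if item["ts"] > lease["cursor"]),
        key=lambda item: item["ts"],
    )
    if batch:
        lease["cursor"] = batch[-1]["ts"]  # checkpoint this partition's cursor only
    return [item["id"] for item in batch]


partitions = {
    "p0": [{"id": "a", "ts": 1}, {"id": "c", "ts": 5}],
    "p1": [{"id": "b", "ts": 3}],
}
leases = make_leases(partitions)

got_p0 = acquire(leases, "p0", "worker-1")
stolen = acquire(leases, "p0", "worker-2")  # already leased by worker-1
first = poll(partitions["p0"], leases["p0"])
second = poll(partitions["p0"], leases["p0"])
print(got_p0, stolen)          # True False
print(first, second)           # ['a', 'c'] []
print(leases["p1"]["cursor"])  # 0
```

Each partition advances independently, which is why ordering is only meaningful within a partition.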

> unless I use the change processor push model (from the very beginning) or build a pull model.

I have no idea what you mean by this but the bottom line is that no matter what way you look at it, there is a single copy of any item/document in the system at any moment in time (ok there are replicas, and you can get a stale version depending on the consistency model, and multi-master model even means there can be multiple modified versions in flight etc).

Push vs Pull does not change any of those facts.

> But again, either has to have been running from the get go in order to get the full history.

So, again, I need to repeat: there is no "full history"

However, you do have the ability to walk all the documents in a consistent way. What I wouldn't give to have such a feature in the thing I'm working on right now ;)

bartelink commented 2 years ago

> i read that as i will see all updates.

Note I was attempting to convey the use case for the feature. TL;DR if I have a TB of data and I maintain some derived model, I can avoid having to walk all the documents, if I know that there will not be information that's relevant to my needs contained in documents last touched (inserted or updated) before a given cutoff.

One other thing to note with the CFP wrt this is that the starting date is only used when you establish a given change feed 'cursor' - i.e. if you want to restart that crawl from that nominated position, you are out of luck. (See https://github.com/Azure/azure-cosmos-dotnet-v3/issues/510)
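The "start point is only honoured when the cursor is first established" behaviour can be sketched as follows. This is a toy Python model, not SDK code: `lease_store`, `start_or_resume` and the processor names are hypothetical, and the last line reflects a commonly used workaround (starting under a fresh name so no stored cursor exists), which is an assumption here rather than something stated in this thread.

```python
def start_or_resume(lease_store, processor_name, start_from):
    """Return the cursor to read from.

    start_from is only applied when no cursor has been stored yet for this
    processor; afterwards the stored cursor always wins.
    """
    if processor_name not in lease_store:
        lease_store[processor_name] = start_from
    return lease_store[processor_name]


lease_store = {}

first_run = start_or_resume(lease_store, "sync", 100)   # no cursor yet: start at 100
lease_store["sync"] = 250                               # processing checkpoints progress
second_run = start_or_resume(lease_store, "sync", 0)    # requested start is ignored
fresh_run = start_or_resume(lease_store, "sync-v2", 0)  # new name, so the start applies

print(first_run, second_run, fresh_run)  # 100 250 0
```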

I have logged some issues like this, including ones asking for doc updates, in https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet - you may find that reading those threads helps you understand things better. (Note the CFP lib is effectively unmaintained: this V3 repo integrates the change feed support with the client library, whereas in V2 times the CFP lib was an OSS extra to a closed source client. While the pull model etc. has been added, the fundamentals are very much derived from (and can interoperate with) that.)

festivus commented 2 years ago

Thank you, I really appreciate the response. It's clarified some things for me.

When I mentioned the push/pull model and having to start from the beginning, this is what I meant: if I want to track an item's full history, I need to do this on my own, using either the push or the pull model. And if I want the full history from the beginning, I need to have that process in place from the beginning. I can't retroactively put a push/pull consumer in place and expect to see the entire history of an item.

bartelink commented 2 years ago

OK, but for the avoidance of doubt, you can still miss updates. There's a loop polling the start of the thing. Any time an item is updated, it will move 'past your cursor', and you will see it again. But if you do two updates, and you don't read and process the first one before it jumps ahead of your cursor again, you won't even be aware it had two updates.

Note also that you are only guaranteed to 'read your writes' if the consistency model says so. For example, if you are running a CFP consumer and reading from the Container in the same process (as opposed to using the document copy from the Change Feed) with Session Consistency, you need to be using the same CosmosClient for the Session Token to be correctly synced; only then are you guaranteed to see the same thing you just wrote.
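The missed-update case described above can be reproduced with a small toy model in Python. Everything here (`items`, `upsert`, `poll`, the integer clock) is hypothetical and only mimics the "one version per item, cursor moves forward" idea; it is not how the SDK is implemented.

```python
items = {}  # id -> (value, ts): the container keeps one version per id
clock = 0


def upsert(item_id, value):
    """Replace the stored version of an item and bump its last-modified time."""
    global clock
    clock += 1
    items[item_id] = (value, clock)


def poll(cursor):
    """Return items modified after the cursor (oldest first) and the new cursor."""
    changes = sorted(
        (ts, item_id, value)
        for item_id, (value, ts) in items.items()
        if ts > cursor
    )
    new_cursor = changes[-1][0] if changes else cursor
    return [(item_id, value) for _, item_id, value in changes], new_cursor


seen = []
cursor = 0

upsert("a", "v1")
batch, cursor = poll(cursor)
seen += batch

upsert("a", "v2")  # overwritten before the consumer polls again
upsert("a", "v3")
batch, cursor = poll(cursor)
seen += batch

print(seen)  # [('a', 'v1'), ('a', 'v3')]
```

The consumer was running the whole time and still never observed `v2`, because the second and third writes both landed between two polls.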

festivus commented 2 years ago

Excellent explanation, thank you.

Azure / azure-cosmos-dotnet-v3

ChangeFeedStartFrom confusion #3144