bluesky / bluesky-kafka

Kafka integration for bluesky

Add msg.partition as an argument to BlueskyConsumer.process_document #17

Open gwbischof opened 4 years ago

gwbischof commented 4 years ago

RunRouter cannot always tell which run a datum or resource document belongs to, because resource documents don't always carry a run_start uid. When the run for a resource or datum can't be identified, the document is distributed to all runs. https://github.com/bluesky/event-model/blob/master/event_model/__init__.py#L1295
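To make the failure mode concrete, here is a toy model of that fallback. It is a simplified stand-in written for this issue, not event_model's actual RunRouter code; the class name `ToyRunRouter` and the document contents are made up for illustration.

```python
class ToyRunRouter:
    """Toy stand-in for the fallback described above (not event_model's code)."""

    def __init__(self):
        # run_start uid -> list of (name, doc) delivered to that run
        self.runs = {}

    def __call__(self, name, doc):
        if name == "start":
            self.runs[doc["uid"]] = [(name, doc)]
            return
        run_uid = doc.get("run_start")
        if run_uid in self.runs:
            # The document names its run, so it goes to that run only.
            self.runs[run_uid].append((name, doc))
        else:
            # No usable run_start: deliver the document to every known run.
            for docs in self.runs.values():
                docs.append((name, doc))


router = ToyRunRouter()
router("start", {"uid": "run-a"})
router("start", {"uid": "run-b"})
router("resource", {"uid": "res-1"})  # no run_start key
```

After these three calls, both `router.runs["run-a"]` and `router.runs["run-b"]` contain the resource document, which is exactly what an archiver does not want.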

I am planning to work on an archiver consumer this week that will create an archive file for each run. This behavior is a problem for the archiver, because we don't want documents from one run to end up in the archive of a different run.

We can solve this problem by passing msg.partition to BlueskyConsumer.process_document. https://github.com/bluesky/bluesky-kafka/blob/master/bluesky_kafka/__init__.py#L405

The Producer publishes runs to a partition based on a UID. A consumer that subscribes to a topic receives documents from multiple partitions. If it passes documents from all partitions directly to a single RunRouter, this is equivalent to merging all of the partitions together, and the RunRouter cannot perfectly separate the runs because run_start is missing from some resource documents.
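As a rough illustration of key-based partition assignment, the sketch below maps a key to a partition with a hash modulo the partition count. The function `toy_partition` is hypothetical, and `zlib.crc32` is only a stand-in for whatever hash the Kafka client really uses; the uids and the partition count are made up.

```python
import zlib


def toy_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys, one per run, spread over three partitions.
uids = ["run-a", "run-b", "run-c", "run-d"]
assignment = {uid: toy_partition(uid, 3) for uid in uids}
```

Every document published with the same key lands on the same partition. With more keys than partitions, at least two keys necessarily share a partition, whichever hash is used.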

If we pass msg.partition to process_document, we can keep the partitions isolated, which can solve this problem.
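A minimal sketch of what that isolation could look like, assuming process_document gained a `partition` argument (that signature is the proposal here, not the current bluesky-kafka API). `PartitionIsolatingDispatcher` and `RecordingRouter` are hypothetical names; in practice the factory would build a RunRouter instead of the recorder used here to keep the example self-contained.

```python
from collections import defaultdict


class RecordingRouter:
    """Placeholder for a per-partition RunRouter; it just records documents."""

    def __init__(self):
        self.documents = []

    def __call__(self, name, doc):
        self.documents.append((name, doc))


class PartitionIsolatingDispatcher:
    """Keep one router per partition so partitions are never merged."""

    def __init__(self, router_factory):
        # A new router is created the first time a partition is seen.
        self.routers = defaultdict(router_factory)

    def process_document(self, topic, name, doc, partition):
        # `partition` is the proposed extra argument (an assumption).
        self.routers[partition](name, doc)


dispatcher = PartitionIsolatingDispatcher(RecordingRouter)
dispatcher.process_document("topic", "start", {"uid": "run-a"}, partition=0)
dispatcher.process_document("topic", "start", {"uid": "run-b"}, partition=1)
dispatcher.process_document("topic", "resource", {"uid": "res-1"}, partition=1)
```

The resource document without a run_start only ever reaches the router for partition 1, so it cannot leak into the run on partition 0.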

Is it possible to do two BlueskyRuns at the same time and have them end up on the same partition, so that we would need to separate runs that are part of the same stream? If not, then we know that each time we see a new start document on a partition, a new run has started.
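If the answer is no, splitting a single partition's stream into runs is straightforward. The sketch below relies on that assumption (runs on one partition never interleave); `split_runs` is a hypothetical helper and the documents are made up.

```python
def split_runs(documents):
    """Group (name, doc) pairs from ONE partition into runs.

    Assumes runs on a partition do not interleave, so every start
    document marks the beginning of a new run.
    """
    runs = []
    for name, doc in documents:
        if name == "start":
            runs.append([])
        if not runs:
            # Document seen before any start; skipped in this sketch.
            continue
        runs[-1].append((name, doc))
    return runs


partition_stream = [
    ("start", {"uid": "run-a"}),
    ("resource", {"uid": "res-1"}),  # no run_start needed to place it
    ("stop", {"run_start": "run-a"}),
    ("start", {"uid": "run-b"}),
    ("stop", {"run_start": "run-b"}),
]
runs = split_runs(partition_stream)
```

Here the resource document is attributed to run-a purely by its position in the stream. If two runs can overlap on one partition, this grouping would be wrong and the runs would still need to be separated some other way.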