I would like to start by testing how Kafka performs on its own, without Streams, so that when and if we try Streams we can easily identify whether or not it is an improvement.
Thanks for this useful summary.
To revive a point I made in a long-lost thread somewhere in Microsoft Teams, committing to Kafka and using Mongo transactions protects against a class of likely errors. It's true that if we lose network between committing a Mongo transaction and then committing to Kafka that the message has been processed, we'll have a problem. But any errors that happen inside that transaction will be correctly reported back to Kafka as a failure in need of a retry, and that is valuable. Examples of errors this would protect us from include index uniqueness violations, internal Mongo server problems, and document size overruns. In those scenarios, my understanding/expectation is that Kafka can give us some slack, absorbing a backlog for a time while we resolve the issue.
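To make the pattern above concrete, here is a minimal sketch of a consumer that wraps the Mongo write in a transaction and commits the Kafka offset only after the transaction succeeds. The topic, database, and deserialization choices are placeholders, not taken from our code, and a Mongo replica set is assumed (transactions require one).

```python
import json

from confluent_kafka import Consumer
from pymongo import MongoClient

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'mongo-writer',
    'enable.auto.commit': False,   # commit offsets manually, after Mongo succeeds
})
consumer.subscribe(['documents'])   # placeholder topic name

client = MongoClient('localhost', 27017)
db = client['databroker']           # placeholder database name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    name, doc = json.loads(msg.value())   # placeholder deserialization

    # Any pymongo error raised here (uniqueness violation, oversized document,
    # server error) aborts the transaction, and because the offset was never
    # committed, Kafka will redeliver the message for a retry.
    with client.start_session() as session:
        with session.start_transaction():
            db[name].insert_one(doc, session=session)

    # Only reached if the Mongo transaction committed.
    consumer.commit(message=msg)
```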
Streams is not a performance improvement. It is slower than using a Producer and Consumer directly because of the overhead of exactly-once processing. I'm not suggesting that we use it now, but keep it in mind if we want to build a stream processor, which reads its input from Kafka and writes its output back to Kafka. I also think we should consider separating the processing step from the writing-to-disk/database step.
Notes from call:
Streams are interesting for future use cases where we consume from one topic and publish to another. One example for this is database migrations. But it is not on the critical path for now.
Transactions are not actually relevant to our concerns. We are publishing and consuming documents one at a time. We never have the situation "Do this complete sequence of tasks or none at all" in the present use case.
Currently, if something goes wrong and a document cannot be saved (e.g. uid collision, over-sized document), pymongo raises an exception which kills the scan, so the user knows immediately that they are not taking data. When we insert a message bus, as long as the Kafka Producer is able to reach the Kafka Broker, the scan will proceed even if the consumer connected to MongoDB is raising exceptions. In a worst-case scenario, the user might work through the night taking data which is not actually being saved!
In the short term (i.e. in the initial roll-out) we will issue error messages in the logs that clearly distinguish between true duplicates (tried to insert this, but a document with the exact same contents already exists in the database), which indicate a message bus hiccup, and uid collisions (tried to insert this, but a document with different contents already exists, and thus this data will not be saved), which indicate a critical failure that might be due to document manipulation upstream of the Producer. These might be ERROR and CRITICAL level log messages, respectively. As a policy, we will only subscribe Producers that write to the raw documents topic directly to the RunEngine.
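A hedged sketch of that logging policy: on a `DuplicateKeyError`, fetch the existing document and compare contents to decide whether this is a harmless redelivery (ERROR) or a genuine uid collision (CRITICAL). Collection and field names here are illustrative, not taken from the codebase.

```python
import logging

from pymongo.errors import DuplicateKeyError

log = logging.getLogger(__name__)

def insert_with_diagnostics(collection, doc):
    try:
        collection.insert_one(doc)
    except DuplicateKeyError:
        existing = collection.find_one({'uid': doc['uid']}, projection={'_id': False})
        doc_without_id = {k: v for k, v in doc.items() if k != '_id'}
        if existing == doc_without_id:
            # Same uid, same contents: a message-bus hiccup (redelivery); data is safe.
            log.error("Document %s was already inserted with identical contents; ignoring.",
                      doc['uid'])
        else:
            # Same uid, different contents: this data is NOT being saved.
            log.critical("uid collision for %s; this document is not being saved!",
                         doc['uid'])
            raise
```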
In the longer term, we will add some channel for feedback so that a suspender can interrupt a scan in the event that data may not be being saved.
This was fixed in suitcase-mongo normalized.
Definition: A stream processing application is a program that reads from Kafka, processes the data, and writes the result back to Kafka.
There are a bunch of ways that duplicate documents can be introduced.
The idempotent producer solves the problem of duplicates when writing to Kafka: the broker deduplicates messages that result from producer retries.
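Enabling the idempotent producer is a one-line configuration change in confluent-kafka (assuming that client is the one we use; the topic and payload below are placeholders):

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,   # broker discards duplicates caused by producer retries
    # acks=all and a bounded number of in-flight requests are implied by this setting.
})
producer.produce('documents', value=b'...')   # placeholder topic and payload
producer.flush()
```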
Kafka transaction semantics were added to enable exactly-once processing. They can make reading from a topic, processing, and writing the result to another topic an atomic operation. This works for processing applications that don't require any state. It uses a Kafka transaction coordinator process to achieve atomicity.
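A sketch of that consume-process-produce pattern with confluent-kafka: the consumed offsets and the produced results are committed atomically by the transaction coordinator. Topic names and the "processing" step are placeholders.

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'processor',
    'enable.auto.commit': False,
    'isolation.level': 'read_committed',
})
consumer.subscribe(['raw-documents'])

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'transactional.id': 'processor-1',   # must be unique per producer instance
})
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce('processed-documents', value=msg.value())   # placeholder "processing"
    # Commit the consumer's position as part of the same transaction.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```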
Kafka Streams has a general exactly-once solution built in, so you don't have to worry about it: https://kafka.apache.org/documentation/streams/ Kafka Streams has both a Producer and a Consumer in it; it is a stream processing application. Kafka Streams can also handle stream processing with state, because it keeps an additional local database (a state store) and also writes some information back to Kafka, so it can recover where it left off when the process restarts. Kafka Streams only has an unofficial Python library; the official one is in Java.
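If the unofficial Python library meant here is Faust (a Python take on the Kafka Streams model; this is an assumption, the thread does not name it), a stream processor looks roughly like this. App and topic names are placeholders.

```python
import faust

app = faust.App('document-processor', broker='kafka://localhost:9092')
raw = app.topic('raw-documents')
processed = app.topic('processed-documents')

@app.agent(raw)
async def process(documents):
    async for doc in documents:
        # ... transform the document here ...
        await processed.send(value=doc)

if __name__ == '__main__':
    app.main()
```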
My conclusion, so far, is that Streams looks really useful. We may want to separate processing from writing to disk/database and do our processing as stream processing, where the output is written back to Kafka.
Exactly once with mongo: I don't think committing to Kafka after each document solves the exactly-once problem, because, for example, the network could fail between writing to Mongo and the commit. I think the same is true if we use Mongo transactions. Mongo transactions could be good for reducing the commit overhead, but I don't think they solve the exactly-once problem. So far I think the best solution is to use Mongo to dedupe the documents. This is the same strategy that Kafka uses to avoid duplicates from the Producer.
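One way to "let Mongo dedupe" is to make the write itself idempotent by keying on the document's uid, so a redelivered message overwrites an identical document instead of creating a duplicate. This is only a sketch; the collection name is a placeholder and a unique index on `uid` is assumed.

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['databroker']['run_start']   # placeholder names

def save(doc):
    # upsert=True: insert if absent, replace (with identical contents) if the
    # same message was already processed before a crash or a lost offset commit.
    collection.replace_one({'uid': doc['uid']}, doc, upsert=True)
```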
When we are writing to a file, we could commit to kafka only after the file is completed.
I think a little more investigation is needed. We should decide whether the unofficial Python Streams library is usable, whether it would be better to do stream processing in Java, or whether we should not use Streams at all.