To start with, let's write a topology that consumes positions and inst ref data, "joins" the two based on xrefs (using the latest xref mapping available), and then emits the position + inst_id to a new topic (let's call it normalized_positions[scenario suffix]). We would like to use "exactly-once" semantics, allow partitioned consumption of data, and be fault-tolerant.
Let's walk through a scenario to illustrate the desired behavior (some fields excluded for brevity):

1. A position event shows up with [RIC=IBM.N, ACCT=ABC, QTY=100, KD=x, ED=y, ...]. No xref mapping is available yet, so we should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=100, KD=x, ED=y, ...] + [INST_ID=DERP] (a placeholder for "unmapped").
2. An inst ref data event shows up with [RIC=IBM.N, INST_ID=123, ...]. We should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=100, KD=x, ED=y, ...] + [INST_ID=123].
3. A position event shows up with [RIC=IBM.N, ACCT=ABC, QTY=300, KD=x, ED=z, ...]. We should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=300, KD=x, ED=z, ...] + [INST_ID=123].
4. An inst ref data event shows up with [RIC=IBM.N, INST_ID=781, ...]. We should write two events to normalized_positions: [RIC=IBM.N, ACCT=ABC, QTY=300, KD=x, ED=z, ...] + [INST_ID=781] and [RIC=IBM.N, ACCT=ABC, QTY=100, KD=x, ED=y, ...] + [INST_ID=781]. The key question here is how far back in time we should go. I think it's fair to say that we need to go back and replay the last 3 business days of events (using event time, not ingestion time), but this is worth discussing with @paulo-reichert, @j-barber, and @ari-galatea.
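The scenario above can be modeled in plain Java before we touch Kafka at all, to pin down the join/re-emit semantics: keep the latest INST_ID per RIC, retain positions keyed by (ACCT, ED), and on an xref update re-emit every retained position for that RIC. This is a minimal in-memory sketch, not the eventual topology; the field names and the DERP placeholder follow the scenario, and the replay window is ignored (all retained positions are re-emitted).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NormalizerSketch {
    // latest INST_ID per RIC (the xref mapping)
    private final Map<String, String> xrefByRic = new HashMap<>();
    // retained positions per RIC, keyed by (ACCT, ED) so distinct
    // effective dates co-exist, as in the scenario
    private final Map<String, Map<String, Map<String, String>>> positionsByRic = new HashMap<>();
    // emitted normalized_positions events, in order
    public final List<Map<String, String>> emitted = new ArrayList<>();

    public void onPosition(Map<String, String> pos) {
        String ric = pos.get("RIC");
        String posKey = pos.get("ACCT") + "|" + pos.get("ED");
        positionsByRic.computeIfAbsent(ric, r -> new LinkedHashMap<>()).put(posKey, pos);
        emit(pos, xrefByRic.getOrDefault(ric, "DERP")); // placeholder when unmapped
    }

    public void onInstRef(Map<String, String> ref) {
        String ric = ref.get("RIC");
        xrefByRic.put(ric, ref.get("INST_ID"));
        // re-emit every retained position for this RIC with the new mapping
        for (Map<String, String> pos : positionsByRic.getOrDefault(ric, Map.of()).values()) {
            emit(pos, ref.get("INST_ID"));
        }
    }

    private void emit(Map<String, String> pos, String instId) {
        Map<String, String> out = new LinkedHashMap<>(pos);
        out.put("INST_ID", instId);
        emitted.add(out);
    }

    // small helper to build a record from key/value pairs
    public static Map<String, String> rec(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        NormalizerSketch n = new NormalizerSketch();
        n.onPosition(rec("RIC", "IBM.N", "ACCT", "ABC", "QTY", "100", "KD", "x", "ED", "y"));
        n.onInstRef(rec("RIC", "IBM.N", "INST_ID", "123"));
        n.onPosition(rec("RIC", "IBM.N", "ACCT", "ABC", "QTY", "300", "KD", "x", "ED", "z"));
        n.onInstRef(rec("RIC", "IBM.N", "INST_ID", "781"));
        for (Map<String, String> e : n.emitted) System.out.println(e);
    }
}
```

Running the scenario through this produces five output events: the QTY=100 position with DERP, then with 123; the QTY=300 position with 123; then both positions re-emitted with 781.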
Let's start exploring Kafka Streams (using the Java DSL if possible): https://docs.confluent.io/current/streams/index.html
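A first-pass DSL topology might look like the sketch below: positions as a KStream left-joined against inst ref data as a KTable, keyed by RIC, emitting to normalized_positions. Topic names, String serdes, and keying-by-RIC are all assumptions here, and note this only covers the forward direction (steps 1-3 of the scenario); re-emitting existing positions when ref data changes (step 4) would need a table-table join or a custom processor on top of this.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class NormalizedPositionsTopology {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Position events, assumed to be keyed by RIC upstream
        KStream<String, String> positions = builder.stream("positions");
        // Inst ref data as a changelog table: latest INST_ID per RIC
        KTable<String, String> instRef = builder.table("inst_ref_data");

        // Left join so positions flow even before a mapping exists;
        // "DERP" stands in for the unmapped placeholder from the scenario
        positions
            .leftJoin(instRef, (pos, instId) ->
                pos + ",INST_ID=" + (instId == null ? "DERP" : instId))
            .to("normalized_positions");

        return builder.build();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "position-normalizer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once semantics ("exactly_once_v2" on Kafka 3.0+)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        new KafkaStreams(buildTopology(), props).start();
    }
}
```

Partitioned consumption comes for free if positions and inst_ref_data are co-partitioned by RIC, and fault tolerance comes from the changelog-backed state behind the KTable; both are worth validating against our actual topic keying.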
Once we have this behavior in place, let's run the simple scenario through it and document the resulting performance here: https://docs.google.com/spreadsheets/d/1uyF95rRDpu8VNvaGw4osMspev5LZruh1UknGWh-d3Ko/edit?usp=drive_web&ouid=115206558195598709955