galatea-associates / kafka-stress-test-poc

A stress testing application for Apache Kafka
0 stars 0 forks source link

Read data for "Scenario Simple" and normalize position using kafka streams #63

Open GalateaRaj opened 5 years ago

GalateaRaj commented 5 years ago

Let's start to explore kafka streams (using the java dsl if possible): https://docs.confluent.io/current/streams/index.html

To start with, let's write a topology that consumes positions and inst ref data, "joins" the two based on xrefs (using latest xref mapping that is available), and then emits the position + inst_id to a new topic (let's call it normalizedpositions[scenario suffix]). We would like to utilize "exactly-once" semantics, allow partitioned consumption of data, and be fault-tolerant.

Let's walk through a scenario to illustrate the desired behavior

  1. Position event shows up with [RIC=IBM.N, ACCT=ABC, QTY=100,KD=x,ED=y,...] (I've excluded some fields for brevity)
  2. We should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=100,KD=x,ED=y,...] + [INST_ID=DERP]
  3. Inst Ref Data event shows up with [RIC=IBM.N, INST ID=123,...] (I've excluded some fields for brevity)
  4. We should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=100,KD=x,ED=y,...] + [INST_ID=123]
  5. Position event shows up with [RIC=IBM.N, ACCT=ABC, QTY=300,KD=x,ED=z,...]
  6. We should write an event to normalized_positions that has [RIC=IBM.N, ACCT=ABC, QTY=300,KD=x,ED=z,...] + [INST_ID=123]
  7. Inst Ref Data event shows up with [RIC=IBM.N, INST ID=781,...]
  8. We should write two events to normalized_positions: [RIC=IBM.N, ACCT=ABC, QTY=300,KD=x,ED=z,...] + [INST_ID=781] and [RIC=IBM.N, ACCT=ABC, QTY=100,KD=x,ED=y,...] + [INST_ID=781] . Of course the key question here is how far back in time we should go. I think it's fair to say that we need to go back and replay the last 3 business days of events (using event time not ingestion time) but this is something worth discussing with @paulo-reichert , @j-barber , and @ari-galatea .

Once we have this behavior in place, let's run the simple scenario through and document our resulting performance here: https://docs.google.com/spreadsheets/d/1uyF95rRDpu8VNvaGw4osMspev5LZruh1UknGWh-d3Ko/edit?usp=drive_web&ouid=115206558195598709955

GalateaRaj commented 5 years ago

@paulo-reichert , @j-barber , @ari-galatea - feel free to add any other details you would like to explore for this use case.