HSLdevcom / transitlog

Explore observed public transport and compare with the intended traffic
https://reittiloki.hsl.fi/
Creative Commons Attribution 4.0 International

Write HFP deduplication #5

Open haphut opened 5 years ago

haphut commented 5 years ago

The HFP stream contains duplicate messages because MQTT QoS 1 only guarantees at-least-once delivery. If we run multiple instances of pulsar-mqtt-source, we also need to deduplicate across those streams. Keep the first copy of each unique message and drop the later ones.
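To make "keep the first copy" concrete, here is a minimal sketch in plain Python. It assumes that two messages are duplicates when their topic and payload bytes are identical, and that duplicates arrive close enough together that a bounded window of recent digests catches them. The class name, the `max_entries` parameter and the sample topics and payloads are hypothetical, not taken from the HFP feed.

```python
import hashlib
from collections import OrderedDict


class FirstCopyDeduplicator:
    """Keep the first copy of each unique message, drop later copies."""

    def __init__(self, max_entries=100_000):
        # Assumption: remembering the most recent max_entries digests is
        # enough to cover the delay between a message and its duplicates.
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def is_first_copy(self, topic: str, payload: bytes) -> bool:
        # Assumption: duplicates are byte-identical in topic and payload.
        digest = hashlib.sha256(
            topic.encode("utf-8") + b"\x00" + payload
        ).digest()
        if digest in self._seen:
            return False
        self._seen[digest] = None
        if len(self._seen) > self.max_entries:
            # Evict the oldest digest to keep memory bounded.
            self._seen.popitem(last=False)
        return True


# Hypothetical sample stream of (topic, payload) pairs.
stream = [
    ("a", b"1"),
    ("a", b"1"),  # duplicate, dropped
    ("a", b"2"),
    ("b", b"1"),
    ("a", b"1"),  # duplicate, dropped
]
dedup = FirstCopyDeduplicator(max_entries=3)
kept = [msg for msg in stream if dedup.is_first_copy(*msg)]
```

The window size is the main trade-off: too small and late duplicates slip through, too large and memory use grows.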

Implement using Pulsar Functions.
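A rough sketch of what such a function could look like with the Pulsar Functions Python SDK, to get the discussion going. It assumes duplicates have identical payloads, keeps the seen set in the memory of a single function instance (so it is unbounded, lost on restart and not shared between parallel instances), and relies on my understanding that returning `None` from `process` publishes nothing to the output topic. The class name `HfpDedupFunction` is hypothetical.

```python
import hashlib

try:
    from pulsar import Function  # Pulsar Functions Python SDK
except ImportError:
    # Fallback so the sketch can be tried without the pulsar-client package.
    class Function:
        pass


class HfpDedupFunction(Function):
    def __init__(self):
        # Assumption: in-memory state of one instance is good enough
        # for a first experiment.
        self.seen = set()

    def process(self, input, context):
        # The input type depends on the configured SerDe, so accept
        # both bytes and str here.
        data = input if isinstance(input, bytes) else str(input).encode("utf-8")
        digest = hashlib.sha256(data).digest()
        if digest in self.seen:
            return None  # later copy: emit nothing
        self.seen.add(digest)
        return input  # first copy: forward unchanged
```

For a real deployment the state would need to be bounded and shared, for example through the function's state store or by running a single instance per input topic.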

Run on Docker host / Docker Swarm using Pulsar Admin API or CLI.

Ask Pulsar devs whether they are interested in: 1) Streaming compaction of topics instead of cron jobs. 2) Compaction by retaining only the first instances of unique messages, not the last.

paasovaara commented 5 years ago

Latest thoughts:

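Pulsar log compaction won't work because it keeps the last message for each key, not the first, and it requires manual invocation. Pulsar's built-in deduplication won't work either, since it only makes sure that the same message isn't published twice, and it uses MessageIds for that, so duplicates that arrive from MQTT as separate messages are not detected.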