pral2a opened this issue 1 year ago
I can't speak to whether Kafka specifically is the right tool for this, but if this presented an opportunity to simplify the MQTT / Data ingest part of the rails app I would very much welcome it - it's one of the areas with the most complexity and the least test coverage at the moment, so making changes there in its current state is very risky (see #263 )
Moving the data pipeline out of Rails as far as possible (and using it only as an API frontend over the datastore) would, I think, be ideal. Something like Spark might well be very useful to replace portions of the current data ingest pipeline in that case
I'm going to start having a look at the practicalities of this - Kafka Streams might be even better than Spark, and there's an MQTT-to-Kafka bridge available which could potentially simplify things considerably.
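As a way of thinking about what such a bridge would do, here's a minimal sketch of the topic-mapping step: taking an incoming MQTT topic and deciding which Kafka topic and partition key the message should go to. The MQTT topic layout (`device/sck/<token>/readings`) and the Kafka topic name `raw-readings` are assumptions for illustration, not the platform's confirmed scheme.

```python
def mqtt_to_kafka(mqtt_topic: str) -> tuple[str, str]:
    """Map an MQTT topic to a (kafka_topic, message_key) pair.

    The 'device/sck/<token>/readings' layout is a hypothetical example
    of the device-facing MQTT topic structure.
    """
    parts = mqtt_topic.split("/")
    if len(parts) != 4 or parts[0] != "device" or parts[3] != "readings":
        raise ValueError(f"unexpected MQTT topic: {mqtt_topic}")
    device_token = parts[2]
    # One Kafka topic for all raw readings, keyed by device token so that
    # every message from a given device lands in the same partition and
    # per-device ordering is preserved.
    return ("raw-readings", device_token)
```

Keying by device token is the design choice that matters here: Kafka only guarantees ordering within a partition, so anything that must see a device's readings in order needs them on one partition.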
I'm gonna spend some time next week working out what would need to be replaced / rewritten in this system, and what the interfaces and boundaries would be with our current architecture, which should give us a better idea of costs and benefits.
@oscgonfer and I had a chat this morning with Rune and Robert from NILU about how they use Kafka as part of the CitiObs project which was quite illuminating.
A few useful points:
1) They use Kafka to broker and queue incoming data, which is then consumed by Apache Spark jobs for validation and for ingestion into an HBase database. This is similar to our proposed use case (but also covers the role MQTT currently plays for us, as they receive data via a REST API rather than a messaging protocol).
2) They use the Confluent Platform on their own hardware (i.e. not Confluent Cloud) under the community licence. They had the same reservations at first as I had about leaning too hard into a proprietary platform, but so far it has not caused any practical issues, and Confluent provides useful features (such as validation of incoming data against a schema at the point of ingest). This is especially relevant for us, as Confluent maintain an MQTT proxy for Kafka which seems difficult to use outside their platform.
3) Brokering messages to different consumers who consume at different rates is not a problem, but requires care - all of this is configurable by the developer in Kafka.
4) They enrich incoming messages with schema data from their Postgres database at the point of ingest, before the messages hit the broker - which might be a neat way of getting around some of the tricky parts of providing data to third-party consumers that Oscar and I were discussing before Christmas.
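To make point 4 concrete, here's a rough sketch of what enrichment-at-ingest could look like: attaching device and sensor metadata to a raw reading before it is produced to the broker, so downstream consumers never need to call back into the Rails app. The field names and the in-memory `DEVICE_METADATA` dict (standing in for a Postgres lookup) are illustrative assumptions, not our actual schema.

```python
# Hypothetical device metadata, standing in for a lookup against the
# Rails app's Postgres database.
DEVICE_METADATA = {
    "abc123": {
        "device_id": 42,
        # Maps on-device sensor IDs to named, unit-qualified fields.
        "sensor_map": {"1": "temperature_c", "2": "humidity_pct"},
    },
}

def enrich(message: dict) -> dict:
    """Attach device and sensor metadata to a raw reading so the
    enriched message is self-contained for downstream consumers."""
    meta = DEVICE_METADATA.get(message["device_token"])
    if meta is None:
        raise KeyError(f"unknown device token: {message['device_token']}")
    readings = {
        meta["sensor_map"].get(sensor_id, f"unknown_{sensor_id}"): value
        for sensor_id, value in message["readings"].items()
    }
    return {
        "device_id": meta["device_id"],
        "recorded_at": message["recorded_at"],
        "readings": readings,
    }
```

The trade-off, as NILU noted, is that enrichment has to happen before the broker, so the ingest service needs (read) access to the metadata store.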
I'll add more here if anything occurs to me! I think a useful next step when we have some more time would be to look into doing a little spike similar to my previous one using the Confluent stack to see whether that offers any advantages.
Just dropping a quick message here. I saw a potential alternative for this: https://nats.io/ @timcowlishaw
Issue Description
The existing data ingestion architecture, which relies on MQTT for IoT communication, KairosDB for time-series storage, and a set of Rails functions orchestrated with Sidekiq and Redis, presents certain challenges that we should address. I propose evaluating a migration to a unified event streaming platform, Apache Kafka, leveraging Rails integrations such as Karafka or Racecar, among others.
Design Goals
Improve Data Durability and Availability
By transitioning to Apache Kafka, we can enhance data durability and availability. Kafka persists and replicates messages across brokers, so data remains available even if individual nodes fail.
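As a rough illustration of what these durability guarantees look like in practice, the fragment below mixes producer-side and topic-level settings (they would live in separate places in a real deployment); the specific values are assumptions, not a recommendation:

```properties
# Producer side: a write only succeeds once all in-sync replicas have it,
# and idempotence prevents duplicates on retry.
acks=all
enable.idempotence=true

# Topic side (e.g. for a raw-readings topic, created with a replication
# factor of 3): require at least two in-sync replicas for a write to be
# acknowledged, and retain messages for 30 days so consumers can be
# replayed after a failure or a reprocessing job.
min.insync.replicas=2
retention.ms=2592000000
```

Together, `acks=all` and `min.insync.replicas=2` mean an acknowledged write survives the loss of any single broker.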
Make Internal Data Flows More Explicit
The proposed architecture will make data flows within the system more explicit and easier to understand. Kafka's topic-based approach allows for clear data segregation and routing.
Reduce Code Base Complexity
Consolidating the pipeline on Kafka could reduce the codebase's complexity, which in turn would make maintenance and troubleshooting easier.
Address Potential Issues with the current MQTT Handler Gem
The current MQTT handler gem may have limitations or issues that we can overcome by connecting MQTT to Kafka directly. This might provide more flexibility and robustness in handling MQTT data.
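One concrete piece of that robustness is validating payloads at the point of ingest, before they reach the broker - the role a schema registry plays in the Confluent setup described above. Here's a minimal hand-rolled sketch of that idea; the field names are assumptions rather than the platform's actual payload schema:

```python
def validate_reading(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the
    payload is acceptable. Stands in for schema-registry validation."""
    errors = []
    if not isinstance(payload.get("device_token"), str):
        errors.append("device_token must be a string")
    readings = payload.get("readings")
    if not isinstance(readings, dict) or not readings:
        errors.append("readings must be a non-empty object")
    else:
        for sensor_id, value in readings.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"reading {sensor_id!r} must be numeric")
    return errors
```

In a Kafka pipeline, messages that fail validation would typically be routed to a dead-letter topic rather than dropped, so they can be inspected and replayed.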
Additional Benefits
Integration with Apache Spark
The proposed Kafka-based architecture can integrate seamlessly with Apache Spark, opening up the possibility of using Spark Streaming to process data in real time. It might also help bring postprocessing tasks that currently go through the standard public API closer to the platform, while keeping them as independent Python definitions within the smartcitizen-data framework.
Next steps