fablabbcn / smartcitizen-api

The Smart Citizen Engine
https://developer.smartcitizen.me
GNU Affero General Public License v3.0

Proposal: Migrating to Apache Kafka for Enhanced Data Processing and Reliability #265

Open pral2a opened 12 months ago

pral2a commented 12 months ago

Issue Description

The existing data ingestion architecture, which relies on MQTT for IoT communication, KairosDB for time-series storage, and a set of Rails functions orchestrated with Sidekiq and Redis, presents certain challenges that we should address. I propose evaluating a migration to a unified event streaming platform, Apache Kafka, leveraging tools such as karafka or Racecar on the Rails side.
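As a very rough sketch of what the consuming side could look like with karafka (the topic, consumer, and service names below are hypothetical, not our actual ones):

```ruby
# app/consumers/readings_consumer.rb
# Hypothetical sketch of a karafka 2.x consumer that could replace the
# current Sidekiq-based ingestion job. Topic and service names are made up.
class ReadingsConsumer < Karafka::BaseConsumer
  def consume
    # karafka delivers messages in batches; with the default deserializer
    # each payload is already parsed from JSON.
    messages.each do |message|
      reading = message.payload
      # Equivalent of today's ingestion logic: validate the reading and
      # write it to the time-series store.
      StoreReading.call(reading) # hypothetical service object
    end
  end
end
```

By default karafka marks the batch as consumed after `consume` completes, so a crashed worker simply resumes from its last committed offset rather than losing work.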

Design Goals

Improve Data Durability and Availability

Transitioning to Apache Kafka would enhance data durability and availability: Kafka persists every message to a replicated, distributed log, so data remains available even when individual brokers fail.
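As a sketch of what that buys us in configuration terms (assuming karafka 2.x and its declarative topics feature; broker addresses, topic names, and numbers are illustrative):

```ruby
# karafka.rb
# Hypothetical durability-oriented setup, assuming karafka 2.x declarative
# topics. Broker addresses, topic names, and numbers are illustrative.
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = {
      'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092',
      # Producers wait for all in-sync replicas to acknowledge each write
      # (-1 means "all" in librdkafka).
      'request.required.acks': -1
    }
  end

  routes.draw do
    topic :device_readings do
      # Each partition replicated to 3 brokers, with at least 2 in sync:
      # a single broker can fail without any acknowledged data being lost.
      config(
        partitions: 6,
        replication_factor: 3,
        'min.insync.replicas': 2
      )
      consumer ReadingsConsumer
    end
  end
end
```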

Make Internal Data Flows More Explicit

The proposed architecture would make data flows within the system more explicit and easier to understand: with Kafka's topic-based model, each stage of the pipeline consumes from and publishes to named topics, so data segregation and routing are visible in configuration rather than scattered through job code.
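Concretely, the whole pipeline could be declared in one place (a hypothetical routing sketch; topic and consumer names are made up):

```ruby
# karafka.rb (routing excerpt; this block lives inside the KarafkaApp class).
# Topic and consumer names are made up. Each stage reads from and writes to
# named topics, so the data flow is visible here rather than spread across
# Sidekiq jobs.
routes.draw do
  # Raw MQTT payloads, bridged into Kafka.
  topic :raw_device_readings do
    consumer RawReadingsConsumer # validates, re-publishes to :device_readings
  end

  # Validated readings, ready for storage.
  topic :device_readings do
    consumer StoreReadingsConsumer # writes to the time-series store
  end

  # A separate consumer group can read the same validated topic at its own
  # pace, e.g. to forward data to third parties, without touching storage.
  consumer_group :forwarding do
    topic :device_readings do
      consumer ForwardingConsumer
    end
  end
end
```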

Reduce Code Base Complexity

Consolidating on Kafka could replace the current mix of MQTT handling, Sidekiq jobs, and Redis queues with a single streaming abstraction, which should reduce the codebase's complexity and make maintenance and troubleshooting easier.

Address Potential Issues with the Current MQTT Handler Gem

The current MQTT handler gem may have limitations or issues that we could sidestep by bridging MQTT into Kafka directly, which would give us more flexibility and robustness in handling MQTT data.
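To give a feel for the scale of such a bridge, a minimal standalone version (a sketch using the mqtt and rdkafka gems; broker addresses and topic names are made up) fits in a few lines:

```ruby
#!/usr/bin/env ruby
# Minimal MQTT-to-Kafka bridge sketch using the mqtt and rdkafka gems.
# Broker addresses and topic names are illustrative.
require 'mqtt'
require 'rdkafka'

producer = Rdkafka::Config.new('bootstrap.servers' => 'localhost:9092').producer

MQTT::Client.connect('mqtt://localhost:1883') do |client|
  # '+' is the MQTT single-level wildcard; this subscribes to readings
  # from every device. The block loops forever, one message at a time.
  client.get('device/+/readings') do |topic, message|
    # Keying by the MQTT topic keeps all readings from one device on the
    # same Kafka partition, preserving per-device ordering. Delivery is
    # asynchronous; a real bridge would also handle delivery reports and
    # flush the producer on shutdown.
    producer.produce(topic: 'raw_device_readings', key: topic, payload: message)
  end
end
```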

Additional Benefits

Integration with Apache Spark

The proposed Kafka-based architecture can integrate cleanly with Apache Spark, opening up the possibility of using Spark Streaming to process data in real time. It might also let us bring post-processing tasks, which currently consume the standard public API, closer to the platform, while keeping them as independent Python definitions within the smartcitizen-data framework.

Next steps

timcowlishaw commented 11 months ago

I can't speak to whether Kafka specifically is the right tool for this, but if this presents an opportunity to simplify the MQTT / data-ingest part of the Rails app, I would very much welcome it - it's one of the areas with the most complexity and the least test coverage at the moment, so making changes there in its current state is very risky (see #263).

Moving the data pipeline out of Rails as far as possible (and using Rails only as an API frontend over the datastore) would, I think, be the ideal. Something like Spark might well be very useful for replacing portions of the current data-ingest pipeline in that case.

timcowlishaw commented 11 months ago

I'm going to start having a look at the practicalities of this - Kafka Streams might be even better than Spark, and there's an MQTT-to-Kafka bridge available which could potentially simplify things considerably.

I'm going to spend some time next week working out what would need to be replaced or rewritten in this system, and what the interfaces and boundaries with our current architecture would be, which should give us a better idea of the costs and benefits.

timcowlishaw commented 9 months ago

@oscgonfer and I had a chat this morning with Rune and Robert from NILU about how they use Kafka as part of the CitiObs project, which was quite illuminating.

A few useful points:

1) They use Kafka to broker and queue incoming data, which is then consumed by Apache Spark jobs for validation and for ingestion into an HBase database. This is similar to our proposed use case (but also covers the role that MQTT currently plays for us, since they receive data via a REST API rather than a messaging protocol).

2) They use the Confluent Platform on their own hardware (i.e. not Confluent Cloud) under the community licence. They had the same reservations at first as I had about leaning too hard into a proprietary platform, but so far it has not caused any practical issues, and Confluent provides useful features (such as validation of incoming data against a schema at the point of ingest). This is especially relevant for us, as Confluent maintain an MQTT proxy for Kafka which seems difficult to use outside their platform.

3) Brokering messages to different consumers who consume at different rates is not a problem, but it requires care - all of this is configurable by the developer in Kafka.

4) They enrich incoming messages with schema data from their Postgres database at the point of ingest, before they hit the broker, which might be a neat way of getting around some of the tricky parts of providing data to third-party consumers that Oscar and I were discussing before Christmas (a rough sketch of what that could look like for us is below).
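On point 4, something similar for our stack might look like this (a hypothetical sketch assuming karafka's bundled WaterDrop producer and our existing ActiveRecord models; topic and field names are made up):

```ruby
# Hypothetical enrichment-at-ingest sketch: attach sensor metadata from
# Postgres to a raw reading before it reaches the broker. Assumes
# Karafka.producer (the bundled WaterDrop producer) and our existing
# Device model; topic and field names are made up.
class EnrichedReadingPublisher
  def self.publish(device_token, raw_payload)
    device = Device.find_by!(device_token: device_token)

    enriched = raw_payload.merge(
      'device_id' => device.id,
      # Embedding the sensor schema means downstream consumers don't have
      # to call back into our API to interpret the reading.
      'sensors' => device.sensors.map do |sensor|
        { 'id' => sensor.id, 'name' => sensor.name, 'unit' => sensor.unit }
      end
    )

    Karafka.producer.produce_async(
      topic: 'enriched_device_readings',
      key: device.id.to_s,
      payload: enriched.to_json
    )
  end
end
```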

I'll add more here if anything occurs to me! I think a useful next step, when we have some more time, would be to do a little spike similar to my previous one, using the Confluent stack, to see whether that offers any advantages.

oscgonfer commented 6 months ago

Just dropping a quick message here. I saw a potential alternative for this: https://nats.io/ @timcowlishaw