dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source

google cloud pubsub verified source #489

Open rudolfix opened 3 weeks ago

rudolfix commented 3 weeks ago

Quick source info

Implement a dlt resource that retrieves messages from Google Cloud Pub/Sub.

The challenge here is that dlt processes messages (data) in batches, going through extract (getting messages from Pub/Sub), normalize (creating a relational schema) and load (e.g. to BigQuery). Pub/Sub is a simple message broker where each message must be acknowledged within a certain deadline. That makes exactly-once delivery - where no messages are lost but there are also no duplicates - quite hard.

Note that our Kinesis and Kafka sources use offsets for robust exactly-once delivery. The offsets from the previous extract are committed from dlt state just before the next extract; that state is stored together with the data, so when we move the offset we have a guarantee that the previous package is fully stored in the destination. This behavior may be hard to replicate with a message-ack mechanism.
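
For reference, the offset pattern those sources rely on looks roughly like this (a simplified sketch, not the actual source code; `read_from_broker` is a made-up placeholder):

```python
import dlt


def read_from_broker(since: int):
    # placeholder for the real broker read; returns (records, new_offset)
    records = [{"offset": i, "value": f"msg-{i}"} for i in range(since, since + 10)]
    return records, since + 10


@dlt.resource(name="events", write_disposition="append")
def offset_based_reader():
    # resource state is stored in the destination together with the data, so the
    # committed offset only advances once the previous package is fully loaded
    state = dlt.current.resource_state()
    last_offset = state.get("offset", 0)

    records, new_offset = read_from_broker(since=last_offset)
    yield from records

    # written with the next load package; a failed load does not move the offset
    state["offset"] = new_offset
```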

Nevertheless, we need a good idea of how to achieve this for Pub/Sub. Such simple brokers are quite popular and we should give our users the best possible way to read from them.

Current Status

What source does/will do

The typical use case is a stream of telemetry messages with JSON payloads that are delivered to a Pub/Sub subscription and then pulled from a Cloud Function, normalized and stored in a relational database.

Typically, messages are dispatched to different tables depending on the message type (each type has a different schema) and data contracts (e.g. Pydantic models) are used to filter out "bad data".
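
As a rough illustration of that pattern (the `event_type` field, the `GpsPing` model and the table names are assumptions, not part of this issue), dlt can dispatch items to per-type tables with `dlt.mark.with_table_name`, and a Pydantic model passed as the `columns` hint acts as the data contract:

```python
import dlt
from pydantic import BaseModel


class GpsPing(BaseModel):
    # hypothetical contract for a single message type; rows that do not match
    # the declared data types are discarded by the schema contract below
    device_id: str
    lat: float
    lon: float


@dlt.resource(name="gps_pings", columns=GpsPing, schema_contract={"data_type": "discard_row"})
def gps_pings(decoded_messages):
    # single-type feed with a Pydantic data contract
    yield from decoded_messages


@dlt.resource
def telemetry(decoded_messages):
    # multi-type feed: each message lands in a table named after its type,
    # so every type can evolve its own schema
    for payload in decoded_messages:
        yield dlt.mark.with_table_name(payload, payload["event_type"])
```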

Typical message feeds:

Test account / test data

Additional context

Implementation Requirements

  1. Must be able to work inside and outside of Google Cloud (authentication via GcpCredentials, e.g. as in the Google Ads / Google Sheets / BigQuery sources and destinations)
  2. Suitable for running on a Cloud Function, where the max run time and local storage are limited, so a max consume time and a max message count must be observed
  3. A "no more messages" situation must be detected - we do not want to wait for the max consume time
  4. Implement a mechanism as close to exactly-once delivery as possible by manipulating ack delivery deadlines and the time spent consuming vs. the total time; duplicated messages are better than lost ones
  5. Pub/Sub metadata should be attached to the message payload data, as we do for Kinesis / Kafka (the user can always remove it via a dlt transform)
  6. Look at the Kinesis/Kafka and Scrapy sources. We may need helper / wrapper functions to run the dlt pipeline in a way that acks the messages (see the sketch after this list)
  7. As for any other source: demo pipelines and documentation must be provided
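
A minimal sketch of how requirements 1-6 could fit together, assuming synchronous pull via google-cloud-pubsub (all names, parameters, limits and defaults below are assumptions, not a prescribed design): pull until the subscription is drained or a limit is hit, attach Pub/Sub metadata to each payload, and acknowledge only after `pipeline.run()` returns, so duplicates are possible but messages are not lost:

```python
import json
import time
from typing import List, Optional

import dlt
from dlt.common.configuration.specs import GcpServiceAccountCredentials
from google.api_core import exceptions
from google.cloud import pubsub_v1


@dlt.resource(name="pubsub_messages", write_disposition="append")
def pubsub_messages(
    subscription: str = dlt.config.value,
    credentials: GcpServiceAccountCredentials = dlt.secrets.value,
    max_messages: int = 1000,
    max_consume_seconds: float = 60.0,
    ack_ids: Optional[List[str]] = None,
):
    # works inside and outside Google Cloud: credentials are resolved by dlt
    subscriber = pubsub_v1.SubscriberClient(
        credentials=credentials.to_native_credentials()
    )
    started = time.monotonic()
    consumed = 0
    while consumed < max_messages and time.monotonic() - started < max_consume_seconds:
        try:
            response = subscriber.pull(
                request={"subscription": subscription, "max_messages": 100},
                timeout=10.0,
            )
        except exceptions.DeadlineExceeded:
            break  # nothing arrived within the pull timeout
        if not response.received_messages:
            break  # "no more messages" - do not wait for the max consume time
        for received in response.received_messages:
            msg = received.message
            payload = json.loads(msg.data)
            # attach Pub/Sub metadata, like the Kinesis / Kafka sources do
            payload["_pubsub"] = {
                "message_id": msg.message_id,
                "publish_time": msg.publish_time,
                "attributes": dict(msg.attributes),
            }
            if ack_ids is not None:
                # only collect ack ids here; they are acknowledged after the load
                ack_ids.append(received.ack_id)
            consumed += 1
            yield payload


def run_and_ack(
    pipeline: dlt.Pipeline,
    subscription: str,
    credentials: GcpServiceAccountCredentials,
) -> None:
    # hypothetical wrapper: ack only after the load package is stored, so a failed
    # run redelivers the messages (duplicates are preferred over losses); the
    # subscription's ack deadline must cover normalize + load, or be extended
    # with modify_ack_deadline while the pipeline runs
    ack_ids: List[str] = []
    pipeline.run(
        pubsub_messages(
            subscription=subscription, credentials=credentials, ack_ids=ack_ids
        )
    )
    if ack_ids:
        subscriber = pubsub_v1.SubscriberClient(
            credentials=credentials.to_native_credentials()
        )
        # acknowledge in chunks to stay within per-request limits
        for i in range(0, len(ack_ids), 1000):
            subscriber.acknowledge(
                request={"subscription": subscription, "ack_ids": ack_ids[i : i + 1000]}
            )
```

This is only one possible split; the ack logic could equally live in a source helper as mentioned in point 6.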

Test Requirements

  1. Tests should create a unique subscription and test messages, and drop them all at the end (see Kafka / Kinesis and the sketch below)
  2. Mind that many test runs may happen at the same time on CI, so tests must be isolated
  3. The local Pub/Sub emulator and the real service are both options
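
A possible pytest fixture along those lines (the project id, resource names and payloads are placeholders): create a uniquely named topic and subscription per test run so concurrent CI jobs stay isolated, and delete both at the end:

```python
import json
import os
import uuid

import pytest
from google.cloud import pubsub_v1

# assumed to be provided by the CI environment
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "dlt-ci-project")


@pytest.fixture
def pubsub_subscription():
    suffix = uuid.uuid4().hex[:8]  # unique names keep parallel CI runs isolated
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(PROJECT_ID, f"dlt_test_topic_{suffix}")
    subscription_path = subscriber.subscription_path(PROJECT_ID, f"dlt_test_sub_{suffix}")

    publisher.create_topic(request={"name": topic_path})
    subscriber.create_subscription(
        request={"name": subscription_path, "topic": topic_path}
    )
    try:
        # publish a few test messages
        for i in range(3):
            publisher.publish(topic_path, json.dumps({"n": i}).encode()).result()
        yield subscription_path
    finally:
        subscriber.delete_subscription(request={"subscription": subscription_path})
        publisher.delete_topic(request={"topic": topic_path})
```

With PUBSUB_EMULATOR_HOST set, the same fixture runs against the local emulator instead of the real service.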

We have community-provided consumers for Pub/Sub which may be a starting point for this implementation.