datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

Add pubsub as an option for server #58

Closed · adaminsta closed this 8 months ago

adaminsta commented 8 months ago

We are consuming data from a Pub/Sub topic and loading it into Snowflake, but there is no pubsub option for servers. We need something like this:

source:
  type: pubsub
  project: gcp_project_name
  name: my_topic
endpoint:
  type: snowflake
  account: xxxxxxxx
  database: my_db
  schema: my_schema
jochenchrist commented 8 months ago

@adaminsta thanks for the suggestion. I think this makes sense, as an equivalent to kafka.

jochenchrist commented 8 months ago

@adaminsta have a look here: https://github.com/datacontract/datacontract-specification/pull/33/files

adaminsta commented 8 months ago

Looks great!

jochenchrist commented 8 months ago

Closed in https://github.com/datacontract/datacontract-specification/pull/33

jochenchrist commented 8 months ago

Do you have a need for testing the messages in Pub/Sub? Do you have any suggestions for which engines could help here?

adaminsta commented 8 months ago

We are planning to consume messages from Pub/Sub with Dataflow and then merge the data into Snowflake tables. My initial idea was to validate the format of the input in the Dataflow job by comparing it with the contract YAML.
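A minimal sketch of that validation idea, assuming the usual models/fields layout of a datacontract.yaml; the model name ("orders"), the simplified type map, and JSON-encoded messages are assumptions for illustration:

```python
import json

import yaml

# Simplified mapping from data contract field types to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_message(message_bytes: bytes,
                     contract_path: str = "datacontract.yaml",
                     model: str = "orders") -> list[str]:
    """Return a list of violations; an empty list means the message conforms."""
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    fields = contract["models"][model]["fields"]
    record = json.loads(message_bytes)

    errors = []
    for name, spec in fields.items():
        if spec.get("required") and name not in record:
            errors.append(f"missing required field: {name}")
        elif name in record:
            expected = TYPE_MAP.get(spec.get("type"))
            if expected and not isinstance(record[name], expected):
                errors.append(
                    f"field {name}: expected {spec['type']}, "
                    f"got {type(record[name]).__name__}"
                )
    return errors
```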

An even more powerful solution would be if we could generate a Pub/Sub schema (Pub/Sub only supports Avro and Protocol Buffers) from the contract YAML, which I could then attach to the Pub/Sub topic (https://cloud.google.com/pubsub/docs/schemas).
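For the attachment step, the google-cloud-pubsub client can register an Avro schema and create a topic that enforces it. A sketch, assuming an Avro schema file (orders.avsc, a placeholder) has already been derived from the contract, and reusing the project and topic names from the example above:

```python
from google.cloud.pubsub import PublisherClient, SchemaServiceClient
from google.pubsub_v1.types import Encoding, Schema

project_id = "gcp_project_name"
schema_id = "orders-schema"  # placeholder
topic_id = "my_topic"

# Register the Avro schema with Pub/Sub.
schema_client = SchemaServiceClient()
schema_path = schema_client.schema_path(project_id, schema_id)
with open("orders.avsc") as f:
    avsc_source = f.read()

schema = schema_client.create_schema(
    request={
        "parent": f"projects/{project_id}",
        "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avsc_source),
        "schema_id": schema_id,
    }
)

# Create the topic with the schema attached, so Pub/Sub rejects
# messages that do not conform to it.
publisher = PublisherClient()
topic = publisher.create_topic(
    request={
        "name": publisher.topic_path(project_id, topic_id),
        "schema_settings": {"schema": schema.name, "encoding": Encoding.JSON},
    }
)
print(f"Created topic {topic.name} with schema {schema.name}")
```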

simonharrer commented 8 months ago

We're planning such export functionality; see #56 and #57. Feel free to provide some examples (data contract -> Avro schema, data contract -> Protobuf) there to help us drive the implementation.
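As a rough illustration of the kind of example being asked for here (data contract -> Avro schema), a naive mapping might look like the sketch below; the type table is deliberately simplified and the "orders" model name is a placeholder:

```python
import json

import yaml

# Very rough mapping from data contract logical types to Avro primitive types.
AVRO_TYPES = {"string": "string", "integer": "int", "number": "double", "boolean": "boolean"}


def model_to_avro(contract_path: str, model: str) -> str:
    """Build an Avro record schema (as JSON text) from one model in the contract."""
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    fields = contract["models"][model]["fields"]
    avro_schema = {
        "type": "record",
        "name": model,
        "fields": [
            {
                "name": name,
                # Optional contract fields become nullable unions in Avro.
                "type": AVRO_TYPES.get(spec.get("type"), "string")
                if spec.get("required")
                else ["null", AVRO_TYPES.get(spec.get("type"), "string")],
            }
            for name, spec in fields.items()
        ],
    }
    return json.dumps(avro_schema, indent=2)


# Example: write the Avro schema for the "orders" model to orders.avsc.
with open("orders.avsc", "w") as f:
    f.write(model_to_avro("datacontract.yaml", "orders"))
```

A real exporter would also have to handle nested fields, logical types such as timestamps and decimals, and field descriptions, presumably the kind of detail the linked issues are meant to flesh out.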