beyondstorage / go-storage

A vendor-neutral storage library for Golang: Write once, run on every storage service.
https://beyondstorage.io
Apache License 2.0

Idea: Notification (CDC) Support #633

Closed Xuanwo closed 3 years ago

Xuanwo commented 3 years ago

Storage services may support sending notifications that let users track changes to their storage.

This feature is like CDC (Change Data Capture) for a DBMS.

We may need to:

Xuanwo commented 3 years ago

Maybe we can't finish this work in go-storage alone?

We need to build a whole CDC service:

Xuanwo commented 3 years ago

@xxchan is working on this idea.

xxchan commented 3 years ago

Here are some of my thoughts.

Why might we need to support this feature?

I think go-storage currently provides a unified interface to access storage services, and this feature is a first step toward supporting the configuration of more complex storage service features.

Supporting notification configuration is the first step (feature 1). We should consider how users use notifications and how to help them.

When might users need this feature?

Although we may design a feature that can be used without go-storage, I think we should start from go-storage.

I think a user is willing to use go-storage for:

In the first case, notification is probably not needed(?). In the second case, let's consider how the user would use notifications.

Notification data flow

An event notification may flow in different paths:

(image: diagram of notification data flow paths)

If the user uses Lambda, I guess they may be willing to stick with the vendor and won't need us(?). If they use a queue service, I'm not sure.

If they use go-storage and set the notification destination to a custom server (does this mean the feature has limited use cases?), then they will need to handle the vendor-specific notification format (e.g., oss event message, s3 event message), which defeats the purpose of being "vendor agnostic".

So we can define a unified storage event message format for users. We can provide a library to convert vendor event message formats into ours (feature 2). (This is analogous to https://github.com/xo/dburl, with which users can convert a unified connection string format into vendor ones.)

As @Xuanwo mentioned, we may support different notification receivers. The event "destination" (a customer-managed server that subscribes to notifications as an HTTP endpoint) may forward event messages further downstream, so we could provide a unified interface for publishing messages (like a unified MQ interface, or maybe more than that) (feature 3). We could even let the server simply forward messages as a dedicated relay (feature 4, using features 2 & 3).
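A minimal sketch of what feature 2 could look like for one vendor. The `Event` type and the `FromS3` function are hypothetical names, not part of go-storage; only the JSON field names (`Records`, `eventName`, `eventTime`, `s3.bucket.name`, `s3.object.key`, `s3.object.size`) come from the S3 event message format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical unified storage event; go-storage does not define it.
type Event struct {
	Service string    // source service, e.g. "s3"
	Type    string    // vendor event name, e.g. "ObjectCreated:Put"
	Bucket  string    // bucket that emitted the event
	Key     string    // object key as it appears in the message
	Size    int64     // object size in bytes, when reported
	Time    time.Time // time the service recorded the event
}

// s3Message mirrors only the S3 event message fields this sketch needs.
type s3Message struct {
	Records []struct {
		EventName string    `json:"eventName"`
		EventTime time.Time `json:"eventTime"`
		S3        struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key  string `json:"key"`
				Size int64  `json:"size"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// FromS3 converts a raw S3 event message into unified events.
// S3 URL-encodes object keys in event messages; decoding is omitted here.
func FromS3(raw []byte) ([]Event, error) {
	var msg s3Message
	if err := json.Unmarshal(raw, &msg); err != nil {
		return nil, fmt.Errorf("decode s3 event: %w", err)
	}
	events := make([]Event, 0, len(msg.Records))
	for _, r := range msg.Records {
		events = append(events, Event{
			Service: "s3",
			Type:    r.EventName,
			Bucket:  r.S3.Bucket.Name,
			Key:     r.S3.Object.Key,
			Size:    r.S3.Object.Size,
			Time:    r.EventTime,
		})
	}
	return events, nil
}

func main() {
	raw := []byte(`{"Records":[{"eventName":"ObjectCreated:Put",
		"eventTime":"2021-07-01T12:00:00.000Z",
		"s3":{"bucket":{"name":"demo"},"object":{"key":"a.txt","size":42}}}]}`)
	events, err := FromS3(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", events)
}
```

Each additional vendor would get its own converter producing the same `Event` type, so the receiving server only handles one shape.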

Summary

Now we have 4 possible features:

  1. Configure notification in go-storage
  2. a library to convert vendor event message formats into a unified one
  3. a unified interface for publishing messages (analogy to go-storage?)
  4. a message forwarding application combining 2 & 3.

I think features 1 & 2 are very reasonable. But I doubt the use cases of feature 4. Will users use a server just to forward messages without processing data? If so, it may also involve tricky things to consider, e.g., message delivery guarantees (retries, ordering, and deduplication). Finally, it seems that feature 3 (as a general feature, not only one serving feature 4) is beyond the scope of our organization.

Xuanwo commented 3 years ago

Nice thoughts! Let's resolve questions here.

In the first case, notification is probably not needed(?).

Take data migration and backup as examples: notification is needed to implement the incremental process. With notification support, we can implement incremental migration without listing all objects (which is very slow on huge buckets).
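A rough sketch of how a migration tool might turn a batch of received events into incremental work instead of listing the bucket. The `Change` type and `Plan` function are hypothetical, and the sketch assumes events arrive in order and use the `ObjectCreated` / `ObjectRemoved` name prefixes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Change is a hypothetical, simplified storage event.
type Change struct {
	Name string // vendor event name, e.g. "ObjectCreated:PutObject"
	Key  string // object key the event refers to
}

// Plan collapses events (assumed to be in arrival order) into the final
// action per key: copy it to the destination or delete it there.
func Plan(changes []Change) (toCopy, toDelete []string) {
	exists := map[string]bool{} // key -> whether the latest event left it present
	for _, c := range changes {
		switch {
		case strings.HasPrefix(c.Name, "ObjectCreated"):
			exists[c.Key] = true
		case strings.HasPrefix(c.Name, "ObjectRemoved"):
			exists[c.Key] = false
		}
	}
	for key, present := range exists {
		if present {
			toCopy = append(toCopy, key)
		} else {
			toDelete = append(toDelete, key)
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(toCopy)
	sort.Strings(toDelete)
	return toCopy, toDelete
}

func main() {
	toCopy, toDelete := Plan([]Change{
		{Name: "ObjectCreated:PutObject", Key: "a.txt"},
		{Name: "ObjectCreated:PutObject", Key: "b.txt"},
		{Name: "ObjectRemoved:DeleteObject", Key: "a.txt"},
	})
	fmt.Println("copy:", toCopy, "delete:", toDelete)
}
```

Only the keys mentioned in events are touched, so the cost scales with the number of changes rather than the size of the bucket.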

If the user uses Lambda, I guess they may be willing to stick with the vendor and won't need us(?).

We are focused on the storage layer itself, so the notification here is the native notification provided by storage services. That means:

Will users use a server just to forward messages without processing data?

Nice question.

Features 3 & 4 are indeed outside our community's scope. I included them here because, between features 1 and 2, we need a service to receive the events, and features 3 & 4 are extensions of this service.

The workflow looks like this:

I'm OK with dropping this service and features 3 & 4 from this proposal. We can discuss them later (maybe when dm plans to implement incremental data migration).

Xuanwo commented 3 years ago

ping dm's maintainer @Prnyself to take a look.

xxchan commented 3 years ago

Between features 1 and 2, we need a service to receive the events.

I think this is just an HTTP server, so it should be decided by users themselves?

Xuanwo commented 3 years ago

I think this is just an HTTP server, so it should be decided by users themselves?

You are right. Let's focus on our job and leave the service out of consideration.

Prnyself commented 3 years ago

Nice thoughts!

As a service user, especially for a Go-based application, being able to get a channel for notifications is necessary and fundamental.

What's more, webhooks or third-party message queues should also be supported in the future.

So, if we want to support different message services, the relationship is really similar to the one between go-storage and go-service-xxx.

But for now, I think we can first define the notification struct and find out what information we need to send in a notification. Maybe take badger's db.Subscribe as a reference?

Xuanwo commented 3 years ago

@xxchan Hi, what's the progress?

xxchan commented 3 years ago

find out what information we need to send in notification

@Prnyself, to make it clear, I think we are not going to support "sending notifications", since this is an internal feature of storage services. We just enable users to turn it on with go-storage, and we cannot decide "what information to send in notification".

We can decide "what information is commonly needed in received notification" and define a unified format.


@Xuanwo My current plan is:

  1. Support notification configuration in go-storage (Set receiver to the cloud notification service or an HTTP endpoint).
  2. Define a unified storage event message format (or simply a go struct) along with a library to convert vendor event message formats into it.

If this is okay, I will draft an RFC for 1 soon.

Xuanwo commented 3 years ago

@Xuanwo My current plan is:

1. Support notification configuration in go-storage (Set receiver to the cloud notification service or an HTTP endpoint).

2. Define a unified storage event message format (or simply a go struct) along with a library to convert vendor event message formats into it.

If this is okay, I will draft an RFC for 1 soon.

The plan looks good to me!

xxchan commented 3 years ago

Here's an (unverified) table of storage event types. We can see that they vary a lot:

  1. Different services have different advanced features, e.g., download for s3, metadata update for gcs, abort_multipart for qingstor. If we omit them, the only basic and common events are create & delete.
  2. Only half of the services support fine-grained event types.
  3. There's inconsistent behaviour: e.g., oss counts InitiateMultipartUpload & UploadPart as create events, while s3 and cos don't.

So I think storage events are highly service-specific, and thus it is hard to provide a comprehensive unified event format.

Services compared: oss, s3, cos, gcs, qingstor, azblob

Event types:

* ObjectCreated *
* ObjectCreated:PutObject
* ObjectCreated:PostObject
* ObjectCreated:CopyObject
* ObjectCreated:InitiateMultipartUpload
* ObjectCreated:UploadPart
* ObjectCreated:UploadPartCopy
* ObjectCreated:CompleteMultipartUpload
* ObjectCreated:AppendObject
* ObjectDownloaded ObjectDownloaded:GetObject
* ObjectRemoved *
* ObjectRemoved:DeleteObject
* ObjectRemoved:DeleteObjects
* version delete
* ObjectReplication *
* ObjectReplication:ObjectCreated
* ObjectReplication:ObjectRemoved
* ObjectReplication:ObjectModified
* OperationFailedReplication
* metadata update
* abort_multipart

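
One way to cope with this variety is to collapse vendor event names into the two common kinds (create and delete) and treat everything else as "other". The sketch below is an assumption about how that could be done, not an agreed design; `Kind` and `Classify` are hypothetical names, and treating the multipart steps as "other" is one possible answer to the oss inconsistency noted in point 3.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is a hypothetical coarse event category.
type Kind string

const (
	KindCreate Kind = "create"
	KindDelete Kind = "delete"
	KindOther  Kind = "other"
)

// Classify maps a vendor event name from the list above to a coarse Kind.
// Intermediate multipart steps are reported as KindOther because no
// complete object exists yet, regardless of how the service labels them.
func Classify(name string) Kind {
	switch {
	case name == "ObjectCreated:InitiateMultipartUpload",
		name == "ObjectCreated:UploadPart",
		name == "ObjectCreated:UploadPartCopy":
		return KindOther
	case strings.HasPrefix(name, "ObjectCreated"):
		return KindCreate
	case strings.HasPrefix(name, "ObjectRemoved"):
		return KindDelete
	default:
		return KindOther
	}
}

func main() {
	for _, name := range []string{
		"ObjectCreated:PutObject",
		"ObjectCreated:UploadPart",
		"ObjectRemoved:DeleteObjects",
		"ObjectReplication:ObjectCreated",
	} {
		fmt.Printf("%s -> %s\n", name, Classify(name))
	}
}
```

The cost of this approach is that service-specific events (download, replication, metadata update) lose their meaning, which is the limitation the comment describes.
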
xxchan commented 3 years ago

The APIs for configuring notifications are similar (but oss does not have this API!). The params are: bucket name and event (type, filter, id, arn). The trickiest part is the event type. It seems hard to define a global event type (like global pairs).
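For illustration, the shared parameters could be modelled roughly as follows. Every name here (`NotificationConfig`, `NotificationRule`, `Validate`) is hypothetical, the prefix/suffix filter is an assumption about what "filter" means, and the event names stay as plain service-specific strings precisely because no global event type exists.

```go
package main

import (
	"errors"
	"fmt"
)

// NotificationRule is a hypothetical description of one notification rule.
type NotificationRule struct {
	ID        string   // rule identifier
	Events    []string // service-specific event names, e.g. "ObjectCreated:PutObject"
	Prefix    string   // optional key prefix filter (assumed filter shape)
	Suffix    string   // optional key suffix filter (assumed filter shape)
	TargetARN string   // service-internal identifier of the receiver
}

// NotificationConfig groups the rules for one bucket.
type NotificationConfig struct {
	Bucket string
	Rules  []NotificationRule
}

// Validate checks only the fields that every rule needs.
func (c NotificationConfig) Validate() error {
	if c.Bucket == "" {
		return errors.New("bucket name is required")
	}
	for i, r := range c.Rules {
		if len(r.Events) == 0 {
			return fmt.Errorf("rule %d (%q): at least one event is required", i, r.ID)
		}
		if r.TargetARN == "" {
			return fmt.Errorf("rule %d (%q): target ARN is required", i, r.ID)
		}
	}
	return nil
}

func main() {
	cfg := NotificationConfig{
		Bucket: "demo",
		Rules: []NotificationRule{{
			ID:        "new-logs",
			Events:    []string{"ObjectCreated:PutObject"},
			Prefix:    "logs/",
			TargetARN: "arn:aws:sns:us-east-1:123456789012:demo",
		}},
	}
	fmt.Println("valid:", cfg.Validate() == nil)
}
```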

Xuanwo commented 3 years ago

Another thing I found out is that some services (like s3) only support sending events to internal services like Amazon SNS, Amazon SQS, or AWS Lambda.

xxchan commented 3 years ago

Another thing I found out is that some services (like s3) only support sending events to internal services like Amazon SNS, Amazon SQS, or AWS Lambda.

Actually, only qingstor supports an HTTP endpoint directly.

xxchan commented 3 years ago

And it seems that configuring notifications in the console, rather than through the API, is encouraged 🤔

Xuanwo commented 3 years ago

So we now have two difficulties.

xxchan commented 3 years ago

For the second problem, my previous idea was to use SNS as an intermediate hop and add an HTTP endpoint subscription to the SNS topic. If so, the user will also have to provide the SNS ARN in addition to the HTTP endpoint.
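As a sketch of what the receiving HTTP endpoint would have to do in that setup: SNS posts a JSON envelope whose `Type` field is `SubscriptionConfirmation` or `Notification`, with the storage event carried as a string in `Message`. The `Dispatch` and `handler` names are hypothetical, and signature verification and the actual call to `SubscribeURL` are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// snsEnvelope holds the SNS HTTP delivery fields this sketch uses.
type snsEnvelope struct {
	Type         string `json:"Type"`
	TopicArn     string `json:"TopicArn"`
	Message      string `json:"Message"`
	SubscribeURL string `json:"SubscribeURL"`
}

// Dispatch decodes one SNS delivery and reports what the receiver should do.
func Dispatch(body []byte) (string, error) {
	var env snsEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return "", fmt.Errorf("decode sns envelope: %w", err)
	}
	switch env.Type {
	case "SubscriptionConfirmation":
		// A real receiver would visit SubscribeURL to confirm the subscription.
		return "confirm: " + env.SubscribeURL, nil
	case "Notification":
		// Message carries the storage event (e.g. an S3 event message) as a string.
		return "event: " + env.Message, nil
	default:
		return "", fmt.Errorf("unsupported SNS message type %q", env.Type)
	}
}

// handler is the HTTP endpoint that would be subscribed to the topic.
func handler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	action, err := Dispatch(body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintln(w, action)
}

func main() {
	body := `{"Type":"Notification","TopicArn":"arn:aws:sns:us-east-1:123456789012:demo","Message":"{\"Records\":[]}"}`
	req := httptest.NewRequest(http.MethodPost, "/events", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handler(rec, req)
	fmt.Printf("%d %s", rec.Code, rec.Body.String())
}
```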

Xuanwo commented 3 years ago

If so, the user will have to also provide the SNS arn besides HTTP endpoint.

But isn't the SNS ARN also very different between services? Can we create SNS for the user?

xxchan commented 3 years ago

Can we create SNS for user?

Some quick results (whether each service has a CreateTopic API):

xxchan commented 3 years ago

But SNS arn is also very different between services?

Not sure. Example:

My previous concern was that if the user has to go to the console to create a topic, why wouldn't they just continue to configure the notification there? So "Can we create SNS for user?" is a problem.

Xuanwo commented 3 years ago

Let's discuss event types later; that part is a bit simpler.

My previous concern was that if the user will go to the console to create a topic, why doesn't he just continue to configure the notification there? So "Can we create SNS for user?" is a problem.

So there are two methods:

Xuanwo commented 3 years ago

Maybe related to #634

xxchan commented 3 years ago

Is implicitly creating a service acceptable to users? One concern is that it involves billing.

Xuanwo commented 3 years ago

So there are two methods:

* API that accepts the dst endpoint: that means we need to create an SNS service for the user if the service doesn't have native support.

* API that accepts a service-internal ARN (as a plain string): that means the user needs to create the SNS service themselves.

For method 1: I agree with your concern; it's not acceptable. For method 2: it looks pointless for users (why not configure it in the console directly?).

Maybe implementing the notification config API is out of our scope (and we don't have the ability to do it). Let's drop it.


Without notification API support, do you think it is still useful to implement a global event struct type?

xxchan commented 3 years ago

I think users can write this themselves in a few lines of code and won't go looking for a simple library to do it.

Xuanwo commented 3 years ago

Let's mark this idea as backlog and drop it for now. Thanks for your research!

Xuanwo commented 3 years ago

How about implementing CDC via scanning, like Rockset does? https://rockset.com/blog/change-data-capture-what-it-is-and-how-to-use-it/

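
A minimal sketch of scan-based CDC, under the assumption that each scan yields a map from object key to a content fingerprint such as an ETag. The `Snapshot`, `Change`, and `Diff` names are hypothetical. Comparing two consecutive scans yields created, modified, and removed keys without any service-side notification support.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps object key -> content fingerprint (e.g. ETag) from one scan.
type Snapshot map[string]string

// Change is a hypothetical change record derived from two scans.
type Change struct {
	Kind string // "created", "modified", or "removed"
	Key  string
}

// Diff reports what changed between the previous and the current scan.
func Diff(prev, curr Snapshot) []Change {
	var out []Change
	for key, tag := range curr {
		old, seen := prev[key]
		switch {
		case !seen:
			out = append(out, Change{Kind: "created", Key: key})
		case old != tag:
			out = append(out, Change{Kind: "modified", Key: key})
		}
	}
	for key := range prev {
		if _, still := curr[key]; !still {
			out = append(out, Change{Kind: "removed", Key: key})
		}
	}
	// Sort by key for deterministic output; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	prev := Snapshot{"a": "1", "b": "1", "c": "1"}
	curr := Snapshot{"a": "1", "b": "2", "d": "1"}
	for _, c := range Diff(prev, curr) {
		fmt.Println(c.Kind, c.Key)
	}
}
```

The trade-off is that each scan still has to list the whole bucket, and changes that happen and revert between two scans are not observed.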