HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Native cloud notifications #1176

Open dshiga opened 6 years ago

dshiga commented 6 years ago

Currently, the only option for receiving notifications from the data-store is via webhook callbacks. While that has some advantages, it would be very helpful if subscribers had the option to instead receive notifications via native cloud messaging services like Google Cloud Pub/Sub or Amazon SQS.

This could be an attractive option for users because:

  1. Subscribers wouldn't have to deal with big spikes in notification rates by scaling up web services or lambdas, or by setting up and maintaining a queue.
  2. It would lower the barrier to entry for consuming blue box notifications, since the subscriber wouldn't necessarily have to set up and maintain a web service or lambda to listen for notifications.

In general, this method would help subscribers stay loosely coupled to the blue box. It would also reduce the burden on blue box for retrying failed notifications to subscribers.

We saw issues in a recent 1k bundle scale test where the secondary analysis service missed notifications because we had trouble keeping up with the rate at which notifications were sent. Moving to Pub Sub would remove this problem.

I've also been asked whether Pub Sub is an option by others interested in consuming notifications from blue box.

hannes-ucsc commented 6 years ago

@dshiga, some of the issues you mention will be addressed by the upcoming reliable notifications PR. If the subscriber misses some notifications, Blue will retry with exponential back-off, as long as the subscriber 1) doesn't return a 2xx when it in fact dropped the notification and 2) does return a 2xx if it did process the notification. During the 1k bundle run you mention, the numbers I ran indicate that Green broke rule 2: there were 160 or so 5xx responses but only 51 missed bundles.
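For illustration, a minimal sketch of the retry semantics described above (the function and parameter names are hypothetical, not the actual DSS code):

```python
import time

import requests


def deliver_notification(url, payload, max_attempts=6, base_delay=1.0):
    """Deliver one webhook notification, retrying with exponential back-off.

    A 2xx response is taken to mean the subscriber processed the
    notification; anything else (or a connection error) triggers a retry.
    """
    for attempt in range(max_attempts):
        try:
            response = requests.post(url, json=payload, timeout=10)
            if 200 <= response.status_code < 300:
                return True  # subscriber acknowledged the notification
        except requests.RequestException:
            pass  # treat network failures like a non-2xx response
        time.sleep(base_delay * 2 ** attempt)  # exponential back-off
    return False  # out of attempts; the notification counts as missed
```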

> It would also reduce the burden on blue box for retrying failed notifications to subscribers.

I think Blue is committed to supporting webhooks and therefore needs to manage its own queues for reliable notifications. Unless we can completely eliminate webhook support, that burden is not reduced by adding this feature.

However, let's say you're not happy with the semantics even after Blue merges reliable notifications and you would still like to use your own queues: I think we can get this feature almost for free if we use signed POSTs against SQS. The signed URL would be the endpoint URL for the webhook. The reliable notifications PR improves the flexibility of configuring the webhook: you can choose between PUT and POST, and between JSON and form-encoded bodies. We use this internally for testing, with a signed S3 URL as the webhook, so that a notification causes an object to be created in S3. This should be possible with SQS, too, since it also supports presigned URLs: http://boto3.readthedocs.io/en/latest/reference/services/sqs.html#SQS.Client.generate_presigned_url
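For concreteness, a hedged sketch of presigning an SQS `send_message` call with boto3 (the queue URL is a placeholder). Note that every parameter, including `MessageBody`, is baked into the signature, which is the limitation raised below:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-west-2")

# Placeholder queue URL for illustration.
queue_url = "https://sqs.us-west-2.amazonaws.com/123456789012/example-queue"

# Presign a send_message call. All parameters, including MessageBody,
# are signed into the query string, so this URL can only ever deliver
# this exact message body.
presigned_url = sqs.generate_presigned_url(
    ClientMethod="send_message",
    Params={"QueueUrl": queue_url, "MessageBody": "fixed payload"},
    ExpiresIn=3600,
)
print(presigned_url)
```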

There may be some minor technical details to work out. For example, we don't support GET as a method, or url-encoded bodies.

My recommendation is that we wait and see how the reliable notification feature improves the situation before committing to this.

hannes-ucsc commented 6 years ago

I don't think signed PUT/POST is going to work because they don't allow for varying payloads.

But again, let's see how reliable notifications improve the situation before we decide. The feature was merged last Friday, but I had to back it out because it broke the CLI. I've since filed a replacement PR that addresses that. Hopefully it'll get merged today: #1178 and #1171.

dshiga commented 6 years ago

Thanks for your thoughts on this @hannes-ucsc. I'm confident that reliability will be improved by your PR. My concern is more about rate limiting.

Currently we propagate bursts of notifications through all the internal components of the green box and make no effort to queue anything. That didn't work so well in the 1k scale test, as not everything inside the green box dealt well with the notification rates we saw. So we are going to need queuing in the green box, and the most sensible place to do it is at the very beginning, where notifications first enter the system.

We can and will build this if needed, but I would prefer not to have to maintain a web service whose purpose is merely to receive callbacks and immediately insert them into a queue. It would be preferable if the blue box could write notifications directly to a Pub Sub topic or SQS queue for us.

I am okay with waiting and seeing how the next scale test goes, but that is probably a couple of weeks off at best and it means there will be less time for blue box to implement something along these lines if needed.

hannes-ucsc commented 6 years ago

Makes sense.

Which one would you prefer, PubSub or SQS?

How would we handle authentication? Generally speaking, we can't assume that the subscriber's queue lives in the same GCP project or AWS account as DSS.

dshiga commented 6 years ago

Pub Sub would be my preference, since the other green box components are in GCP.

For authentication, I think green box would create a Pub Sub topic and then grant access to a GCP service account created by blue box.
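A rough sketch of that grant using the google-cloud-pubsub client library (the project, topic, and service account names are placeholders; the same grant could be made in the console or with gcloud):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("green-project", "dss-notifications")  # placeholders

# Fetch the topic's IAM policy and grant the blue box's service account
# permission to publish to it.
policy = publisher.get_iam_policy(request={"resource": topic_path})
policy.bindings.add(
    role="roles/pubsub.publisher",
    members=["serviceAccount:dss-notify@blue-project.iam.gserviceaccount.com"],
)
publisher.set_iam_policy(request={"resource": topic_path, "policy": policy})
```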

hannes-ucsc commented 6 years ago

Still trying to make this work with pre-signed requests, but I googled intensely and cannot find any way to pre-sign a Publish request to SQS, SNS, or PubSub where the requester can vary the message, which is key here. The SQS and SNS APIs are old and not very RESTy: all parameters go into the body of the POST, and HMAC is used to sign them all, including the message body. I don't think GCP uses HMAC, just token auth over TLS, but I am ridiculously oblivious of auth in general, so take this with a grain of salt.

I propose we hack direct SQS delivery for now by way of a special proprietary URL scheme, say, `dss+https://us-west-2.queue.amazonaws.com/123456789123/green-queue.fifo`, and settle auth via IAM profiles. You'd be responsible for setting up the queue and for authorizing our dss-notify lambda (or dss-index, not sure yet) to push to that queue via IAM.
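A sketch of how the notifier might dispatch on such a URL (the scheme handling is hypothetical; `send_message` is the real SQS call, authorized by the IAM role of the calling lambda):

```python
import json

import boto3
import requests


def notify(endpoint_url, payload):
    """Deliver one notification, special-casing the dss+https:// scheme."""
    if endpoint_url.startswith("dss+https://"):
        # Strip the proprietary prefix to recover the real queue URL;
        # the region is taken from the lambda's environment.
        queue_url = endpoint_url.replace("dss+", "", 1)
        boto3.client("sqs").send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(payload),
        )
    else:
        # Ordinary webhook delivery.
        requests.post(endpoint_url, json=payload, timeout=10)
```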

hannes-ucsc commented 6 years ago

Didn't see your post until now. But yes, the same proprietary URL hack would work for PubSub.

kislyuk commented 6 years ago

I don't think we should build this into the DSS API. I think we should stay implementation-agnostic and keep the general-purpose webhook push notifications as the primary delivery mechanism. However, we could easily provide a reference implementation of a relay service (Lambda- or GCF-based) that would take these notifications and forward them to an SNS topic, an SQS queue, or a PubSub topic.

mweiden commented 6 years ago

@dshiga I'm going to second @kislyuk's call here for a simple GCF or Lambda to translate the webhook pushes into SNS/SQS/PubSub messages. We can reinvestigate this once we are past the pilot phase and if more clients call for queueing functionality.

Let's change the scope of this issue to providing a reference implementation for GCP PubSub.
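For reference, a minimal sketch of such a relay as an HTTP-triggered Google Cloud Function (the project and topic names are placeholders, not an agreed-upon design):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("green-project", "dss-notifications")  # placeholders


def relay(request):
    """HTTP Cloud Function: accept a DSS webhook POST and republish it.

    The function's trigger URL would be registered as the subscription's
    callback; each notification body is forwarded verbatim to Pub/Sub.
    """
    publisher.publish(topic_path, request.get_data()).result()  # wait for publish
    return ("", 200)  # a 2xx tells the DSS the notification was handled
```

Subscribers would then consume from the topic at their own pace, which gives the green box queue semantics without any change to the DSS API.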