ManageIQ / miq_bot


Send Webhooks (all) to Amazon SQS #281

Open chriskacerguis opened 7 years ago

chriskacerguis commented 7 years ago

Looking at the overall landscape of how we want the BOT to work, I think it would work well to have all webhooks sent to Amazon SQS (FIFO queue). That way, (1) we can ensure that the BOT will still get the messages even if it is down, (2) we can conserve API calls, (3) we can process faster (since we don't need to worry about API limits), and (4) we can negate the need for separate public and private bot instances.

Webhooks should be able to be sent directly to SQS; if not, I would suggest that a simple AWS Lambda function would work.
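If we did need Lambda, the forwarder could be tiny. A minimal sketch, assuming a Ruby runtime, the aws-sdk-sqs gem, and a hypothetical QUEUE_URL environment variable (none of this is an existing integration):

```ruby
# Hypothetical Lambda handler: forwards a GitHub webhook payload from an
# API Gateway proxy event to our SQS FIFO queue. All names are assumptions.
require "aws-sdk-sqs"

SQS = Aws::SQS::Client.new

def handler(event:, context:)
  SQS.send_message(
    queue_url:                ENV["QUEUE_URL"],
    message_body:             event["body"],      # raw webhook JSON from GitHub
    message_group_id:         "github-webhooks",  # required for FIFO queues
    # GitHub sends a unique GUID per delivery; reuse it for deduplication.
    message_deduplication_id: event.dig("headers", "X-GitHub-Delivery")
  )
  { statusCode: 200, body: "queued" }
end
```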

Costs seem to be minimal:

SQS -> $0.50 per 1 million requests after the first 1 million
Data -> $0.90 per GB after the first GB

If we need Lambda, the cost is $0.20 per 1 million requests after the first million.

So, the thought would be to have all webhooks for all the repos go to SQS, and then the BOT can pull from SQS at an interval that we select; I would go with every 3 minutes.
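A rough sketch of what that polling loop might look like, assuming the aws-sdk-sqs gem; the queue URL configuration and the ProcessWebhook handler are hypothetical stand-ins:

```ruby
# Hypothetical poller: drains the webhook queue on an interval and hands each
# payload to the bot for processing.
require "json"
require "aws-sdk-sqs"

sqs       = Aws::SQS::Client.new(region: "us-east-1")
queue_url = ENV["WEBHOOK_QUEUE_URL"] # assumed configuration

loop do
  resp = sqs.receive_message(
    queue_url:              queue_url,
    max_number_of_messages: 10,
    wait_time_seconds:      20 # long polling, to cut down on empty receives
  )

  resp.messages.each do |msg|
    payload = JSON.parse(msg.body)
    ProcessWebhook.call(payload) # stand-in for whatever the bot does with an event
    sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
  end

  sleep 180 # the "every 3 minutes" interval suggested above
end
```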

Fryguy commented 7 years ago

@chrisarcand @bdunne Thoughts?

Fryguy commented 7 years ago

So, initial thoughts from me after thinking about it this morning (ones you haven't already answered):

can pull from SQS at an interval that we select; I would go with every 3 minutes.

I thought part of the reason for moving to webhooks was more "realtime" responses, but this changes that, right? If we are polling SQS versus the Notifications API, what is the difference (aside from API rate limits, but @chrisarcand has already identified places where we can improve that greatly)?

chriskacerguis commented 7 years ago

We can go faster than 3 minutes; I just picked that as it is fairly fast, but not so fast that it would cause the BOT issues (I don't know anything about the server the BOT runs on). We don't necessarily know how much data the BOT will pull down, so we want to make sure that we don't crush the server while the BOT is processing. Ideally, I'd love to see it checking every few seconds.

As far as other benefits, here's what I see:

Fryguy commented 7 years ago

Overall I like the idea, and I don't think it will be that much more invasive, code-wise. Do you have some thoughts on implementation? My initial thoughts are:

NickLaMuro commented 7 years ago

My quick 2 cents / questions:

I think my main point is the third one. I think webhooks in general are a fine future approach and probably should be a focus, but it seems like this is just muddying the water a bit to get there. Mostly what we need for webhooks to work is to get a public-facing (hopefully auth'd) endpoint that puts jobs into our queue backend (currently redis). Swapping out redis for SQS to do that, in my humble opinion, doesn't provide any benefit unless it somehow gives us a public endpoint for free.
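To be concrete, a sketch of the kind of frontend I mean, assuming Sinatra sitting in front of our existing Sidekiq/redis backend; the worker name is hypothetical:

```ruby
# Hypothetical public-facing receiver: accepts GitHub webhook POSTs and drops
# them onto the bot's existing Sidekiq (redis-backed) queue.
require "sinatra"
require "sidekiq"

# Stand-in for a worker the bot would define; the name is assumed.
class WebhookEventWorker
  include Sidekiq::Worker
  def perform(event_type, payload_json)
    # ... existing bot processing would go here ...
  end
end

post "/webhooks" do
  payload = request.body.read
  WebhookEventWorker.perform_async(request.env["HTTP_X_GITHUB_EVENT"], payload)
  status 202 # accepted; processing happens asynchronously
end
```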

Some other things of note:

Fryguy commented 7 years ago

How would developers be able to test features when we are using an external queueing service?

That is really interesting...I hadn't thought of that at all. I'd assume that each dev would have to set up an SQS queue for their own repos/orgs, and we'd have to document that.

chrisarcand commented 7 years ago

That is really interesting...I hadn't thought of that at all. I'd assume that each dev would have to set up an SQS queue for their own repos/orgs, and we'd have to document that.

For any OSS contributor (of which there are probably none), yes. For Red Hatters working on the bot, we can have a development environment SQS queue that anyone can use. I (and @NickLaMuro) have done this in the past.

Fryguy commented 7 years ago

Mostly what we need for webhooks to work is to get a public-facing (hopefully auth'd) endpoint that puts jobs into our queue backend (currently redis)

The main idea for SQS came from my concern about how we would handle a missed webhook. Without something "always up", missing an event could be bad. For example, if we miss the "new PR" event, we won't be able to track that PR at all, or we'd have to code in some clunky workarounds where, if we see an "update PR" event for a PR we don't have, we have to create it. The SQS idea avoids that with an "always up" service.

NickLaMuro commented 7 years ago

@chriskacerguis I think what you described as your three points for switching to SQS is countered by what I described above, but I will say this more directly here:

I think what you are looking for is not a switch of the datastore, but a high-availability public-facing api frontend for the bot to process webhooks (basically, dump the webhooks into the proper queue in our datastore), and I don't think swapping Redis for SQS solves this terribly efficiently.

I think separating this API into its own process would be simple enough. Modifying our personal deploy scripts so this doesn't get restarted whenever the bot is deployed, or making it restart in a way where we do rolling restarts, is all that is really necessary here (and making it public facing, of course).

Two projects I quickly found do this already (this is for SQS specifically, but it furthers the point I made above):

And they both are, effectively, the web-app frontend that I described we would need to build. The other "Amazon Method" for doing this also adds Lambda (aka #moreMoney™) to the equation:

And this is the only place where I have seen someone trying to post to an SQS queue directly:

http://stackoverflow.com/questions/26041847/github-webhook-not-posting-to-amazon-sqs-queue-due-to-unknownoperationexception

And he wasn't able to fill in the API with the message data directly. Also, if we were to implement this via a Stack Overflow post, we would probably be doing it wrong.

Fryguy commented 7 years ago

Re: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/limits-messages.html

By default, a message is retained for 4 days. The minimum is 60 seconds (1 minute). The maximum is 1,209,600 seconds (14 days).

If the bot is down for a weekend, that might become an issue with the default. It seems less likely, but still. I don't understand the minimum and maximum...are those configurable? If so, why wouldn't you always just pick 14 days?

The minimum message size is 1,024 bytes (1 KB)

Could this be a problem? (I have no idea what payload sizes are)

The maximum is 262,144 bytes (256 KB).

Same question

The maximum visibility timeout for a message is 12 hours.

I don't understand this as compared to the retention value above.
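(For reference: the docs describe both values as per-queue attributes, set at queue creation or via SetQueueAttributes, so they are configurable. A minimal sketch with the aws-sdk-sqs gem; the queue name and values here are made up:)

```ruby
# Hypothetical: create the webhook queue with the maximum 14-day retention.
# Retention is distinct from the visibility timeout, which only controls how
# long an in-flight message is hidden from other consumers before it becomes
# receivable again.
require "aws-sdk-sqs"

sqs = Aws::SQS::Client.new(region: "us-east-1")
sqs.create_queue(
  queue_name: "miq-bot-webhooks",           # assumed name
  attributes: {
    "MessageRetentionPeriod" => "1209600",  # 14 days, in seconds
    "VisibilityTimeout"      => "300"       # 5 minutes per receive
  }
)
```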

chrisarcand commented 7 years ago

@Fryguy I'm just going to generally say "I wouldn't worry about that." to all your concerns above, tbh.

bdunne commented 7 years ago

While I'm not against webhooks, my concerns about using them are:

Implementation concerns for webhooks via Amazon SQS:

The idea as a whole just sounds to me like polling notifications from Github vs polling Github notifications from somewhere else (it just saves Github API calls). I think we are still far enough away from being able to consume webhook notifications and do anything useful with them that I am not too concerned about how it is implemented. However, this is a great place to brainstorm.

Overall, I think there are lots of quicker wins to free up API calls (which @chrisarcand is looking into). We never bothered in the past because we had many more than we needed and preferred to spend time adding more value.

In the past, @Fryguy and I discussed splitting the current notification monitor into a reader and one or more writers, and that would be a pre-req. This would also allow for other interaction sources (gitter, IRC, e-mail, twitter, etc.).

chriskacerguis commented 7 years ago

Okay, lots of stuff to respond to (so apologies if I miss something):

@NickLaMuro ...

I don't think there is a sidekiq worker that works with SQS directly

I don't know, but there are plenty of ways to interact with it. We could also make one and then open source it.

How would developers be able to test features when we are using an external queueing service?

I think someone answered this, so I won't repeat it, but essentially we can have 2 duplicate queues (one for prod and one for dev).

Can we publish directly to a queue with a public API?

GitHub has direct support for SQS integration.

I think what you are looking for is not a switch of the datastore, but a high-availability public-facing api frontend

I'm not looking for anything other than getting my requirements met for how my team works. That said, @Fryguy and @chessbyte have brought up a point about availability, and this solves it. The fact is that SQS is an HA solution, and what we have isn't. So, it seems that the best thing would be to let AWS do it.

Profit... oh wait... sorry... expenses... because AWS ain't free

Costs are so minimal it's not even funny. I doubt we would even hit $10 per month. Yes, I saw you were half joking, but I did want to address that.

@Fryguy ...

If the bot is down for a weekend that might become an issue with the default.

We can use the Dead Letter Queue (DLQ) to address this.
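(Roughly, that's a RedrivePolicy attribute on the main queue; a minimal sketch with the aws-sdk-sqs gem, with the DLQ ARN, receive count, and queue URL configuration all assumed:)

```ruby
# Hypothetical: attach a dead-letter queue so messages that repeatedly fail
# processing are moved aside for inspection instead of being lost.
require "json"
require "aws-sdk-sqs"

sqs = Aws::SQS::Client.new(region: "us-east-1")
sqs.set_queue_attributes(
  queue_url: ENV["WEBHOOK_QUEUE_URL"], # assumed configuration
  attributes: {
    "RedrivePolicy" => {
      deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:miq-bot-webhooks-dlq", # assumed ARN
      maxReceiveCount:     5 # move to the DLQ after 5 failed receives
    }.to_json
  }
)
```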

Payload sizes

It shouldn't be an issue.

@bdunne ...

Something public has to be listening for them (since the bot is currently behind a firewall)

That's why we want to use SQS, so the BOT doesn't need to be public. This actually addresses all 3 of your points.

What will this cost for a typical month?

I doubt we are doing more than 1 million messages per month.

How much maintenance will there be?

If we use SQS, there should be very little; GitHub integrates with SQS, so we just have the BOT to contend with.

How will the service be monitored (to ensure uptime)?

AWS touts SQS as HA. That said, we can have Nagios monitor it, if needed.

The idea as a whole to me just sounds like polling notifications from Github vs polling Github notifications from somewhere else (just saves Github API calls)

Really, yes, that's what we are doing, but I don't agree with your statement about conserving API calls. Even with optimizations, which I KNOW @chrisarcand will do, the fact is that we are adding more repos and more contributors, and that increases the number of calls we will make. From an architecture perspective, this is the right thing to do for the long run; we will have to do this at some point. So, I think that by both optimizing the calls we make to the API (I'm sure we will still have to make some) and using webhooks, we will make the BOT much more scalable.

chrisarcand commented 7 years ago

From an architecture perspective this is the right thing to do for the long run.

Agreed

NickLaMuro commented 7 years ago

@chriskacerguis

Can we publish directly to a queue with a public API?

GitHub has direct support for SQS integration.

Where? Besides the doc that I provided in https://github.com/ManageIQ/miq_bot/issues/281#issuecomment-276790362, which is from Amazon and uses Lambda, I have been unable to find documentation by github describing this.

All of the links from the aforementioned comment are from me doing a google search of github webhook sqs, almost in order of appearance. And searching the help docs directly:

https://help.github.com/search/?utf8=%E2%9C%93&q=SQS+webhook

Returns nothing.

AGAIN: I am not saying I disagree with the notion of webhooks (in fact, I have said I agree with it on multiple occasions), but I think using SQS as a vehicle to get us there is a roundabout and costly way of doing it (developer time and direct cost).

bdunne commented 7 years ago

Again, I'm not against eventually using webhooks in the bot because I do see the benefits, but...

@chriskacerguis The issue title and description describe a possible implementation of a solution to a problem. Can we step back and define the problem that we're trying to solve either in the title & description of this issue or in a new issue if you just want to discuss SQS here? Is it just the rate limit on the Github API that we recently hit? I thought the webhooks discussion started before the rate limit discussion, so I may be missing something.

I'm not looking for anything other than getting my requirements met for how my team works.

It may not be appropriate to expand in this issue itself (due to a potentially larger scope), but are the requirements of your team defined somewhere that we can link to?

chriskacerguis commented 7 years ago

@NickLaMuro ...

You can find it in Integrations and Services (see screenshot).

I think using SQS as a vehicle to get us there is a roundabout and costly way of doing it (developer time and direct cost)

My thinking is this...

Costs for AWS:

At that point, you are done with the feature (unless we add other things)

Costs for "in house":

are the requirements of your team defined somewhere that we can link to?

Yes and no. I have added 90% of what we need to the BOT issues area. But one big area we need to address (for my team, and really I think the whole team) is the syncing of all our "tracking" tools (BZ, Pivotal, GH Issues, etc.), and for this we will need a new architecture (and I think part of that is a queue system).

chriskacerguis commented 7 years ago

Screenshot for real this time.

(screenshot of the repository's Integrations & Services settings page showing the Amazon SQS service)
NickLaMuro commented 7 years ago

@chriskacerguis Thanks, more info copied from that page:

Install Notes

The Amazon SQS service allows GitHub to send a notification to an Amazon SQS queue.

Configure your Amazon AWS account with an appropriately set up access-key/secret-key (Either parent account or IAM) that has permissions to perform 'SendMessage' operations. (https://console.aws.amazon.com/sqs/)

Amazon SQS ARN is the unique code SQS references your queue with. It can be copied from the console or fetched through the API. It takes the form of arn:aws:sqs:us-(region)-(dc):(id):(name). Copy and paste the entire string. This hook parses the necessary details from the arn. You may also specify a SQS Queue URL instead of an ARN if you want to use IAM or a Queue that was created using a different account.

This is more for me than anything, since there don't seem to be docs for this specifically, or for which webhooks are included by it (I don't have experience with github integrations personally, so I am going off the documentation).

Fryguy commented 7 years ago

In private message @chriskacerguis wrote: using SQS is one less thing we need to maintain / code.

That's funny because I was just about to comment the opposite :) I interpret the SQS thing as one more thing we need to maintain...

However, before I jump ahead comparing the two, I wanted to mention that the whole reason this discussion started was because I was concerned a direct webhook interface might miss webhooks when the bot goes down for update or maintenance or anything. If there is a failed notifications thing we can access, then I think my concern is moot, and thus so is the SQS idea. The major upside of direct Rails over SQS is instant feedback. SQS would still require us to poll, which isn't very different from how it is now.


SQS introduces another dependency that needs to be installed (a dev needs to create one, for example, or know to connect to an existing remote instance), and we need something to talk to it, so we have to introduce a new gem (aws-sdk) and code to use the gem.

A Rails endpoint does not constitute a new dependency, as we already use Rails, and it's just a normal controller action (so it still needs code, but no "new" dependencies). Devs will probably have to use ngrok.io or something for testing locally. If we can do a failed-notification thing for GitHub, we'll probably have to code for that too, which would not be necessary with SQS.
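A sketch of how small that controller action could be; the worker name and secret configuration are assumptions, and the X-Hub-Signature check is the HMAC that GitHub sends for secured webhooks:

```ruby
# Hypothetical Rails controller receiving GitHub webhooks. Verifies the
# X-Hub-Signature HMAC before enqueueing, so the public endpoint is auth'd.
require "openssl"

class WebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token # GitHub can't send a CSRF token

  def create
    payload = request.body.read
    return head :unauthorized unless valid_signature?(payload)

    # WebhookEventWorker is a stand-in for a Sidekiq worker the bot would add.
    WebhookEventWorker.perform_async(request.headers["X-GitHub-Event"], payload)
    head :ok
  end

  private

  def valid_signature?(payload)
    secret   = ENV["GITHUB_WEBHOOK_SECRET"] # assumed configuration
    expected = "sha1=" + OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("sha1"), secret, payload)
    Rack::Utils.secure_compare(expected, request.headers["X-Hub-Signature"].to_s)
  end
end
```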

Creating the webhook itself is a wash since both options need to create one.

Fryguy commented 7 years ago

failed notification for GitHub

@chrisarcand @NickLaMuro I believe I heard or read one of you saying this exists, but I cannot find it. Do you have links to information?

NickLaMuro commented 7 years ago

@Fryguy second point from here: https://github.com/ManageIQ/miq_bot/issues/281#issuecomment-276785310

I was about to comment that there is still a maintenance cost with permissioning SQS when people need access or access needs to be revoked, or with requiring devs to sign up for SQS themselves, creating a barrier to entry for everyone working on the bot.

NickLaMuro commented 7 years ago

Sorry, I misread your comment.

I tested this with a spike I did with webhooks last night, and there is a picture I shared (privately) where this comment was able to be re-delivered (even though it received a 200 from the endpoint), but I want to look it over and make sure it doesn't have any personal information in the image before posting it here.

chrisarcand commented 7 years ago

I was about to comment that there is still a maintenance cost with permissioning SQS

Yes. I do not consider the idea of "SQS outsources the need for Ops; we're a software shop" to be valid. AWS has its own set of overhead, even if you don't have to get on a console and manage a queue yourself.

chessbyte commented 7 years ago

However, before I jump ahead comparing the two, I wanted to mention that the whole reason this discussion started was because I was concerned a direct webhook interface might miss webhooks when the bot goes down for update or maintenance or anything. If there is a failed notifications thing we can access, then I think my concern is moot, and thus so is the SQS idea. The major upside of direct Rails over SQS is instant feedback. SQS would still require us to poll, which isn't very different from how it is now.

What is the failed notification thing?

NickLaMuro commented 7 years ago

@chessbyte There is an interface for "Recent Deliveries" for github webhooks that allows you to see the success/failure and request/response, and even retry webhook deliveries, for a given webhook definition. How far back in history that goes, I'll have to do some research/ask friends about.

I don't see that as a replacement for "HA", but as a tool for troubleshooting when something inevitably goes wrong (regardless of whether we use the SQS integration or webhooks with a bot endpoint).

Screenshots (ignore the Dark Theme... changes to github UI through a browser plugin... for my "Special Eyes"):

(two cropped screenshots of the "Recent Deliveries" interface for a webhook)
NickLaMuro commented 7 years ago

Screenshots also exist in the docs: https://developer.github.com/webhooks/testing/

miq-bot commented 4 years ago

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.