chrismeyersfsu opened 2 years ago
We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
I actually did start using this data as I added the expand/collapse behavior (https://github.com/ansible/awx/pull/11312) — but if there are performance reasons to change this, we can revisit that.
However, on the Job Output page, the UI only uses a few pieces of data from job events: the stdout, counter, uuid, and parent_uuid (off the top of my head, there may be a couple other items used) -- we can probably trim a lot of the rest out, especially over the websocket.
> We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
I do not quite understand. Isn't the number of events the counter? Could someone elaborate on what exactly the UI is looking for here?
See also: https://github.com/ansible/awx/issues/11486
We could explore moving useWebsocket to a context, and re-use the same client and connection in different parts of the application.
This may need to be further broken down. TBD; possibly sprint 3?
Some thoughts on this
There is a different backend for Django Channels which uses a Redis PubSub implementation. It's considerably faster than the default Redis channel layer (peak throughput of 3,000 messages per second, as opposed to 300 per second, in my own testing). It's still experimental, so I'm not sure if we can trust it in production.
Still, 3,000 per second is probably not enough. Jobs are throttled to send 30 events per second, so 100 simultaneous jobs would max out a PubSub backend. Worse -- we currently send all events to ALL other control nodes. We do this even if there are no clients listening for those events.
Ideally we'd only send events if we knew there were listeners on the other end. The PubSub implementation almost does this for us, because this publish call will return the number of subscribers for that channel:

> Publish message on channel. Returns the number of subscribers the message was delivered to.
That would allow us to do something like
num_subscribers = consumer.send(event)
if num_subscribers == 0:
    pass  # no listeners; don't send any more for a while
But as you can see, there is no return value in the call to .publish() (likely to keep PubSub consistent with the default Redis channels implementation).
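One hedged workaround sketch (not something channels_redis gives us today): ask Redis directly how many subscribers a channel has via PUBSUB NUMSUB before publishing. The channel name below is made up for illustration; the real channel layer derives its own internal channel names, and there is an obvious race between the check and the publish.

```python
# Sketch only: check subscriber count with PUBSUB NUMSUB before publishing.
# The channel name "broadcast-job-events-123" is invented for illustration;
# channels_redis derives its own internal channel/group key names.
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def publish_if_anyone_listens(channel, event):
    # pubsub_numsub returns [(channel, subscriber_count), ...]
    _, num_subscribers = r.pubsub_numsub(channel)[0]
    if num_subscribers == 0:
        return 0  # nobody is listening; skip (and maybe back off for a while)
    return r.publish(channel, json.dumps(event))


publish_if_anyone_listens("broadcast-job-events-123", {"counter": 1, "stdout": "ok"})
```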
Another consideration is using a single Redis instance backend for all control nodes. This eliminates the need for our websocket broadcast backplane, as the single Redis instance will correctly push the events to the right client for us. Having a single Redis instance (plus maybe sharding) is the recommended way to scale Django Channels: https://channels.readthedocs.io/en/1.x/deploying.html#scaling-up
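For illustration, a hedged sketch of what that could look like in Django settings, with placeholder hostnames: every control node points its channel layer at the same (optionally sharded) Redis, while keeping a node-local Redis for caching and other local use.

```python
# settings.py sketch (hostnames are placeholders): all control nodes share the
# same channel-layer Redis, optionally sharded across several instances.
CHANNEL_LAYERS = {
    "default": {
        # or "channels_redis.pubsub.RedisPubSubChannelLayer" for the
        # experimental PubSub layer discussed above
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # channels_redis consistently hashes channels/groups across these hosts
            "hosts": [
                ("channels-redis-0.example.com", 6379),
                ("channels-redis-1.example.com", 6379),
            ],
        },
    },
}

# The node-local Redis stays separate for caching and other local use cases
# (Django 4+ built-in Redis cache backend shown here as an example).
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://localhost:6379/1",
    },
}
```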
> Another consideration is using a single Redis instance backend for all control nodes.
We do have node-local things we use redis for. Are you thinking we could run another instance of redis for this purpose?
Yep, we'd still have a redis instance per control node for Django caching and other local use cases.
Today, the question was posed: "Can we track if we have any clients listening for certain groups, and only send events for groups that have clients?"
From what @fosterseth said, we would need to broadcast to the other nodes which clients (and groups) each node has.
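A minimal sketch of that idea, with made-up names (GroupRegistry, the broadcast payload) rather than existing AWX code: each node counts its local subscribers per websocket group and periodically shares the set of non-empty groups with its peers, so peers can skip relaying events nobody is watching.

```python
# Hypothetical sketch: per-node tracking of websocket group membership that is
# periodically broadcast to peer control nodes. Names are illustrative only.
import time
from collections import Counter


class GroupRegistry:
    """Tracks how many local websocket clients are subscribed to each group."""

    def __init__(self):
        self.local_counts = Counter()  # group_name -> local subscriber count
        self.remote_groups = {}        # node_id -> set of groups with subscribers

    def on_subscribe(self, group_name):
        self.local_counts[group_name] += 1

    def on_unsubscribe(self, group_name):
        self.local_counts[group_name] -= 1
        if self.local_counts[group_name] <= 0:
            del self.local_counts[group_name]

    def snapshot(self):
        """Payload a node would broadcast to its peers every few seconds."""
        return {"groups": sorted(self.local_counts), "ts": time.time()}

    def update_peer(self, node_id, payload):
        self.remote_groups[node_id] = set(payload["groups"])

    def anyone_listening(self, group_name):
        """Consulted before relaying an event for group_name anywhere."""
        return (
            group_name in self.local_counts
            or any(group_name in groups for groups in self.remote_groups.values())
        )
```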
Websockets
An event is the data artifact of an Ansible callback call. This event flows from Ansible playbooks -> runner -> receptor -> redis -> callback receiver -> (postgres, external logger, websocket)
websocket -> (all other Controller control nodes, all subscribed websocket clients)
All events go to all other Controller nodes and out to all clients that are subscribed to the job. Controller does not filter based on event type or anything like that. This is a problem IF the receiver assumes the events sent over the websockets are reliably delivered, because they are not. This becomes apparent as the rate at which we create websocket events exceeds our capacity to deliver them. This is what is happening as we build out more and more scale features in Controller: we allow for running more jobs in parallel, which creates more events per second, which requires the websockets to send more events per second, and requires the UI to display more events per second.
But it doesn't have to be this way. We need to decide how we are going to solve this problem.
1. Increase number of websockets we can reliably handle
This section looks at solving the websocket problem through a performance lens. Since it's a performance problem, we should do all the classic performance things:
A. Agree on a target workload we want the system to support
B. Quantify the current performance
C. Establish the maximum performance of the current architecture and technologies
The solution may require reworking the websocket subsystem. We use the term "rework" instead of "rewrite" because we don't want to restrict ourselves to replacing the existing system. For example, we should consider sharding as a solution to meet the target workload requirement(s)
A. Agree on a target workload we want the system to support
This is our target. It's hard to hit a moving target so let's fix it.
Let's work backwards. How many events per second are going to be created? An event is only created after it is saved into Postgres, so Postgres is our bottleneck. How many events per second can Postgres save? 30,000. How did we get that number? Some experimentation a long time ago. The number could use some updating; it will surely have changed after the DB partitioning work. I believe it is heavily dependent on the number of indexes.
Presume 30,000 events/s max
Working backwards even more. Let's look at a real customer Postgres events/s workload that I have seen.
Presume 4,200 events/s max
TODO: Is this even doable from a networking bandwidth perspective? (e.g. 1 Gbit/s, 10 Gbit/s, 100 Gbit/s links)
We have now presented two potential targets. The first is the maximum Postgres can generate. The second is the maximum we have seen in the real world. A third target would consider optimizing our event insertion rate to get Postgres to, say, 100,000 events per second.
The 30,000 events per second is what we will target in this document.
B. Quantify the current performance
TODO: We at least know this should be less than C. Doing this work would further validate the work done in C, i.e. if the measured performance in B is greater than C, then we did B or C wrong.
C. Establish the maximum performance of the current architecture and technologies
Let's start by looking at the maximum performance using the current technology, i.e. Django Channels.
Technology Maximum
~200 events per second max to a single websocket client. The response time from job event creation to the time it reaches the browser also begins to grow as you hit this threshold. The response time grows because the events are being queued. As the queue grows, it hits a limit, and when the limit is hit, new websocket events are dropped.
https://github.com/chrismeyersfsu/channels_redis_debug
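For context, a rough sketch of the kind of probe that produces a number like that; the URL and the subscribe payload shape are assumptions, and authentication (AWX uses the session cookie) is left out of the sketch.

```python
# Rough throughput probe: count messages per second arriving on a websocket.
# The URL and the subscribe payload are assumptions; auth is omitted here.
import asyncio
import json
import time

import websockets  # pip install websockets


async def measure(url, seconds=10):
    received = 0
    async with websockets.connect(url) as ws:
        # subscribe to a job's event group (shape mirrors what the UI sends)
        await ws.send(json.dumps({"groups": {"job_events": [1234]}}))
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            await ws.recv()
            received += 1
    print(f"{received / seconds:.0f} events/s")


asyncio.run(measure("wss://awx.example.com/websocket/"))
```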
Architecture Maximum
Under our current architecture we require sending all events to all servers. The websocket backplane is fully connected, and for any event we send it to the N-1 other servers. The table below shows the bandwidth requirements to send the events across the websocket backplane.
Bandwidth requirements for 30,000 events/s at 2 KB per event
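As a rough back-of-the-envelope version of those requirements, derived only from the two figures above (30,000 events/s, ~2 KB per event); real event sizes vary.

```python
# Back-of-the-envelope backplane bandwidth at the 30,000 events/s target,
# assuming ~2 KB per serialized event and a fully connected backplane where
# every event is relayed to the N-1 other control nodes.
EVENTS_PER_SECOND = 30_000
EVENT_SIZE_BYTES = 2 * 1024

for n_nodes in (2, 3, 5, 10):
    bytes_per_second = EVENTS_PER_SECOND * EVENT_SIZE_BYTES * (n_nodes - 1)
    print(f"{n_nodes} nodes: {bytes_per_second * 8 / 1e9:.1f} Gbit/s out of each node")
```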
2. Do not send events over websockets
There are two "modes": the job is believed to be running vs. the job is known to not be running.
We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
Off the top of my head: the API can generate this count periodically.
Can we just completely drop sending job events over the websockets? The UI doesn't NEED them and can operate without them. In fact, it would reduce UI complexity to drop this code; the UI today has to code around this terrible API deficiency.
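A hedged sketch of the periodic-count idea, assuming a simple task run every few seconds; the group name format and the message shape the consumer expects are assumptions, not current AWX code, although Job and JobEvent are the real models.

```python
# Sketch: instead of relaying every job event over the websocket, periodically
# broadcast how many events each running job has produced. The group naming
# and message shape below are assumptions for illustration.
import json

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

from awx.main.models import Job, JobEvent


def broadcast_event_counts():
    """Run every few seconds from a periodic task / the dispatcher."""
    channel_layer = get_channel_layer()
    for job_id in Job.objects.filter(status="running").values_list("id", flat=True):
        payload = {
            "group_name": "job_events",
            "job": job_id,
            "event_count": JobEvent.objects.filter(job_id=job_id).count(),
        }
        async_to_sync(channel_layer.group_send)(
            f"job_events-{job_id}",  # assumed group naming
            {"type": "internal.message", "text": json.dumps(payload)},  # assumed consumer message shape
        )
```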
3. Hybrid. Deliver a subset of events reliably over websockets.
This path would allow us to build feature(s) in the UI that rely on displaying a subset of job events. Note that a feature requiring only a subset of the events doesn't exist today, so we would be building for an unknown future feature here.
The hard engineering problem here is quantifying the subset that the websocket subsystem can reliably deliver. This solution borders very closely on the work in 1. above.
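To make "a subset of events" concrete, here is an illustrative filter on callback event type; the allow list is an example of low-volume, high-signal events, not a proposal for the actual subset.

```python
# Illustrative filter for option 3: only a small, low-volume subset of callback
# event types would ever be handed to the websocket layer; everything else is
# persisted to Postgres only. The allow list is an example, not a proposal.
WEBSOCKET_EVENT_TYPES = {
    "playbook_on_start",
    "playbook_on_play_start",
    "playbook_on_task_start",
    "runner_on_failed",
    "runner_on_unreachable",
    "playbook_on_stats",
}


def should_emit_over_websocket(event):
    return event.get("event") in WEBSOCKET_EVENT_TYPES
```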