chrismeyersfsu opened 2 years ago
We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
I actually did start using this data as I added the expand/collapse behavior (https://github.com/ansible/awx/pull/11312) — but if there are performance reasons to change this, we can revisit that.
However, on the Job Output page, the UI only uses a few pieces of data from job events: the stdout, counter, uuid, and parent_uuid (off the top of my head, there may be a couple other items used) -- we can probably trim a lot of the rest out, especially over the websocket.
> We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
I do not quite understand. Isn't the number of events the counter? Could someone elaborate on what exactly the UI is looking for here?
See also: https://github.com/ansible/awx/issues/11486
We could explore moving useWebsocket to a context, and re-use the same client and connection in different parts of the application.
This may need to be further broken down. TBD; possibly sprint 3?
Some thoughts on this
There is a different backend for Django Channels which uses a Redis PubSub implementation. It's considerably faster than the default Redis channel layer (peak throughput of 3,000 messages per second, as opposed to 300 per second, in my own testing). It's still experimental, so I'm not sure if we can trust it in production.
Still, 3,000 per second is probably not enough. Jobs are throttled to send 30 events per second, so 100 simultaneous jobs would max out a PubSub backend. Worse -- we currently send all events to ALL other control nodes. We do this even if there are no clients listening for those events.
Ideally we'd only send events if we knew there were listeners on the other end. The PubSub implementation almost does this for us, because this publish call will return the number of subscribers for that channel:

> Publish message on channel. Returns the number of subscribers the message was delivered to.
That would allow us to do something like
num_subscribers = consumer.send(event)
if num_subscribers == 0:
    pass  # no listeners; don't send any more for a while
But as you can see, there is no return value in the call to .publish() (likely to keep PubSub consistent with the default Redis channels implementation).
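One hedged workaround sketch (not something channels_redis gives us today): ask Redis directly how many subscribers a channel has via PUBSUB NUMSUB before publishing. The channel name below is made up for illustration; the real channel layer derives its own internal channel names, and there is an obvious race between the check and the publish.

```python
# Sketch only: check subscriber count with PUBSUB NUMSUB before publishing.
# The channel name "broadcast-job-events-123" is invented for illustration;
# channels_redis derives its own internal channel/group key names.
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def publish_if_anyone_listens(channel, event):
    # pubsub_numsub returns [(channel, subscriber_count), ...]
    _, num_subscribers = r.pubsub_numsub(channel)[0]
    if num_subscribers == 0:
        return 0  # nobody is listening; skip (and maybe back off for a while)
    return r.publish(channel, json.dumps(event))


publish_if_anyone_listens("broadcast-job-events-123", {"counter": 1, "stdout": "ok"})
```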
Another consideration is using a single Redis instance backend for all control nodes. This eliminates the need for our websocket broadcast backplane, as the single Redis instance will correctly push the events to the right client for us. Having a single Redis instance (plus maybe sharding) is the recommended way to scale Django Channels: https://channels.readthedocs.io/en/1.x/deploying.html#scaling-up
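For illustration, a hedged sketch of what that could look like in Django settings, with placeholder hostnames: every control node points its channel layer at the same (optionally sharded) Redis, while keeping a node-local Redis for caching and other local use.

```python
# settings.py sketch (hostnames are placeholders): all control nodes share the
# same channel-layer Redis, optionally sharded across several instances.
CHANNEL_LAYERS = {
    "default": {
        # or "channels_redis.pubsub.RedisPubSubChannelLayer" for the
        # experimental PubSub layer discussed above
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # channels_redis consistently hashes channels/groups across these hosts
            "hosts": [
                ("channels-redis-0.example.com", 6379),
                ("channels-redis-1.example.com", 6379),
            ],
        },
    },
}

# The node-local Redis stays separate for caching and other local use cases
# (Django 4+ built-in Redis cache backend shown here as an example).
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://localhost:6379/1",
    },
}
```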
> Another consideration is using a single Redis instance backend for all control nodes.
We do have node-local things we use redis for. Are you thinking we could run another instance of redis for this purpose?
Yep, we'd still have a redis instance per control node for Django caching and other local use cases.
Today, the question was posed: "Can we track if we have any clients listening for certain groups, and only send events for groups that have clients?"
From what @fosterseth said, we would need to broadcast to the other nodes which clients (and groups) each node has.
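A minimal sketch of that idea, with made-up names (GroupRegistry, the broadcast payload) rather than existing AWX code: each node counts its local subscribers per websocket group and periodically shares the set of non-empty groups with its peers, so peers can skip relaying events nobody is watching.

```python
# Hypothetical sketch: per-node tracking of websocket group membership that is
# periodically broadcast to peer control nodes. Names are illustrative only.
import time
from collections import Counter


class GroupRegistry:
    """Tracks how many local websocket clients are subscribed to each group."""

    def __init__(self):
        self.local_counts = Counter()  # group_name -> local subscriber count
        self.remote_groups = {}        # node_id -> set of groups with subscribers

    def on_subscribe(self, group_name):
        self.local_counts[group_name] += 1

    def on_unsubscribe(self, group_name):
        self.local_counts[group_name] -= 1
        if self.local_counts[group_name] <= 0:
            del self.local_counts[group_name]

    def snapshot(self):
        """Payload a node would broadcast to its peers every few seconds."""
        return {"groups": sorted(self.local_counts), "ts": time.time()}

    def update_peer(self, node_id, payload):
        self.remote_groups[node_id] = set(payload["groups"])

    def anyone_listening(self, group_name):
        """Consulted before relaying an event for group_name anywhere."""
        return (
            group_name in self.local_counts
            or any(group_name in groups for groups in self.remote_groups.values())
        )
```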
Websockets
An event is the data artifact of an Ansible callback call. This event flows from Ansible playbooks -> runner -> receptor -> redis -> callback receiver -> (postgres, external logger, websocket)
websocket -> (all other Controller control nodes, all subscribed websocket clients)
All events go to all other Controller nodes and out to all clients that are subscribed to the job. Controller does not filter based on event type or anything like that. This is a problem IF the receiver assumes the events sent over the websockets are reliably delivered, because they are not. This becomes apparent as the rate at which we create websocket events exceeds our capacity to deliver them. This is what is happening as we build out more and more scale features in Controller: we allow for running more jobs in parallel, which creates more events per second, which requires the websockets to send more events per second, and requires the UI to display more events per second.
But it doesn't have to be this way. We need to decide how we are going to solve this problem.
1. Increase number of websockets we can reliably handle
This section looks at solving the websocket problem through a performance lens. Since it's a performance problem, we should do all the classic performance things:
A. Agree on a target workload we want the system to support
B. Quantify the current performance
C. Establish the maximum performance of the current architecture and technologies
The solution may require reworking the websocket subsystem. We use the term "rework" instead of "rewrite" because we don't want to restrict ourselves to replacing the existing system. For example, we should consider sharding as a solution to meet the target workload requirement(s)
A. Agree on a target workload we want the system to support
This is our target. It's hard to hit a moving target so let's fix it.
Let's work backwards. How many events per second are going to be created? An event is only created after it is saved into Postgres, so Postgres is our bottleneck. How many events per second can Postgres save? 30,000. How did we get that number? Some experimentation a long time ago. The number could use some updating; it will surely have changed after the DB partitioning work. I believe it is heavily dependent on the number of indexes.
Presume 30,000 events/s max
Working backwards even more. Let's look at a real customer Postgres events/s workload that I have seen.
Presume 4,200 events/s max
TODO: Is this even doable from a networking bandwidth perspective? (e.g. 1 Gbit/s, 10 Gbit/s, 100 Gbit/s links)
We have now presented two potential targets. The first is the maximum Postgres can generate. The second is the maximum we have seen in the real world. A third target would consider optimizing our event insertion rate to get Postgres to, say, 100,000 events per second.
The 30,000 events per second is what we will target in this document.
B. Quantify the current performance
TODO: We at least know this should be less than C. Doing this work would further validate the work done in C, i.e. if the measured performance in B is greater than C, then we did B or C wrong.
C. Establish the maximum performance of the current architecture and technologies
Let's start by looking at the maximum performance using the current technology, i.e. Django Channels.
Technology Maximum
~200 events per second max to a single websocket client. The response time from job event creation to the time it reaches the browser also begins to grow as you hit this threshold. The response time grows because the events are being queued. As the queue grows, it hits a limit, and when the limit is hit, new websocket events are dropped.
https://github.com/chrismeyersfsu/channels_redis_debug
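For context, a rough sketch of the kind of probe that produces a number like that; the URL and the subscribe payload shape are assumptions, and authentication (AWX uses the session cookie) is left out of the sketch.

```python
# Rough throughput probe: count messages per second arriving on a websocket.
# The URL and the subscribe payload are assumptions; auth is omitted here.
import asyncio
import json
import time

import websockets  # pip install websockets


async def measure(url, seconds=10):
    received = 0
    async with websockets.connect(url) as ws:
        # subscribe to a job's event group (shape mirrors what the UI sends)
        await ws.send(json.dumps({"groups": {"job_events": [1234]}}))
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            await ws.recv()
            received += 1
    print(f"{received / seconds:.0f} events/s")


asyncio.run(measure("wss://awx.example.com/websocket/"))
```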
Architecture Maximum
Under our current architecture we require sending all events to all servers. The websocket backplane is fully connected, and for any event we send it to the N-1 other servers. The table below shows the bandwidth requirements to send the events across the websocket backplane.
Bandwidth requirements for 30,000 events/s at 2 KB per event
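As a rough back-of-the-envelope version of those requirements, derived only from the two figures above (30,000 events/s, ~2 KB per event); real event sizes vary.

```python
# Back-of-the-envelope backplane bandwidth at the 30,000 events/s target,
# assuming ~2 KB per serialized event and a fully connected backplane where
# every event is relayed to the N-1 other control nodes.
EVENTS_PER_SECOND = 30_000
EVENT_SIZE_BYTES = 2 * 1024

for n_nodes in (2, 3, 5, 10):
    bytes_per_second = EVENTS_PER_SECOND * EVENT_SIZE_BYTES * (n_nodes - 1)
    print(f"{n_nodes} nodes: {bytes_per_second * 8 / 1e9:.1f} Gbit/s out of each node")
```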
2. Do not send events over websockets
There are two "modes": the job is believed to be running vs. the job is known to not be running.
We have learned that the UI doesn't ACTUALLY need the job events over the websockets to operate correctly. What the UI needs is a periodic update of the NUMBER of events that have been generated.
Off the top of my head: the API can generate this count periodically.
Can we just completely drop sending job events over the websockets? The UI doesn't NEED them and can operate without them. In fact, it would reduce UI complexity to drop this code; the UI today has to code around this terrible API deficiency.
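A hedged sketch of the periodic-count idea, assuming a simple task run every few seconds; the group name format and the message shape the consumer expects are assumptions, not current AWX code, although Job and JobEvent are the real models.

```python
# Sketch: instead of relaying every job event over the websocket, periodically
# broadcast how many events each running job has produced. The group naming
# and message shape below are assumptions for illustration.
import json

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

from awx.main.models import Job, JobEvent


def broadcast_event_counts():
    """Run every few seconds from a periodic task / the dispatcher."""
    channel_layer = get_channel_layer()
    for job_id in Job.objects.filter(status="running").values_list("id", flat=True):
        payload = {
            "group_name": "job_events",
            "job": job_id,
            "event_count": JobEvent.objects.filter(job_id=job_id).count(),
        }
        async_to_sync(channel_layer.group_send)(
            f"job_events-{job_id}",  # assumed group naming
            {"type": "internal.message", "text": json.dumps(payload)},  # assumed consumer message shape
        )
```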
3. Hybrid. Deliver a subset of events reliably over websockets.
This path would allow us to build feature(s) in the UI that rely on displaying a subset of job events. Note that a feature requiring only a subset of the events doesn't exist today, so we would be building for an unknown future feature here.
The hard engineering problem here is quantifying the subset that the websocket subsystem can reliably deliver. This solution borders very closely on the work in 1. above.
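To make "a subset of events" concrete, here is an illustrative filter on callback event type; the allow list is an example of low-volume, high-signal events, not a proposal for the actual subset.

```python
# Illustrative filter for option 3: only a small, low-volume subset of callback
# event types would ever be handed to the websocket layer; everything else is
# persisted to Postgres only. The allow list is an example, not a proposal.
WEBSOCKET_EVENT_TYPES = {
    "playbook_on_start",
    "playbook_on_play_start",
    "playbook_on_task_start",
    "runner_on_failed",
    "runner_on_unreachable",
    "playbook_on_stats",
}


def should_emit_over_websocket(event):
    return event.get("event") in WEBSOCKET_EVENT_TYPES
```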