akhenry / openmct-yamcs

Open MCT YAMCS plugin

Add load tester for YAMCS to simulate Open MCT traffic #388

Closed scottbell closed 11 months ago

scottbell commented 1 year ago

Summary

Using K6, or by writing a simple Node script, simulate 300 WebSocket clients subscribing to 10Hz telemetry. What is the impact on YAMCS? What happens if you make the clients slow to service the WebSocket messages, simulating a browser under heavy load? Does the CPU utilization scale up linearly, or is there a threshold at which it suddenly jumps up?
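A minimal sketch of the "simple Node script" option, for illustration only. The endpoint URL, instance name, processor, and parameter name are assumptions rather than values taken from this issue, and the subscription message follows the Yamcs WebSocket API as I understand it, so check it against your server version. The helper names `buildSubscribeMessage` and `startClients` are hypothetical.

```javascript
// Sketch of a WebSocket load tester. The endpoint, instance, processor, and
// parameter name below are assumptions; adjust them to your Yamcs setup.
const DEFAULTS = {
  url: "ws://localhost:8090/api/websocket", // assumed Yamcs WebSocket endpoint
  instance: "myproject",                     // assumed Quickstart instance name
  processor: "realtime",
  parameter: "/myproject/Latitude",          // hypothetical parameter name
};

// Build the subscription request each simulated client sends after connecting.
function buildSubscribeMessage(callId, { instance, processor, parameter }) {
  return JSON.stringify({
    type: "parameters",
    id: callId,
    options: { instance, processor, id: [{ name: parameter }] },
  });
}

// Open `count` sockets and subscribe each one. `WebSocketImpl` is injected so
// the sketch can use the global WebSocket (Node 22+, browsers), the `ws`
// package, or a fake in tests.
function startClients(count, WebSocketImpl, config = DEFAULTS) {
  const stats = { opened: 0, messages: 0, closed: 0 };
  const sockets = [];
  for (let i = 0; i < count; i++) {
    const socket = new WebSocketImpl(config.url);
    socket.onopen = () => {
      stats.opened++;
      socket.send(buildSubscribeMessage(i + 1, config));
    };
    socket.onmessage = () => { stats.messages++; };
    socket.onclose = () => { stats.closed++; };
    sockets.push(socket);
  }
  return { sockets, stats };
}
```

Calling `startClients(300, WebSocket)` and logging `stats` once a second would give a rough view of how many clients stay connected and how many messages arrive while CPU and memory are watched on the YAMCS side.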

scottbell commented 1 year ago

~@akhenry @unlikelyzero It looks like the YAMCS Quickstart only fires at 1Hz. Do you guys have another lead on 10Hz data?~

@unlikelyzero had a smart idea of just running the simulator.py 10 times, which gives us 10Hz. It works!

scottbell commented 1 year ago

@unlikelyzero says to run for an hour before shutting down.

akhenry commented 1 year ago

> ~@akhenry @unlikelyzero It looks like the YAMCS Quickstart only fires at 1Hz. Do you guys have another lead on 10Hz data?~
>
> @unlikelyzero had a smart idea of just running the simulator.py 10 times, which gives us 10Hz. It works!

I've for sure had luck simulating 10Hz data by modifying this line to be sleep(0.1)

scottbell commented 1 year ago

> > ~@akhenry @unlikelyzero It looks like the YAMCS Quickstart only fires at 1Hz. Do you guys have another lead on 10Hz data?~ @unlikelyzero had a smart idea of just running the simulator.py 10 times, which gives us 10Hz. It works!
>
> I've for sure had luck simulating 10Hz data by modifying this line to be sleep(0.1)

Ah, duh. I've done this before too and had forgotten there's a separate sleep above. Thanks for the info!

scottbell commented 1 year ago

Adding a 2s "digestion delay" to the WebSocket callback absolutely kills YAMCS: memory is pegged, and it stays pretty high even after the clients disconnect, though restarting YAMCS resolves it.
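A rough way to see why a digestion delay hurts so much, as a back-of-the-envelope sketch only. It assumes each telemetry sample is one WebSocket message and that a client handles messages strictly one at a time; the function name is hypothetical.

```javascript
// Back-of-the-envelope model: messages that pile up ahead of slow clients
// when each client handles one message every `digestionMs` milliseconds.
function backlogGrowthPerSecond({ clients, rateHz, digestionMs }) {
  const consumedPerSecond = 1000 / digestionMs;            // per client
  const surplus = Math.max(0, rateHz - consumedPerSecond); // never negative
  return clients * surplus;                                // messages/s overall
}

// 300 clients, 10 Hz telemetry, 2 s digestion delay (figures from this thread).
console.log(backlogGrowthPerSecond({ clients: 300, rateHz: 10, digestionMs: 2000 }));
// prints 2850
```

Under those assumptions each client consumes 0.5 messages per second while 10 arrive, so the unsent backlog grows for as long as the server is willing to buffer it.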

scottbell commented 1 year ago

If one comments out this:

      webSocket:
        writeBufferWaterMark: { low: 32768, high: 160000000 }

it causes a great deal of these messages:

16:05:21.647 _global [45] WebSocketServerMessageHandler Channel full, cannot write message with priority=NORMAL (slow network?). Closing connection.

but the CPU/memory consumption of YAMCS remains constant. So perhaps writeBufferWaterMark is tuned too high for the YAMCS server?

The defaults are:

{ low: 32768, high: 131072 }

Water marks for the write buffer of each WebSocket connection. When the buffer is full, messages are dropped. High values lead to increased memory use, but connections will be more resilient against unstable networks (i.e. high jitter). Increasing the values also helps if a large number of messages are generated in bursts. The map requires keys low and high indicating the low/high water mark in bytes.
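To put the two settings side by side, here is a small sketch of the approximate worst case if every connection's write buffer fills to its high water mark. The 300-connection figure comes from the issue summary; treating the high water mark as a per-connection ceiling is a simplification, since a buffer can overshoot it somewhat before writes stop. The function name is hypothetical.

```javascript
// Approximate upper bound on bytes held in WebSocket write buffers if every
// connection fills its buffer up to the high water mark.
function worstCaseBufferBytes(connections, highWaterMark) {
  return connections * highWaterMark;
}

const connections = 300; // client count from the issue summary
const configured = worstCaseBufferBytes(connections, 160000000); // current setting
const defaults = worstCaseBufferBytes(connections, 131072);      // Yamcs default

console.log((configured / 1e9).toFixed(1) + " GB"); // 48.0 GB
console.log((defaults / 1e6).toFixed(1) + " MB");   // 39.3 MB
```

With slow clients, the configured value lets buffered data grow by roughly three orders of magnitude more than the default before YAMCS starts dropping connections.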

akhenry commented 1 year ago

@scottbell Beautiful! These are great findings. The error message is a symptom of a self-defense mechanism against slow clients, which we are effectively disabling by using arbitrarily large write buffers. Open MCT can handle being dropped; it will just reconnect. Also, if we get dropped for being too slow, that's useful feedback that allows us to direct our optimization efforts. There's stuff we can do in Open MCT to make it process WebSocket messages quicker so the buffer doesn't back up, such as taking WebSocket handling off the UI thread.
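One possible shape for moving WebSocket handling off the UI thread, as a sketch only and not how Open MCT implements it. The function name `startTelemetryWorker` and the `{ event: "closed" }` notification are hypothetical; in a real Web Worker, `WebSocketImpl` would be the global `WebSocket` and `post` would be `self.postMessage`.

```javascript
// Hypothetical worker-side handler: it owns the socket and parses JSON off
// the UI thread, forwarding only parsed payloads. `WebSocketImpl` and `post`
// are injected so the sketch can run outside a browser.
function startTelemetryWorker(url, WebSocketImpl, post) {
  const socket = new WebSocketImpl(url);
  socket.onmessage = (event) => {
    let message;
    try {
      message = JSON.parse(event.data);
    } catch (err) {
      return; // ignore frames that are not JSON text
    }
    post(message);
  };
  // Tell the UI thread about the drop so it can decide when to reconnect.
  socket.onclose = () => post({ event: "closed" });
  return socket;
}
```

The point of the design is that the socket is read and parsed even while the UI thread is busy rendering, which should keep the server-side write buffer from backing up as quickly.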

akhenry commented 1 year ago

@scottbell @unlikelyzero Can we use K6 to load real Open MCT clients?

I think the next step is to reproduce this with real Open MCT clients so that we have a test bed for measuring Open MCT changes.

We will need a sufficiently complex Open MCT display with a bunch of plots, LAD Tables, alphanumerics, and condition sets / widgets. @charlesh88 has some scripting to automate building these I believe.

I think it's worth trying to build a real repro of this in Quickstart for a few reasons:

  1. We can build regression tests that run on our commercial CI environment and don't require NASA resources.
  2. We can provide reproductions to the Space Applications team if we identify Yamcs bottlenecks.
  3. We do not interrupt other development work on shared resources.
  4. Folks outside of NASA can potentially contribute to our performance optimization efforts.

scottbell commented 12 months ago

@unlikelyzero @akhenry

There's a K6 browser I think we could use to do this. From what I can tell, we'd need to:

It looks like Playwright can also do something similar in combination with Artillery, but I'm not familiar with it.

scottbell commented 12 months ago

@unlikelyzero @akhenry Fiddling with rather modest parameters for K6:

const maxClients = 40;
const workersPerClient = 5;
const digestionTimeInMs = 500;

On their own, these settings create a rather slow build-up in memory consumption. But if I stop the K6 process after 10 minutes and restart it, YAMCS never really gives up the fat WebSocket buffers from the previous run (or at least not quickly enough) and quickly runs out of memory.

scottbell commented 11 months ago

@akhenry @unlikelyzero I've added a browser test too. I'll let you know what I find testing it out on Open MCT Quickstart.