1tgr / rust-websocket-lite

A fast, low-overhead WebSocket client
MIT License

measuring performance against tungstenite-rs #302

Open tdudz opened 2 years ago

tdudz commented 2 years ago

I'm working on a project that uses tungstenite right now, and after profiling my code I noticed it's doing a lot of allocations, so I'm looking to replace it with something faster and more lightweight. I put together a quick benchmark of the tungstenite and websocket-lite clients reading a stream of JSON from a simple server and compared the results; the difference was not as large as I had hoped. Do you have any of your own benchmarks I can compare against? Or was my benchmark perhaps not representative of websocket-lite's actual performance?

[Screenshot: criterion output comparing the two clients' read latency, 2022-08-30]

Criterion code, which I ran with cargo bench:

use std::{
    thread::{sleep, spawn},
    time::Duration,
};

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tungstenite::connect;
use websocket_lite::ClientBuilder;
use ws_benchmark_experiment::{server, ADDR};

pub fn criterion_benchmark(c: &mut Criterion) {
    spawn(server);

    // Give the server thread a moment to bind before the clients connect.
    sleep(Duration::from_secs(1));

    let mut client_tungstenite = connect(format!("ws://{}", ADDR)).unwrap().0;
    let mut client_lite = ClientBuilder::new(&format!("ws://{}", ADDR))
        .unwrap()
        .connect()
        .unwrap();

    let mut group = c.benchmark_group("read latency");

    group.bench_function("tungstenite", |b| {
        b.iter(|| {
            let msg = client_tungstenite.read_message().unwrap();
            black_box(msg);
        })
    });
    group.bench_function("websocket_lite", |b| {
        b.iter(|| {
            let msg = client_lite.receive().unwrap();
            black_box(msg);
        })
    });
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The server code, from the ws_benchmark_experiment crate:

use std::{net::TcpListener, thread::spawn};

use tungstenite::{accept, Message};

pub static ADDR: &str = "127.0.0.1:9119";
pub static ITERATIONS: u32 = 10_000_000;
pub static MESSAGE_SIZE: usize = 100;

pub fn server() {
    let listener = TcpListener::bind(ADDR).unwrap();

    for stream in listener.incoming() {
        spawn(move || {
            let mut websocket = accept(stream.unwrap()).unwrap();
            let payload = Message::Text(
                r#"{ "foo": "bar", "baz": 123, "quux": false }"#.repeat(MESSAGE_SIZE),
            );

            // Flood the client with copies of the payload until it disconnects.
            loop {
                if websocket.write_message(payload.clone()).is_err() {
                    break;
                }
            }
        });
    }
}
1tgr commented 2 years ago

I think the websocket-lite benchmark should be ok. You should expect zero memory allocations inside the benchmark loop (that is, after the initial client connect). When developing the websocket clients I used https://github.com/KDE/heaptrack to verify this manually. (The idea is that every Message struct holds a reference-counted reference back to a single buffer owned by the codec.)
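
To illustrate the buffer-sharing idea, here is a minimal sketch using the bytes crate. The struct and method names are invented for the example, not websocket-lite's actual internals: a BytesMut read buffer can hand out reference-counted Bytes views without copying.

use bytes::{Bytes, BytesMut};

// Hypothetical codec that owns one reusable read buffer.
struct Codec {
    buf: BytesMut, // filled from the socket, reused across frames
}

impl Codec {
    // Hand back the first `len` bytes of the buffer as a message payload.
    fn take_payload(&mut self, len: usize) -> Bytes {
        // split_to moves the front `len` bytes out without copying, and
        // freeze() turns them into a reference-counted Bytes view, so the
        // per-message cost is a pointer bump rather than a malloc.
        self.buf.split_to(len).freeze()
    }
}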

I'll try to find time to run the benchmark through heaptrack and check for any rogue memory allocations.

smabie commented 2 years ago

I guess the bigger point is: if the performance of tungstenite and websocket-lite isn't significantly different, then what's the point of this project?

Perhaps something is off in the benchmark?

1tgr commented 2 years ago

This is an interesting result, thanks for putting together the benchmark.

The initial focus of websocket-lite was to have zero memory allocations after initialisation. Allocations were inherent in the tungstenite design at the time (you have to call into_text or into_data, which consume the Message), and malloc and free still appear in the tungstenite benchmark, whereas they do not appear when calling websocket-lite.
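
For illustration, the two read paths look roughly like this. This is a sketch: the method names are from each crate's 2022-era API, but the concrete client type parameters are assumptions and won't match connect()'s return types exactly.

use std::net::TcpStream;

use tungstenite::WebSocket;
use websocket_lite::Client;

// Read one message from each client and inspect the text payload.
fn read_once(ws: &mut WebSocket<TcpStream>, lite: &mut Client<TcpStream>) {
    // tungstenite: the frame is parsed into an owned String held by the
    // Message, so every text message implies an allocation; into_text()
    // consumes the Message and hands that String back.
    let owned: String = ws.read_message().unwrap().into_text().unwrap();
    assert!(!owned.is_empty());

    // websocket-lite: the payload stays in the codec's shared buffer and
    // as_text() borrows it as &str, so this path can stay allocation-free.
    if let Some(msg) = lite.receive().unwrap() {
        if let Some(text) = msg.as_text() {
            assert!(!text.is_empty());
        }
    }
}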

Even without the overhead of memory allocations, parsing of messages in websocket-lite seems a little slower. I'll see whether the two libraries are doing anything differently.

1tgr commented 2 years ago

I think the timings are close because they're dominated by the network itself, even on localhost; that is, the recv() call shows up in the profiler as the hottest function.

In terms of elapsed time, the two libraries are close for me, with websocket-lite marginally faster when you include the recv() call (which is what the criterion output shows).

With recv() removed, my profiler shows websocket-lite being twice as fast as tungstenite. Apologies for not sharing the data: at home I'm using macOS Instruments.app, and I don't think I have the data in a form I can share easily.
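
One way to take the network out of the measurement entirely is to benchmark the codec against an in-memory buffer. A rough sketch, assuming the websocket-codec crate's MessageCodec with tokio-util's Decoder/Encoder traits; the client()/server() constructor names are my recollection and may differ between versions:

use bytes::BytesMut;
use tokio_util::codec::{Decoder, Encoder};
use websocket_codec::{Message, MessageCodec};

fn decode_one() {
    let payload = r#"{ "foo": "bar", "baz": 123, "quux": false }"#.repeat(100);

    // Pre-encode one server-to-client frame outside the timed section.
    let mut encoded = BytesMut::new();
    MessageCodec::server()
        .encode(Message::text(payload), &mut encoded)
        .unwrap();

    let mut codec = MessageCodec::client();
    let mut buf = BytesMut::new();

    // The part to put inside b.iter(): refill the buffer and decode a
    // single frame, with no recv() call in sight.
    buf.extend_from_slice(&encoded);
    let msg = codec.decode(&mut buf).unwrap().unwrap();
    assert!(msg.as_text().is_some());
}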

In terms of context: the aim of websocket-lite is to give you a parsed message as quickly as possible, with predictable timings that are consistent from one message to the next. That is, if we measure the interval between the arrival of the last byte of the frame and the return of the client_lite.receive() call, the library aims to minimise both the mean and the variance of this interval.
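
As a rough way to look at both statistics at once, something like the sketch below could sample receive() end to end. Note this is a proxy: on localhost it includes the socket read, so it overstates the parse-only interval, and the type bounds on the client are assumptions.

use std::io::{Read, Write};
use std::time::Instant;

use websocket_lite::Client;

// Sample per-message receive() latency and report (mean, variance) in seconds.
fn latency_stats<S: Read + Write>(client: &mut Client<S>, n: usize) -> (f64, f64) {
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        let start = Instant::now();
        let _ = client.receive().unwrap();
        samples.push(start.elapsed().as_secs_f64());
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    let variance = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n as f64;
    (mean, variance)
}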