LeonHartley / Coerce-rs

Actor runtime and distributed systems framework for Rust

Coerce performance #15

Open pranaypratyush opened 1 year ago

pranaypratyush commented 1 year ago

I ran the benchmarks provided in the coerce crate on my 5950X and got this:

    Running benches/actor_creation.rs (/home/pranay/scratch/wd/Coerce-rs/target/release/deps/actor_creation-27b48128a604a745)

running 1 test
test create_1000_actors ... bench:   6,488,134 ns/iter (+/- 3,843,611)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

     Running benches/actor_messaging.rs (/home/pranay/scratch/wd/Coerce-rs/target/release/deps/actor_messaging-0aba58e833c97004)

running 2 tests
test actor_notify_1000_benchmark ... bench:     151,532 ns/iter (+/- 136,179)
test actor_send_1000_benchmark   ... bench:   4,892,811 ns/iter (+/- 3,425,463)

I'm quite surprised that it takes so much time. I am trying to build a social network where each post can be an orderbook, so there will be a lot of orderbooks. I liked Coerce's API compared to something like Bastion's, but this benchmark surprised me. Are these representative of the latencies I'd see in the final web server, or is this just happening because we are awaiting the messages one after the other on a multi-threaded runtime, with Tokio wasting time in the scheduler doing nothing useful?
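If I'm reading the benchmark code right, the two messaging benchmarks exercise roughly the following shapes (a sketch, not the repository's exact code; `actor` is a coerce LocalActorRef and `Msg` a unit message, as in the code later in this thread), which would also explain why notify comes out so much cheaper than send here:

// Rough sketch of the two patterns, assuming a spawned LocalActorRef `actor`
// and a unit message type `Msg` (see the benchmark code later in this thread).

// actor_send_1000: each send awaits the handler's reply before the next one,
// so the loop is a fully sequential request/response round trip, 1,000 times.
for _ in 0..1000 {
    actor.send(Msg).await.unwrap();
}

// actor_notify_1000: notify only enqueues the message and returns immediately,
// so nothing is awaited per message beyond pushing it onto the mailbox.
for _ in 0..1000 {
    actor.notify(Msg).unwrap();
}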

pranaypratyush commented 1 year ago

I made naive benchmarks to compare performance with Bastion over here: https://github.com/pranaypratyush/actor_bench_test

Currently, I am getting the following

Benchmarking Bastion/actor_creation: Collecting 100 samples in estimated 5.0053 s (975k iterations)
Bastion/actor_creation  time:   [5.0795 µs 5.2497 µs 5.5823 µs]
                        change: [+0.1855% +2.8083% +5.4573%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking Bastion/send_message: Collecting 100 samples in estimated 5.0068 s (2.7M iterations)
Bastion/send_message    time:   [1.8115 µs 1.8151 µs 1.8199 µs]
                        change: [+5.2196% +5.9013% +6.7632%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

     Running benches/coerce_benches.rs (target/release/deps/coerce_benches-0a223657393e94b9)
Benchmarking actor_send_1000: Collecting 100 samples in estimated 5.1943 s (1700 iterations)
actor_send_1000         time:   [2.9886 ms 3.0266 ms 3.0693 ms]
                        change: [-50.216% -49.388% -48.563%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  6 (6.00%) high mild
  12 (12.00%) high severe

Benchmarking actor_notify_1000: Collecting 100 samples in estimated 5.2457 s (30k iterations)
actor_notify_1000       time:   [229.26 µs 249.90 µs 266.92 µs]
                        change: [-19.567% -12.271% -4.6701%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking create_1000_actors: Collecting 100 samples in estimated 5.1056 s (1000 iterations)
create_1000_actors      time:   [5.0030 ms 5.1304 ms 5.2710 ms]
                        change: [+2.2920% +4.8187% +7.9480%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Please note that Bastion is doing 1 iteration whereas Coerce is doing 1000. Bastion uses far more memory while doing much less work and doesn't spread the load across all cores, while Coerce looks memory-efficient in comparison (I'm not sure by how much, or whether that comparison is even sensible) and spreads the load somewhat evenly. Also note that this benchmark is probably flawed.

LeonHartley commented 1 year ago

Hi @pranaypratyush,

The benchmarks included in this repository are in no way indicative of real-world performance, and were added only as a quick and dirty way to detect performance regressions with the Coerce library itself.

The framework (and the actor model as a whole) shines when you have many actors working concurrently, rather than just one actor sending and receiving sequentially. I'll look at adding some better performance benchmarks soon that give a clearer picture of how Coerce will perform in the real world.
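To make that concrete, the shape I mean is closer to the following rough sketch (illustration only, reusing the ActorSystem / BenchmarkActor / Msg setup from the benchmark code quoted later in this thread), with many actors being driven in parallel across the Tokio worker threads rather than one actor being awaited in a loop:

// Illustration only: assumes the same ActorSystem / BenchmarkActor / Msg types
// used by the benchmark code quoted further down in this thread.
let system = ActorSystem::new();

// Spawn many independent actors instead of a single one.
// (Assumes String also implements IntoActorId, like the &str id used below.)
let mut actors = Vec::with_capacity(1000);
for i in 0..1000 {
    let actor = system
        .new_actor(format!("actor-{}", i).into_actor_id(), BenchmarkActor, Anonymous)
        .await
        .expect("unable to create actor");
    actors.push(actor);
}

// Message all of the actors concurrently; the handlers run in parallel across
// the runtime's worker threads instead of one sequential request/response loop.
futures::future::join_all(actors.iter().map(|actor| actor.send(Msg))).await;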

Thanks a lot!

LeonHartley commented 1 year ago

Sorry, didn't mean to close the issue!

pranaypratyush commented 1 year ago

Yes, I am aware that the simple benchmarks you added are too naive to represent anything useful on their own and are merely there to help you catch obvious performance regressions. My benchmarks are naive as well, but I will keep working on them; it helps me learn. I just added the following to my repo:

fn actor_send_receive_on_current_thread_1000_benchmark(c: &mut Criterion) {
    // let runtime = rt();

    c.bench_function("actor_send_receive_on_current_thread_1000", |b| {
        // Build a LocalSet, spawn the 1,000-send task onto it, then await the LocalSet.
        b.iter(|| async {
            let local = tokio::task::LocalSet::new();

            let send_receive_1000 = async move {
                let actor = actor().await;

                for _ in 0..1000 {
                    actor.send(Msg).await.unwrap();
                }
            };

            local.spawn_local(send_receive_1000);
            local.await;
        });
    });
}

async fn actor() -> LocalActorRef<BenchmarkActor> {
    let system = ActorSystem::new();
    system
        .new_actor("actor".into_actor_id(), BenchmarkActor, Anonymous)
        .await
        .expect("unable to create actor")
}

And I get this for that bench:

actor_send_receive_on_current_thread_1000
                        time:   [3.3284 ns 3.3297 ns 3.3312 ns]
                        change: [-97.464% -97.461% -97.458%] (p = 0.00 < 0.05)

Maybe we could add some thread-local support to Coerce? Or perhaps some better examples of how to systematically use it for hot paths in a real project?
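One caveat I noticed afterwards: `Bencher::iter` here is the synchronous variant, so the closure only constructs the async block and never awaits it, which would explain the ~3 ns figure. A variant that actually drives the sends could look roughly like this (sketch only, assuming Criterion's async_tokio feature and a current-thread Tokio runtime; the function name is just a placeholder):

// Sketch only: requires criterion's "async_tokio" feature so that to_async()
// drives the returned future to completion on every iteration.
fn actor_send_receive_awaited_1000_benchmark(c: &mut Criterion) {
    // Single-threaded runtime, so everything stays on the current thread.
    let runtime = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .expect("unable to build runtime");

    c.bench_function("actor_send_receive_awaited_1000", |b| {
        b.to_async(&runtime).iter(|| async {
            let actor = actor().await;
            for _ in 0..1000 {
                actor.send(Msg).await.unwrap();
            }
        });
    });
}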

pranaypratyush commented 1 year ago

send_zst/1              time:   [2.3800 µs 2.4019 µs 2.4330 µs]
                        change: [+1.1173% +3.3656% +6.4253%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
send_zst/10             time:   [3.3274 µs 3.3344 µs 3.3418 µs]
                        change: [-26.455% -20.847% -14.633%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
send_zst/100            time:   [56.708 µs 56.788 µs 56.872 µs]
                        change: [-3.1032% -2.9136% -2.7247%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
send_zst/1000           time:   [476.38 µs 476.82 µs 477.27 µs]
                        change: [-12.620% -12.406% -12.209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

Above are the benchmarks from xtra. It also happens to be ridiculously memory-efficient.