andrewgazelka / hyperion

Getting 10k players to PvP at once on a Minecraft server to break the Guinness World Record.
Apache License 2.0

RFC: Architecture #423

Open andrewgazelka opened 2 weeks ago

andrewgazelka commented 2 weeks ago

[!NOTE]
While pointing out problems in architecture is useful, that alone will not allow us to change architecture. Progress can only be made if we find solutions that have fewer trade-offs than what we are currently doing.

Related #350

We are at a point where I think it makes sense to discuss the architecture of the project. Some big points:

Let me know if you can think of any more questions.

Current problems

Right now we are using evenio

No system-level parallelism

CleanShot 2024-05-13 at 11 10 01@2x

This is an event to change the player skin. It is rather expensive, does not occur often, and does not usually occur multiple times within a tick. I could see that if there are many distinct events similar to this, there could be a big perf hit, since the systems cannot be parallelised.

Note: the signature is

#[derive(Query)]
pub(crate) struct SetPlayerSkinQuery<'a> {
    packets: &'a mut Packets,
    uuid: &'a Uuid,
    username: &'a InGameName,
}

#[instrument(skip_all, level = "trace")]
pub fn set_player_skin(r: ReceiverMut<SetPlayerSkin, SetPlayerSkinQuery>, compose: Compose) {

This only acts upon the targeted entity. This seems like it would be a great case for parallelization. Note: Compose is defined as

#[derive(HandlerParam, Copy, Clone)]
pub struct Compose<'a> {
    pub compressor: Single<'a, &'static Compressors>,
    pub scratch: Single<'a, &'static Scratches>,
    pub global: Single<'a, &'static Global>,
}

Entity-level parallelism often requires not having targeted events

Look at all these calls to the compass system

CleanShot 2024-05-13 at 11 17 41@2x

We have

#[derive(Event)]
pub struct PointCompass {
    #[event(target)]
    pub target: EntityId,
    pub point_to: BlockPos,
}
#[instrument(skip_all, level = "trace")]
pub fn compass(r: Receiver<event::PointCompass, &mut Packets>, compose: Compose) {
    let event = r.event;

    let packets = r.query;

    let pkt = play::PlayerSpawnPositionS2c {
        position: event.point_to,
        angle: 0.0,
    };

    packets.append(&pkt, &compose).unwrap();
}

Again, because there is no system-level parallelism, we run into an issue where we would need to bulk these to perform them in parallel. This is unfortunate because targeted events could potentially be a very big performance improvement for being able to parallelize things.

The new performant event would be

pub struct PointCompass {
    pub target: EntityId,
    pub point_to: BlockPos,
}

#[derive(Event)]
pub struct PointCompassBulk(Vec<PointCompass>);

So we can call .par_iter_mut() on all of the PointCompass elements. I feel there must be a better way than this.
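
For instance, a minimal sketch of what handling the bulk event with rayon could look like, reusing the PointCompass and PointCompassBulk types above (handle_point_compass_bulk is a made-up name, and the closure body is a placeholder; safely getting &mut Packets for each target in parallel is exactly the open question):

use rayon::prelude::*;

fn handle_point_compass_bulk(bulk: &mut PointCompassBulk) {
    // Iterating over the event data itself in parallel is straightforward;
    // the hard part is looking up each target's Packets without aliasing.
    bulk.0.par_iter_mut().for_each(|event| {
        let _ = (&event.target, &event.point_to);
    });
}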

Because of the previous section and this one, we would ideally want events that are meant to be easily sendable by consumers of the API to not have to be bulk; but when they are not, we get something like

CleanShot 2024-05-13 at 11 21 57@2x

It is very clear where the events sent by the consumer of hyperion are.

Possible alternatives

TestingPlant commented 2 weeks ago

If we want to avoid ECS, we could use regular Rust structs and impl methods on them. This seems to be what Bamboo, another Minecraft server implementation, does to handle entities. This also gives us more control over everything.
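
For illustration, a minimal sketch of that plain-struct style (the types and methods here are invented for the example, not Bamboo's or hyperion's actual ones):

struct Player {
    username: String,
    health: f32,
    outgoing: Vec<u8>, // pending s2c bytes for this player
}

impl Player {
    fn set_health(&mut self, health: f32) {
        self.health = health;
        // stand-in for encoding the real HealthUpdateS2c packet straight
        // into this player's buffer: no events or systems involved
        self.outgoing.extend_from_slice(&health.to_be_bytes());
    }
}

struct Server {
    players: Vec<Player>,
}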

Ruben2424 commented 2 weeks ago

I have almost no experience in high-performance stuff, so I cannot say much about this.

I never used an ECS before, but when working on the Inventory I really enjoyed working with it. It feels very intuitive.

Maybe it is useful to use an ECS with system-level parallelism (or to extend evenio). If possible we could use a mix: something that allows us to add handlers which are not run in parallel if we don't want to pay the cost of synchronisation or if the order of execution matters, and still be able to add handlers where the order of execution does not matter and which have the required Send/Sync bounds to be executed in parallel.

But if there is a better solution than ECS, I think we should try it out.

rj00a commented 2 weeks ago

Regarding the PointCompassBulk event: Just FYI, that looks unsound as written if you're exposing it to users, because duplicate EntityIds would lead to aliasing of the &mut Packets component when you go to iterate over it in parallel (you could use a hash map, of course). For a similar reason, I don't see a hypothetical automatic system-level parallelism feature being what you want in this particular case: the overhead of synchronizing individual calls to compass would get in your way.
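
To make the hash-map aside concrete, here is a rough sketch (with simplified stand-in types, not hyperion's real ones) of deduplicating targets before doing any parallel work, so each EntityId is only touched by one rayon task:

use std::collections::HashMap;
use rayon::prelude::*;

// Simplified stand-ins for the real EntityId / BlockPos / PointCompass types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct EntityId(u64);
#[derive(Clone, Copy)]
struct BlockPos(i32, i32, i32);
struct PointCompass {
    target: EntityId,
    point_to: BlockPos,
}

fn apply_deduplicated(events: Vec<PointCompass>) {
    // Keep only the latest event per target; duplicate EntityIds collapse here.
    let latest: HashMap<EntityId, BlockPos> = events
        .into_iter()
        .map(|event| (event.target, event.point_to))
        .collect();

    // Every key is unique, so whatever per-player state is fetched in this
    // closure is only ever touched by a single task.
    latest.par_iter().for_each(|(target, point_to)| {
        let _ = (target, point_to);
    });
}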

I also want to point out that there are some ways to make "bulk" events more palatable to your API consumers. Here's one approach that comes almost directly from bevy_ecs:

use evenio::prelude::*;

#[derive(HandlerParam)]
struct EventReader<'a, E: 'static> {
    events: Single<'a, &'static mut Events<E>>,
}

impl<'a, E> EventReader<'a, E> {
    pub fn iter(&self) -> impl Iterator<Item = &E> {
        self.events.events.iter()
    }

    pub fn iter_mut(&mut self) -> impl Iterator<Item = &mut E> {
        self.events.events.iter_mut()
    }

    pub fn drain(&mut self) -> impl Iterator<Item = E> + '_ {
        self.events.events.drain(..)
    }
}

#[derive(HandlerParam)]
struct EventWriter<'a, E: 'static> {
    events: Single<'a, &'static mut Events<E>>,
}

impl<'a, E> EventWriter<'a, E> {
    pub fn push(&mut self, event: E) {
        self.events.events.push(event);
    }
}

pub fn init_event<E: 'static>(world: &mut World) {
    // TODO: check that `Events<E>` doesn't already exist. (waiting for #57)
    // Could probably do this automatically in the `HandlerParam` impl of `EventReader` and `EventWriter`.
    let e = world.spawn();
    world.insert(e, Events::<E> { events: vec![] });
}

#[derive(Component)]
struct Events<E> {
    events: Vec<E>,
}

Now users can add an event writer to their system like:

fn my_handler(_: Receiver<Foo>, mut w: EventWriter<PointCompass>) {
    w.push(PointCompass { ... });
}

and you're free to drain the PointCompass events at a later time. Maybe even in parallel with other work via rayon::join.
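
For example, a rough sketch of that shape (handle_compasses and do_other_work are hypothetical names; the Vec would come from EventReader::drain above):

fn handle_compasses(drained: Vec<PointCompass>) {
    rayon::join(
        || {
            // turn the drained compass events into packets on one worker...
            for event in &drained {
                let _ = event;
            }
        },
        || {
            // ...while unrelated per-tick work runs on the other
            do_other_work();
        },
    );
}

fn do_other_work() {}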

andrewgazelka commented 2 weeks ago

@rj00a thanks. I like the idea for making bulk events easier. I might do more of this.

Regarding parallelizable bulk targeted events, yes, you are right that there could be aliasing. However, I think there are ways to potentially make this safe.

Concrete idea

Perhaps rayon allows your par_iter to specify which thread a particular element runs on? If each entity were always handled by the same thread, I think this could be extremely helpful, and there would not really be any synchronization issues that I can think of.

Further thoughts

The idea mentioned in Concrete idea would probably require either

  1. one thread to produce elements and send them to the other threads, or
  2. each thread to iterate over all entity ids and apply the function.

I am guessing 2 is probably faster, but we would have to benchmark it.
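
A rough sketch of what option 2 could look like, under the assumption that each rayon task "owns" the entities whose id falls in its shard (the types, the modulo ownership rule, and apply_sharded are all made up for illustration):

use rayon::prelude::*;

// Simplified stand-ins for hyperion's real types.
struct PointCompass {
    target_id: usize,
    point_to: (i32, i32, i32),
}

fn apply_sharded(events: &[PointCompass], shards: usize) {
    (0..shards).into_par_iter().for_each(|shard| {
        for event in events {
            // Only the shard that owns this entity applies the event, so each
            // player's packet buffer is only ever written by one task and no
            // synchronization is needed.
            if event.target_id % shards == shard {
                // look up this player's Packets and append the
                // PlayerSpawnPositionS2c packet here
                let _ = &event.point_to;
            }
        }
    });
}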

TestingPlant commented 2 weeks ago

How should we structure egress? In my egress rewrite, I currently iterate through each player and call a list of functions to append s2c packets and then send them for each player. Should this use ECS? If each function generating an s2c packet was a system using a Vec for bulk events, it would only be able to start sending data once each player's packets have been encoded. I still need to do benchmarks to see how long encoding takes and how much of a problem this would be, but I can see it being a potential performance problem.

andrewgazelka commented 2 weeks ago

How should we structure egress? In my egress rewrite, I currently iterate through each player and call a list of functions to append s2c packets and then send them for each player. Should this use ECS? If each function generating an s2c packet was a system using a Vec for bulk events, it would only be able to start sending data once each player's packets have been encoded. I still need to do benchmarks to see how long encoding takes and how much of a problem this would be, but I can see it being a potential performance problem.

can you link to your rewrite? I am having trouble finding it.

TestingPlant commented 2 weeks ago

https://github.com/TestingPlant/hyperion/tree/restructure-net

The main changes are in egress.rs and the egress directory.

andrewgazelka commented 2 weeks ago

TestingPlant/hyperion@restructure-net

The main changes are in egress.rs and the egress directory.

do you mind making a draft PR so I can more easily look over the diff?

andrewgazelka commented 2 weeks ago

How should we structure egress? In my egress rewrite, I currently iterate through each player and call a list of functions to append s2c packets and then send them for each player. Should this use ECS? If each function generating an s2c packet was a system using a Vec for bulk events, it would only be able to start sending data once each player's packets have been encoded. I still need to do benchmarks to see how long encoding takes and how much of a problem this would be, but I can see it being a potential performance problem.

I don't think egress necessarily needs to be in ECS or needs to use ECS, but I do believe that the rest of the code should still utilize ECS.

So, what are you planning to do for other systems that append packets? How would that work?

TestingPlant commented 2 weeks ago

Broadcast packets would work like they currently are right now, and other systems would be moved into the egress directory to generate s2c packets.

andrewgazelka commented 2 weeks ago

Broadcast packets would work like they currently are right now, and other systems would be moved into the egress directory to generate s2c packets.

Can you give me an example of how

#[derive(Event)]
pub struct PointCompass {
    #[event(target)]
    pub target: EntityId,
    pub point_to: BlockPos,
}

would be handled?

TestingPlant commented 2 weeks ago

I've done some more benchmarks:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use valence_protocol::{Encode, VarInt};
use std::io::Write;

struct BlackBoxWrite;

impl Write for BlackBoxWrite {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        black_box(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

pub fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("encode HealthUpdateS2c", |b| b.iter(|| black_box(valence_protocol::packets::play::HealthUpdateS2c {
        health: 0.0,
        food: VarInt(0),
        food_saturation: 0.0
    }).encode(BlackBoxWrite).unwrap()));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

showing that encoding a simple packet takes about 5.8ns on my machine. This means that we can encode about 170k simple packets per millisecond, so I think doing egress packet generation on one thread is fine.

With the current setup, the compass would be handled by adding a module in the egress directory which checks if the player is a zombie and, if so, generates a packet to point the compass to the nearest player. All of the data about the player and the rest of the server would need to be passed to each function manually, which is the main disadvantage of not using ECS for egress while using ECS everywhere else.

I believe we should change packet encoding to use ECS now that we know packet encoding is cheap, and send an event with the output buffer and player entity id, but I'd like to hear other people's feedback as well before doing this.

andrewgazelka commented 2 weeks ago

I've done some more benchmarks showing that encoding a simple packet takes about 5.8ns on my machine. This means that we can encode about 170k simple packets per millisecond, so I think doing egress packet generation on one thread is fine.

I believe we should change packet encoding to use ECS now that we know packet encoding is cheap, and send an event with the output buffer and player entity id, but I'd like to hear other people's feedback as well before doing this.

There are a couple of things I want to discuss.

First of all, even though encoding a simple packet takes 5.8 nanoseconds on your machine, encoding a packet is just one part of generating packets. You need to consider the actual logic. When it comes down to it, you'll probably want it to be multi-threaded. If you're encoding a packet and the logic can be parallelized, you'll likely want to do that encoding in parallel. Moreover, you don't want locks.

I think we should stick with the Entity-Component-System (ECS). However, sending an event for every single output for every single player is not efficient; there is a non-negligible amount of time that goes into sending an event and getting it flushed. We should design everything to be parallelizable right now. From my personal tests, even when I think something doesn't need to be parallelized, it still ends up taking a significant portion of the game tick. This makes sense: if I have 24 cores and something could run 24 times faster, leaving it single-threaded makes it effectively 24 times slower, which matters even if it looks fast in isolation.

For now, ECS makes the most sense, especially from an extensibility standpoint. Although some have mentioned that ECS shouldn’t be used for extensibility, I highly disagree, especially for Rust. I think it works very well. I can present cases where I can achieve things in very few lines of code. Given Rust does not use Java-like OOP inheritance, I can't think of a better way to do this than ECS.

TestingPlant commented 2 weeks ago

I added a multithreaded benchmark:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rayon::iter::{IntoParallelIterator, ParallelIterator};
use std::io::Write;
use valence_protocol::{Encode, VarInt};

struct BlackBoxWrite;

impl Write for BlackBoxWrite {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        black_box(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

struct Player {
    health: f32,
    food: i32,
    food_saturation: f32,
}

pub fn criterion_benchmark(c: &mut Criterion) {
    let players = (0..10_000)
        .map(|n| Player {
            health: n as f32,
            food: n,
            food_saturation: n as f32,
        })
        .collect::<Vec<_>>();
    c.bench_function("encode HealthUpdateS2c multi threaded", |b| {
        b.iter(|| {
            black_box(&players).into_par_iter().for_each(|player| {
                valence_protocol::packets::play::HealthUpdateS2c {
                    health: player.health,
                    food: VarInt(player.food),
                    food_saturation: player.food_saturation,
                }
                .encode(BlackBoxWrite)
                .unwrap()
            })
        })
    });
    c.bench_function("encode HealthUpdateS2c single threaded", |b| {
        b.iter(|| {
            black_box(&players).iter().for_each(|player| {
                valence_protocol::packets::play::HealthUpdateS2c {
                    health: player.health,
                    food: VarInt(player.food),
                    food_saturation: player.food_saturation,
                }
                .encode(BlackBoxWrite)
                .unwrap()
            })
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

and got

encode HealthUpdateS2c multi threaded
                        time:   [42.568 µs 42.962 µs 43.402 µs]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

encode HealthUpdateS2c single threaded
                        time:   [57.296 µs 57.493 µs 57.765 µs]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

I originally thought that the overhead of communicating between threads would cost more, but it seems like keeping it multithreaded would be good for our use case. Do you think the current structure on my fork of having packet generation in the egress directory is okay?

andrewgazelka commented 1 week ago

I originally thought that the overhead of communicating between threads would cost more, but it seems like keeping it multithreaded would be good for our use case. Do you think the current structure on my fork of having packet generation in the egress directory is okay?

With the current setup, the compass would be handled by adding a module in the egress directory which checks if the player is a zombie and, if so, generates a packet to point the compass to the nearest player. All of the data about the player and the rest of the server would need to be passed to each function manually, which is the main disadvantage of not using ECS for egress while using ECS everywhere else.

I'm still not a big fan of this. I don't understand why packets need to be generated in the egress directory. Can you go over that again?

The whole point is to make a base that people can easily extend. It won't be extendable if you need to add the logic directly into Hyperion, which is what I understand needs to be done now.

I think we should make a custom handler parameter or something similar so that we can handle things natively from ECS. This way, it remains extendable.

TestingPlant commented 1 week ago

I'm still not a big fan of this. I don't understand why packets need to be generated in the egress directory. Can you go over that again?

The current egress code does the following:

  1. for each player, do the following:
     a. call the packet generator function in each module in the egress directory, passing the buffer to append encoded packets to and any needed player and server data
     b. if the buffer runs out of space, submit write events, wait for them to finish, and then re-encode packets with 1a; this isn't wasting time because sending data over TCP seems to be CPU bound in the kernel
     c. create a write event
  2. submit data

This can be multithreaded by splitting the registered buffer across multiple threads, so each thread can still append to a buffer without locks.
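
As a rough sketch of that splitting (the chunk sizing, the Player placeholder, and fill_registered_buffer are all made up for illustration), each rayon task could get its own disjoint slice of the one big registered buffer:

use rayon::prelude::*;

struct Player; // placeholder for whatever per-player data the generators need

fn fill_registered_buffer(buffer: &mut [u8], player_groups: &[Vec<Player>]) {
    // Give each group of players its own disjoint slice of the registered
    // buffer, so every task appends encoded packets without any locks.
    let chunk_len = (buffer.len() / player_groups.len().max(1)).max(1);
    buffer
        .par_chunks_mut(chunk_len)
        .zip(player_groups)
        .for_each(|(slice, players)| {
            let mut remaining = &mut slice[..];
            for _player in players {
                // encode this player's s2c packets into `remaining` here,
                // advancing it as bytes are written; if it runs out of space,
                // fall back to step 1b above
                let _ = &mut remaining;
            }
        });
}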

There are a few other ways to do this with ECS.

Option 1: Replace step 1a above with sending an event for each player to generate s2c packets. The event will contain the buffer to append s2c packets to and the player id. This forces it to be single-threaded.

Option 2: Use per-player buffers and then send a single event to generate every player's packets. This requires each player to have a buffer sized for the largest amount of data that can be sent per tick, which is not ideal in terms of memory usage and cache.

Feel free to suggest other options here.

The whole point is to make a base that people can easily extend. It won't be extendable if you need to add the logic directly into Hyperion, which is what I understand needs to be done now.

The current system could still be extendable by storing a Vec of fn pointers and passing data about the player and the world to each function. However, this wouldn't allow programs using Hyperion to access custom data stored through ECS. Alternatively, we could accept IntoHandlers directly and call them manually, but I'd need to figure out how this would work first.
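
For reference, a minimal sketch of the Vec-of-fn-pointers idea (PlayerCtx, WorldCtx, and EgressRegistry are made-up placeholder names):

// Placeholder context types; the real arguments would be whatever player and
// world data the generators need.
struct PlayerCtx;
struct WorldCtx;

type PacketGenerator = fn(&mut Vec<u8>, &PlayerCtx, &WorldCtx);

struct EgressRegistry {
    generators: Vec<PacketGenerator>,
}

impl EgressRegistry {
    fn generate_for_player(&self, buf: &mut Vec<u8>, player: &PlayerCtx, world: &WorldCtx) {
        // Each registered function appends its s2c packets to the buffer.
        for generate in &self.generators {
            generate(buf, player, world);
        }
    }
}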

andrewgazelka commented 1 week ago

Option 1: Replace step 1a above with sending an event for each player to generate s2c packets. The event will contain the buffer to append s2c packets to and the player id. This forces it to be single-threaded.

Okay, I think I'm understanding what you're doing more now. From what I gather, correct me if I'm wrong, the reason you're doing this is to use less space. By filling up the buffer, waiting for the events to finish, and then putting more data into the buffer, we avoid having one giant buffer where everything is dumped all at once. Is this correct?

I'm building a system to handle events in parallel right now. It is currently fluctuating quite a bit.

https://github.com/andrewgazelka/targeted-bulk/blob/06513aaeaedc6b2a9af234bae5e7513ea7c46554/tests/minecraft.rs#L51-L111