Command execution order does not always match system execution order in the case of ambiguous systems

RedMindZ commented 10 months ago

Bevy version

0.11.2

What you did

I have two systems that take a mutable reference to a resource that represents a document. The document is essentially a list of spawnable mesh + material pairs. The code looks something like this:

#[derive(Resource)]
struct MainDocument(Vec<Art>);

struct Art {
    entity: Option<Entity>,
    mesh: Handle<Mesh>,
    material: Handle<StandardMaterial>,
}

fn show_art_system(mut commands: Commands, document: ResMut<MainDocument>) {
    // make the art visible using `commands` by spawning entities with the MaterialMeshBundle and store the entity id in the art
}

fn hide_art_system(mut commands: Commands, document: ResMut<MainDocument>) {
    // hide the art using `commands` to despawn the entities stored in the art
}

What went wrong

Sometimes, the show_art_system will panic with the following error: error[B0003]: Could not insert a bundle (of type (bevy_pbr::bundle::MaterialMeshBundle<bevy_pbr::pbr_material::StandardMaterial>, bevy_render::view::visibility::NoFrustumCulling)) for entity 19v0 because it doesn't exist in this World.

In my setup, the show_art_system always runs before the hide_art_system, and the hide_art_system always checks if the Art has an entity to despawn before calling despawn. When looking at my logs, this is also what I see: the art gets spawned first, and then despawned later.

This panic only happens when the art is spawned and despawned in the same frame. After a lot of digging, what I discovered is that show_art_system and hide_art_system receive different Command buffers, and that even though these systems should be synchronized since they request mutable access to the same resource, sometimes (somewhat rarely) their command buffers will be applied in reverse order relative to the order in which the systems ran. This leads to the entity being spawned, then the Despawn command is applied to it, and then the Insert command is applied to it, leading to a panic since the entity was already despawned.

I am not sure if this is a bevy bug or intended behavior. If this is intended, I would appreciate some direction on how to structure my code to avoid this panic.

Additional information

Sadly, I don't have a simple, consistent way to reproduce this, so I don't have a minimal code example that reproduces the issue. My actual code that causes this behavior only panics in about ~10% of runs, so it would probably be a bad example, since it also contains a lot of other noise.

jancespivo commented 10 months ago

Hi, you are right commands are run asynchronously. It is somewhat described in unofficial cheatbook https://bevy-cheatbook.github.io/programming/commands.html but there were changes in ECS in 0.10 (https://bevyengine.org/news/bevy-0-10/#ecs-schedule-v3) and AFAIK there are no stages anymore. Concurrency issues can be annoying and can bite literally anyone so IMO it is the important topic we should focus.

Could you provide a minimal example to reproduce the bug (it is valuable even if it crashes only in 10%)? :pray:

RedMindZ commented 10 months ago

I'll try to create a minimal example to reproduce the bug.

alice-i-cecile commented 10 months ago

Commands are definitely supposed to run in system order: if they do not, that is a bug in Bevy.

RedMindZ commented 10 months ago

I finally have a minimal example. I had to dig deeper into bevy to create the example and I think I found the bug: The multi-threaded executer that runs the user systems receives the systems to run as a slice. The order of the systems in the slice determines the order in which their commands will be applied. So if the slice contains the following systems: [auxiliary, despawner, spawner], then first the commands of auxiliary will be applied, then those of despawner, and lastly those of spawner. This is because the executer uses a FixedBitSet to store the unapplied_systems in the same order as that of the slice. What could happen, is that auxiliary and despawner both require mutable access to a Resource, meaning they can't run at the same time. If spawner can run together with auxiliary then they will both be executed together, and once auxiliary finishes, despawner will be executed. That means that even though spawner will be executed first, its commands will only be applied after the commands from despawner, since despawner appears first in the unapplied_systems of the executer. The reason this bug doesn't always happen is because the order of the slice that the executer receives appears to be non-deterministic: it seems to have a different value every run of the program.

To summarize the order of events in the bug:

The user defines 3 systems [auxiliary, despawner, spawner], where:
1. auxiliary and despawner both request mutable access to a Resource, meaning they can't run at the same time.
2. despawner and spawner both request mutable access to a different Resource, meaning they can't run at the same time.
3. auxiliary and spawner are independent, meaning they can run at the same time.
MultiThreadedExecutor::run is called with schedule = [auxiliary, despawner, spawner]
MultiThreadedExecutor::spawn_system_tasks is called by run, and only spawns auxiliary and spawner because despawner can't run with auxiliary
1. auxiliary does nothing
2. spawner calls Commands::spawn to spawn an entity with a component
spawn_system_tasks is called again in the next iteration of the loop in run, this time spawning despawner
1. despawner calls EntityCommands::despawn to despawn the entity spawned by spawner
MutliThreadedExecutor::apply_deferred is called to apply the commands from each system in their original order [auxiliary, despawner, spawner]
1. For auxiliary, there is nothing to apply.
2. For despawner, the Despawn command is applied.
3. For spawner, the Insert command is applied. [This is the point where bevy panics].

Finally, here is the example to reproduce this bug:

use bevy::prelude::*;

#[derive(Resource)]
struct MainEntity(Option<Entity>);

#[derive(Resource)]
struct SharedResource;

fn main() {
    App::new()
        .insert_resource(MainEntity(None))
        .insert_resource(SharedResource)
        .add_plugins(DefaultPlugins)
        .add_systems(Startup, setup)
        .add_systems(Update, (auxiliary, despawner, spawner))
        .run();
}

fn setup(mut commands: Commands) {
    commands.spawn(Camera3dBundle::default());
}

fn auxiliary(_shared: ResMut<SharedResource>) {}

fn despawner(
    mut commands: Commands,
    mut main_entity: ResMut<MainEntity>,
    _shared: ResMut<SharedResource>,
) {
    if let Some(entity) = main_entity.0 {
        commands.entity(entity).despawn();
        main_entity.0 = None;
    }
}

fn spawner(mut commands: Commands, mut main_entity: ResMut<MainEntity>) {
    if let None = main_entity.0 {
        let entity = commands.spawn(TransformBundle::default()).id();
        main_entity.0 = Some(entity);
    }
}

Due to the non-deterministic ordering of the systems in the schedule, you would probably have to run this example multiple times before it panics. I used bevy 0.11.3 to run the example, but this also happens in 0.11.2 and in the main branch.

alice-i-cecile commented 10 months ago

Thanks for digging! So far this looks like "working as intended" behavior, even if it is a pain. Does this still occur when your two systems that emit commands are ordered?

RedMindZ commented 10 months ago

When using:

add_systems(Update, (auxiliary, (spawner, despawner).chain()))

in the minimal example, there seems to be no crash. Same for my original application: it doesn't appear to crash when chaining the equivalent systems. Of course, its possible I just haven't checked enough times given the random nature of the crashes. At the same time, I can only chain the systems in the branch of my app I used to find this bug. In my original code, all three equivalents of auxiliary, despawner, and spawner belong to different plugins that don't give a simple way to schedule their systems. Of course, I could add this functionality to the plugins and order their systems in main, but that is a very fragile solution that would require being familiar with this issue and manually ordering all future systems that could potentially cause it. I doubt this kind of solution could scale to a large number of systems.

Even if this is intended behavior, I think it should still be changed for 2 reasons:

The application of command buffers doesn't actually follow execution order, which is something both my application (and I assume other people's apps) depend on. Without this, having two systems that mutably access the same resource becomes unreliable.
The biggest problem with this behavior is that it can happen unexpectedly when using plugins: Imagine that the despawner system requests mutable access to the camera. If there are any other systems requesting mutable access to the camera (for example, if you use a plugin to add camera controls), then those other systems will act like the auxiliary system in the example, leading to the application randomly crashing, making it superbly difficult to debug. This is what actually happened in my application: when I tested removing the camera controller, suddenly the random crashes stopped (which is also very difficult to verify, since the crashes are random), which initially left me really confused because the camera system is completely unrelated to the system that bevy reports as the culprit.

I think that at the very least, there should be a warning in the documentation of Commands that says that commands from different systems may execute in arbitrary order.

alice-i-cecile commented 10 months ago

That's pretty reasonable: do you have an idea of what your ideal behavior would look like here?

RedMindZ commented 10 months ago

I think the simplest solution is to have the executer record the order in which systems are spawned, and apply the command buffers in that order.

For systems that are spawned together (in the same call to spawn_system_tasks) the order will remain as is. That means that all systems that are spawned together in the same call will have their commands applied in the order they appear in the schedule's slice.

For systems that are spawned in different calls to spawn_system_tasks, the systems that were spawned in earlier calls should be applied first.

For example, if the scheduled systems come in the following slice: [A, B, C, D, E, F, G] And they are spawned as follows:

spawn_system_tasks(A, C, D, G)
spawn_system_tasks(B, F)
spawn_system_tasks(E)

Then they should be applied in the order: A, C, D, G, B, F, E

This way, if a system has a dependency on the actions of another system (for example, when sharing state through resources), bevy will already make sure to spawn them one after the other (and most importantly, not in parallel), and then the commands will be applied in the same order that the shared state was modified in.

maniwani commented 10 months ago

What happens right now is that systems run at the first available opportunity—which can vary for pairs of ambiguous systems—while their command queues are applied in the pre-determined topological order.

I'm not against "command queues should be applied in the same order that systems run", but some consequences are:

It will reduce consistency when switching between the SingleThreadedExecutor and MultiThreadedExecutor. Going from serial to parallel-capable execution would introduce significantly more variation than it currently does.
It will pretty much lose any possibility of the MultiThreadedExecutor exhibiting usable determinism. Maybe that was already a lost cause, but I had some hope that with better Entity reservation and the ability to somehow save and load a schedule's topological order, you'd be able to embed a deterministic sequence of logic within a schedule running with the MultiThreadedExecutor. But, for example, having even one pair of conflicting systems with uncertain order push commands that modify the same part of a networked simulation would break it.

RedMindZ commented 10 months ago

Its true that there will be an increase in the variability of the order in which commands are applied, but what we gain in return is the ability to observe that order from the systems. Currently the only way to know this order is by looking at the implementation of the executer and figuring out that you could know the order by inspecting the scheduler. In other words, the only way to know the order for ambiguous systems is by knowing the implementation details, which is probably not something the user should be aware of.

As for determinism, I think it would be sensible to require a deterministic system ordering (meaning having zero ambiguities). Having ambiguous systems ran by the multithreaded executor would always be dependent on the time it takes each system to run: If a certain system would suddenly take longer to run, the executer can start another system at the same time, which would not happen if the system finished instantly instead.

maniwani commented 10 months ago

Currently the only way to know this order is by looking at the implementation of the executer and figuring out that you could know the order by inspecting the scheduler. In other words, the only way to know the order for ambiguous systems is by knowing the implementation details, which is probably not something the user should be aware of.

bevy_mod_debugdump can provide you with a detailed visualization of a schedule. That or something like it could be upstreamed in the future.

bevyengine / bevy