DrawBatch::Submit z_order isn't always respected

BobG1983 commented 4 years ago

I have two systems, one rendering the map (https://github.com/RantingBob/escape-from-alpha-nine/blob/feature/drawbatch/src/systems/render_systems/map_renderer.rs) one rendering entities (https://github.com/RantingBob/escape-from-alpha-nine/blob/feature/drawbatch/src/systems/render_systems/entity_renderer.rs).

The map renderer submits with a Z-Order of 1 to Console 0 The entity renderer submits with a Z-Order of 2 to Console 0

I would expect that things rendered in the entity renderer would render on top of the things rendered in the map renderer. This happens sometimes, but not always. I assume due to the order of calling submit based on https://github.com/thebracket/bracket-lib/blob/master/bracket-terminal/src/consoles/command_buffer.rs#L166

Its possible I'm not fully understanding the purpose of z_order, and should instead be using multiple sparse consoles for each of my layers (though I imagine this would come at a cost of overdraw?). I did try that, I added a sparse console and set the drawbatchs target to be that console. I had the exact same issue.

Repro is here: https://github.com/RantingBob/escape-from-alpha-nine/tree/feature/drawbatch Repro using consoles: https://github.com/RantingBob/escape-from-alpha-nine/tree/using-consoles

BobG1983 commented 4 years ago

FYI, if I set my ZOrder to have extremely large gaps (like 0, 1M, 2M, etc.) then it works as expected. It does appear to be that https://github.com/thebracket/bracket-lib/blob/master/bracket-terminal/src/consoles/command_buffer.rs#L166 is at fault. I have that as a work around for now.

singalen commented 3 years ago

Observing the same behavior. The entity last updated draws on top, even though its Z-order is is lower.

The Render entity that I called

commands.remove_component::<Inventory>(entity);
commands.add_component(entity, pos);

on now draws on top of Player, even though it has Render.layer == 1 and Player has Render.layer == 2

My rendering code:

#[derive(Clone, Copy, PartialEq)]
pub struct Render {
    pub color: ColorPair,
    pub glyph: map::Glyph,
    pub layer: i16, // 0..2, currently
}

#[system]
#[read_component(Point)]
#[read_component(Render)]
pub fn entity_render(
    ecs: &SubWorld, 
    #[resource] camera: &Camera
) {
    let mut draw_batches: [_; 3] = [
        DrawBatch::new(), DrawBatch::new(), DrawBatch::new()
    ];

    for batch in draw_batches.iter_mut() {
        batch.target(0);
    }

    let offset = Point::new(camera.left_x, camera.top_y);

    <(&Point, &Render)>::query()
        .iter(ecs)
        .for_each(|(pos, render)| {

            let glyph = glyph_to_u16(render.glyph);

            draw_batches[render.layer as usize].set(
                (*pos - offset).into(),
                render.color.into(),
                glyph
            );
        });

    for (i, batch) in draw_batches.iter_mut().enumerate() {
        batch.submit(5000 + i).expect("Batch error");
    }
}

waldoplus commented 3 years ago

DrawBatch holds collection of draw commands. For example, if you use draw_batch.set(...) in a loop that renders 30 entities, then your batch contains 30 drawing commands. Let's say that you submit that batch with a z-order of 0, then slots 0-29 are going to be filled. To render your player on top, it should have a z-order of 30. Here's a playground. See if you can convince yourself that's true by switching z-order to 30 instead of 29 or by playing with the number of draw commands issued in the first batch and changing the z-order of the second submission.

use bracket_lib::prelude::*;
struct State {}

impl GameState for State {
    fn tick(&mut self, ctx: &mut BTerm) {
        let mut draw_batch = DrawBatch::new();
        for _ in 0..=30 { 
            draw_batch.print(Point::new(0, 0), "Enemy!".to_string());
        }
        draw_batch.submit(0).expect("...");

        let mut draw_batch = DrawBatch::new();
        draw_batch.print(
            Point::new(0, 0), 
            "Player".to_string()
        );
        // switch z-order to 30 to show player on top
        draw_batch.submit(29).expect("...");  

        render_draw_buffer(ctx).expect("...");
    }
}

fn main() -> BError {
    let context = BTermBuilder::simple80x50()
        .build()?;

    let gs: State = State {};
    main_loop(context, gs)
}

In your code, if you submit z-orders of 5000, 5001, 5002, respectively, and let's say the first batch renders a 100 entities, then that space is used up and your player will not show on top.

I think the confusion comes by labelling each batch as a separate layer. They're not layers, just a series of draw commands. I think the example below demonstrates that nicely. Let's say you had only one batch with two subsequent draw commands added to the same position. Which would render on top?

fn tick(&mut self, ctx: &mut BTerm) {
    let mut draw_batch = DrawBatch::new();
    draw_batch.print(Point::new(0, 0), "Enemy!".to_string());
    draw_batch.print(Point::new(0, 0), "Player".to_string());
    draw_batch.submit(0).expect("...");
    render_draw_buffer(ctx).expect("...");
}

Hope that helps.

BobG1983 commented 3 years ago

Oh yeah, as I said in my very initial post, I understand how it works (I pointed to the specific line in the code that causes the behavior). My expectation was that within a given Z-order all objects would be drawn in submission order, then for the next Z-order they'd be drawn in order etc. That's certainly what I'd expect from other engines.

I'd consider maintaining a list for each z-order, then drawing each list in order. That would "solve" the unexpected behavior. Especially given that a given Z-layer doesn't necessarily have a specified number of objects in it.

waldoplus commented 3 years ago

I see your point. It would be easier and it definitely confused me the first time using it. Especially if you have an unknown number of entities in a batch.

I don't use draw_batch to manage Z-order. In my mind, the main purpose of the draw batch is to run all your render calculations in parallel in the multi-threaded ECS environment. One way to simplify things is to have separate map and entity layers. Then there's still the common roguelike problem where your player entity occupies the same tile as an item entity. They are both on the entity layer and you want your player to appear on top. No need to have a separate batch. In your entity render system, sort your entities by a z-order given on their render component. The one with the highest Z-order would gets added to the draw batch. It's more efficient to do this anyway, rather than instantiating several draw batches and submitting them.

singalen commented 3 years ago

In your entity render system, sort your entities by a z-order given on their render component. The one with the highest Z-order would gets added to the draw batch. It's more efficient to do this anyway, rather than instantiating several draw batches and submitting them.

But… general sorting complexity is O(N*log N), and there can be thousands of entities. Sort entities on each frame? Instead of sorting, in my case, 3 draw batches?

OK, sorting into 3 buckets is just 3 passes over N entities. But still it’s unnecessary inefficient.

waldoplus commented 3 years ago

I mean use one draw batch in your entity render system. In terms of a bucket, the entities with the highest Z-order that occupy the same tile are dropped in last, overwriting the previous.

Regarding the time complexity, the question is what you think each draw batch is doing under the hood. Does each draw batch have to be iterated and added to the master virtual console before drawing? How does that affect the time complexity and how does compare to the time complexity of a simple sort?

Honestly, I try to keep my entity number lower per depth. I'm currently using about 50. It keeps my game lean. With about 20 systems, iterating through a large number of entities for each system is expensive. I don't know your game but I thought I'd offer my advice. Whatever works best for you.

BobG1983 commented 3 years ago

Again, I'm aware of the workarounds, but this is, in my opinion, a bug. At the very least it's inconsistent with other game engines/frameworks where for N Z-order all objects will always be drawn before N+1 Z-Order.

This also allows you to do clever optimizations under the hood (like only drawing the top most non-transparent glyph).

waldoplus commented 3 years ago

I'm inclined to agree. The naming of the parameter as z-order definitely implies that it should be used for z-ordering. I don't see any disadvantages in doing it as you describe. Would certainly be less confusing.

For comparison / future design, I bet it would be helpful if you gave a specific example of another batching system. I think you already know that we are not just talking about simple console layers but rather thread-safe batching. For Rust, I've used doryen-rs and Bevy with doryen-rs plugin to render but didn't see a similar feature.

On a side note, while batching is better than not batching, I was curious how much batching inside the ECS actually improves performance, as opposed to just batching at the end of your game loop. I think that this batch system uses parking_lot. Given the mutex blocks anyway, just how much parallel processing is going on? Same just to submit a batch at end of game loop? Hope that the author can comment on the design choices.

thebracket commented 3 years ago

This definitely isn't as clear as it could be, I'll have a look to see how I can go about fixing the merge portion (where the batches are applied) to be a bit more robust about that.

The Batching Process

In answer to the questions about optimization/how it helps with concurrency, let's do a quick walk-through of how a batched session works:

You request a new DrawBatch with new. This very briefly locks a pool of buffers, assigning an empty one to you. It uses object pooling, because my benchmarking showed that this could be a big performance win overall---the buffer creation overhead is up front, and you get a "warm" and usable buffer. (It also keeps memory usage consistent)
The DrawBatch you just acquired is now detached from the pool, there's no locking going on. A single batch isn't designed to be thread safe, it's designed for use within a thread. So you push commands to it, which is a vector push into its command buffer. (The buffer has a lot of space pre-allocated, so it's very unlikely that you'll cause an allocation/move here)
Eventually, you call submit. This does lock (I couldn't find a good way to not lock), and appends the commands to the global command buffer. Once your batch falls out of scope, it returns to the batch pool.
At some point your main loop will call render_draw_buffer. This sorts by z-order, and calls the various terminal draw commands in order. This does lock, but it should be uncontested (which is effectively free with parking_lot)---you shouldn't be running this outside of your main thread anyway.

It started out as a way to make terminal submission work without trying to share the overall context within Specs (or Legion, etc.)' resources/system data. The pooling and other optimizations came along when I discovered that the overall model performed really well---far better than I expected.

Optimizations

The degree to which batching helps you is a bit tricky, because it depends upon the type of console you are using.

You do get some immediate wins: writing to your own batch vector is very fast.
Submitting everything at once is very cache-friendly, giving a bigger performance boost than I expected.
Being able to build your batches in threads can be a big win, depending upon how your program works.

Simple Consoles

The simple console maintains a buffer of glyphs at each screen location, updated in order. The renderer will never perform more than one draw call for a tile. So you automagically get the "highest z" resulting glyph on a tile. It deliberately doesn't try to filter out transparent calls - you might have decided to clear a tile, and I have no way of knowing if you meant to do that.

On GL backed renders, the whole console is rendered as a GPU buffer. I experimented with just rendering what was needed for a frame, and it was a performance loss---it's faster to keep a buffer around (rebuilding when needed) and submit the whole thing. On some cards (notably Intel), I even found that it was faster to submit a new buffer than to mess with the GL functions for changing part of one. Some GPU buffer locking code is apparently pretty awful.

On Curses/Crossterm, rendering glyphs is surprisingly expensive. So there's a "dirty buffer" system that retains an overall view of the console and only re-renders items that changed. It tries to be smart - so if you cls and then re-draw the same thing, it won't actually clear the screen and start over. This keeps it usable over SSH and in a regular terminal. Still not as great as I'd like, but not bad.

Sparse Consoles

Sparse consoles are an odd duck. If you only have a few things to render, they perform pretty well---just drawing what's needed. If you basically draw the whole screen on a sparse console, they are going to be slower. There are some optimizations to keep it chugging along, but the nature of the console (appending characters to a draw list, rather than a simple grid) will always give a performance hit for a big batch.

Fancy and Sprite Consoles

There's basically no way to know how to handle overdraw with these, so it doesn't try. It just renders the characters you specify, where you specified them. The sprite system works, but I probably wouldn't try and write an enormous 2D game with them.

Fixing z-order

Looking at the code, I'm going to have to do a bit of work & thinking to retain the performance advantages and separate each batch (sorted by z-order). I think that submitting the buffer will have to keep the whole buffer separated so I can sort the buffers rather than the draw commands. I'll play around a bit and see what I can come up with.

One thing to bear in mind is that bracket-lib as a whole is a teaching tool. It's focused on being the support library for the roguelike tutorial, my Hands-on Rust book, and so on. So whenever I have to make a hard choice between a beginner-friendly interface and some performance enhancements, I go with the former unless I really have to make things trickier. The other issue is that there are a lot of users, many of whom have a book I can't readily change. So I have to keep the interface the same, or I'm going to get a lot of angry emails.

thebracket commented 3 years ago

Since this is quite a large issue, I'm working on it in the command_buffers_140 branch. I just committed a patch that helps quite a bit, and seems to retain some performance (I had to do a bit of a dance with vectors to make it fast; avoiding cloning buffers at all costs!)

BobG1983 commented 3 years ago

That was a great read and what you suggested as to how to use it is pretty much exactly what I want to do.

As for the API. Please don’t change it. It’s a delight.

Sent from ProtonMail for iOS

On Thu, Jul 8, 2021 at 2:14 PM, thebracket @.***> wrote:

This definitely isn't as clear as it could be, I'll have a look to see how I can go about fixing the merge portion (where the batches are applied) to be a bit more robust about that.

The Batching Process

In answer to the questions about optimization/how it helps with concurrency, let's do a quick walk-through of how a batched session works:

You request a new DrawBatch with new. This very briefly locks a pool of buffers, assigning an empty one to you. It uses object pooling, because my benchmarking showed that this could be a big performance win overall---the buffer creation overhead is up front, and you get a "warm" and usable buffer. (It also keeps memory usage consistent)

The DrawBatch you just acquired is now detached from the pool, there's no locking going on. A single batch isn't designed to be thread safe, it's designed for use within a thread. So you push commands to it, which is a vector push into its command buffer. (The buffer has a lot of space pre-allocated, so it's very unlikely that you'll cause an allocation/move here)

Eventually, you call submit. This does lock (I couldn't find a good way to not lock), and appends the commands to the global command buffer. Once your batch falls out of scope, it returns to the batch pool.

At some point your main loop will call render_draw_buffer. This sorts by z-order, and calls the various terminal draw commands in order. This does lock, but it should be uncontested (which is effectively free with parking_lot)---you shouldn't be running this outside of your main thread anyway.

It started out as a way to make terminal submission work without trying to share the overall context within Specs (or Legion, etc.)' resources/system data. The pooling and other optimizations came along when I discovered that the overall model performed really well---far better than I expected.

Optimizations

The degree to which batching helps you is a bit tricky, because it depends upon the type of console you are using.

You do get some immediate wins: writing to your own batch vector is very fast.

Submitting everything at once is very cache-friendly, giving a bigger performance boost than I expected.

Being able to build your batches in threads can be a big win, depending upon how your program works.

Simple Consoles

The simple console maintains a buffer of glyphs at each screen location, updated in order. The renderer will never perform more than one draw call for a tile. So you automagically get the "highest z" resulting glyph on a tile. It deliberately doesn't try to filter out transparent calls - you might have decided to clear a tile, and I have no way of knowing if you meant to do that.

On GL backed renders, the whole console is rendered as a GPU buffer. I experimented with just rendering what was needed for a frame, and it was a performance loss---it's faster to keep a buffer around (rebuilding when needed) and submit the whole thing. On some cards (notably Intel), I even found that it was faster to submit a new buffer than to mess with the GL functions for changing part of one. Some GPU buffer locking code is apparently pretty awful.

On Curses/Crossterm, rendering glyphs is surprisingly expensive. So there's a "dirty buffer" system that retains an overall view of the console and only re-renders items that changed. It tries to be smart - so if you cls and then re-draw the same thing, it won't actually clear the screen and start over. This keeps it usable over SSH and in a regular terminal. Still not as great as I'd like, but not bad.

Sparse Consoles

Sparse consoles are an odd duck. If you only have a few things to render, they perform pretty well---just drawing what's needed. If you basically draw the whole screen on a sparse console, they are going to be slower. There are some optimizations to keep it chugging along, but the nature of the console (appending characters to a draw list, rather than a simple grid) will always give a performance hit for a big batch.

Fancy and Sprite Consoles

There's basically no way to know how to handle overdraw with these, so it doesn't try. It just renders the characters you specify, where you specified them. The sprite system works, but I probably wouldn't try and write an enormous 2D game with them.

Fixing z-order

Looking at the code, I'm going to have to do a bit of work & thinking to retain the performance advantages and separate each batch (sorted by z-order). I think that submitting the buffer will have to keep the whole buffer separated so I can sort the buffers rather than the draw commands. I'll play around a bit and see what I can come up with.

One thing to bear in mind is that bracket-lib as a whole is a teaching tool. It's focused on being the support library for the roguelike tutorial, my Hands-on Rust book, and so on. So whenever I have to make a hard choice between a beginner-friendly interface and some performance enhancements, I go with the former unless I really have to make things trickier. The other issue is that there are a lot of users, many of whom have a book I can't readily change. So I have to keep the interface the same, or I'm going to get a lot of angry emails.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

waldoplus commented 3 years ago

One thing to bear in mind is that bracket-lib as a whole is a teaching tool. It's focused on being the support library for the roguelike tutorial, my Hands-on Rust book, and so on. So whenever I have to make a hard choice between a beginner-friendly interface and some performance enhancements, I go with the former unless I really have to make things trickier. The other issue is that there are a lot of users, many of whom have a book I can't readily change. So I have to keep the interface the same, or I'm going to get a lot of angry emails.

I did buy the Hands On Rust book and see that it is now coming out in hard copy. Congrats! Although it might be a breaking change for both books, it might be a win-win for ease of use & enhancements. But don't worry! You won't get an angry e-mail from me because I know how to pin my dependencies. But I can't promise that for the rest of them. Good luck!

Being able to build your batches in threads can be a big win, depending upon how your program works.

It's been phenomenal. I can build up games scenes (world map, inventory, targeting, etc) simply by including only the render systems I want in each custom Legion schedule. It's almost like compositing an image in Photoshop.

I think that submitting the buffer will have to keep the whole buffer separated so I can sort the buffers rather than the draw commands.

👍

vks commented 3 years ago

So I have to keep the interface the same, or I'm going to get a lot of angry emails.

Isn't that issue solved by semver?

thebracket commented 3 years ago

I've been testing the commit I posted above, and it's looking pretty good. Performance is still really solid, and it should solve the z-order issue by running batches on their z-order rather than trying to normalize all of the draw commands. I still have a few projects to test it on, but I'm really hopeful.

And yes - semver does help with API issues. I just have to be really careful, because of the "treat <1.0 differently" rule.

thebracket commented 3 years ago

Alright, testing this patch on the Rust Roguelike Tutorial (most recent chapter) showed that it not only worked, but performed a bit better. It also runs well for the Hands-on Rust example code. :-)

I'll get this merged a bit later today.

nornagon commented 2 years ago

FWIW I think this could be considered a non-breaking change. Presumably, for the majority of cases, users of the current API use the z-order numbers as layers, but instead of doing 1, 2, 3 to split their layers, they use 10000, 20000, 30000 because you have to "leave room" for the previous layer. In these situations, having more than 10k draw commands in one batch would be a bug, because then the z-order becomes undefined.

If the API changed to use logical layers rather than "offsets", then you'd just be silently fixing that bug, since 20000 is already larger than 10000, so the z-order will remain unchanged :)

Up to you whether you consider overlapping batches a useful feature, though, I suppose...

nornagon commented 2 years ago

BTW, looks like this was merged in ef0d2b39085c64db46e856dc5daa72f3d0e2bce3, so perhaps this issue should be closed? Not sure if there have been any releases with that commit included yet (https://github.com/amethyst/bracket-lib/issues/229 seems to corroborate such confusion).

thebracket commented 2 years ago

It should be fixed in the github version. I have it flagged as needing some additional testing, but it seems to work for me. I'll try and start tagging releases with the next one; I have quite a backlog to work through (writing books is ridiculously time consuming!), so I'm going to try to make a decent dent in the issues list before the next release.

amethyst / bracket-lib

DrawBatch::Submit z_order isn't always respected #140