PistonDevelopers / turbine

A 3D game engine with built-in editor
Apache License 2.0

Profiling Ui - Round 1 #31

Open bvssvni opened 9 years ago

bvssvni commented 9 years ago

Testing to see if we can make some performance improvements in Conrod.

Rendering 21 buttons. Profiling using Instruments on OSX.

screen shot 2015-11-02 at 21 38 02

With UI:

screen shot 2015-11-02 at 21 38 13

Without UI (the spikes are when turning UI on to start/stop profiling):

screen shot 2015-11-02 at 21 38 26

bvssvni commented 9 years ago

I also ran a benchmark on the texture_swap example to get an estimate of how fast it could be. On my machine, one textured rectangle should take approximately 0.000016 seconds. If we include the frame and extra stuff, let's say 0.00005, then rendering 21 buttons should take no more than 0.00105 seconds. About 1 millisecond in worst case.

Need an estimate on how much Conrod uses per button.

bvssvni commented 9 years ago

Notice: This was tested in debug mode, so the result is invalid.

Running in bench mode for 2000 frames.

21 buttons: 20.701
11 buttons: 13.141
1 button: 5.600

(20.701 - 5.600)/20/2000 = 0.000377525
(13.141 - 5.600)/10/2000 = 0.00037705

This is about 7.5 times slower than the 0.00005 seconds per button estimated above.
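The arithmetic above can be written as a small sketch. The function names are made up for illustration; the inputs are the totals and the 0.00005 second estimate quoted in this thread, and the totals are assumed to be in seconds.

```rust
/// Cost per button per frame, in seconds, from two total run times.
/// Subtracting the run with fewer buttons cancels the background
/// overhead that both runs share.
fn per_button_seconds(total_many: f64, total_few: f64, extra_buttons: u32, frames: u32) -> f64 {
    (total_many - total_few) / f64::from(extra_buttons) / f64::from(frames)
}

/// How many times slower a measured cost is than a reference cost.
fn slowdown(measured: f64, reference: f64) -> f64 {
    measured / reference
}
```

Calling `per_button_seconds(20.701, 5.600, 20, 2000)` reproduces the first figure, and passing that result to `slowdown` with a reference of `0.00005` gives roughly 7.5.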

mitchmindtree commented 9 years ago

Hmmm does the profiler's call tree give you a percentage distribution of where the most time is being spent?

The last time I checked, I think the drawing in elmesque is a big bottleneck for conrod - maybe it's worth checking out?

See draw_element and draw_form.

bvssvni commented 9 years ago

@mitchmindtree I suspect that the biggest bottleneck is not the drawing but the construction of the elements. They rely on allocation, which is a slow operation to perform while rendering.

I have an idea: Since most of the interface consists of rectangles, perhaps we can find a solution where rectangle shapes are cheap?

bvssvni commented 9 years ago

Btw, I don't want to draw conclusions at this point. There could be something in the drawing that makes it slow. However, the profiler shows _platform_memmove at the top which suggests something to do with memory.

mitchmindtree commented 9 years ago

Ahh I see! I always worried that all the boxing elmesque uses in its recursive Form and Element data structures might be an issue in this way :worried:

I've been thinking about this for a while too (making layout cheaper). All of the unnecessary boxing that occurs happens within the Widget::draw implementations due to the elmesque Form and Element related functions. Widget::draw is basically just a way for a widget to produce a description of how it is to be drawn, but without actually drawing the widget (that happens later when Ui::draw or Ui::draw_if_changed is called). I've been thinking of two options for making the graphics "description" and layout stuff cheaper:

1. Re-think how elmesque's recursive Form and Element data structures work internally.

It might be nice to try this first as it might be a bit easier/quicker than option 2? However I still haven't been able to think of a more efficient way to do this that doesn't involve changing the elm-style purely functional API or adding some hidden global state graph or something. Maybe a solution could be to use carboxyl's FRP API? I'm not certain if this would be a benefit though without also re-working conrod to use it, which might be a pretty huge/unnecessary process.

2. Create a conrod specific graphics description/layout system inspired by elmesque.

This would be quite the breaking change, however there could be a number of benefits to this approach:

I can picture this being done as a "graphics layout tree", where rather than building the tree using boxed recursion, we provide a little tree data structure (a wrapper around something like rose_tree) to each widget which only gets allocated once and can be re-used every Widget::draw. The API could be very similar to what it is now, just with the tree being described in the data structure rather than the type recursion, removing the need for all the elmesque related allocations. We might even be able to do all widgets within a single tree - rather than providing each widget with its own unique tree, we could provide a safe wrapper around the branch of the master tree that represents that widget's graphics layout.
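A minimal sketch of the "allocate once, re-use every draw" idea, under loud assumptions: a plain `Vec` stands in for rose_tree, and `Shape`, `LayoutTree` and `describe_button` are hypothetical names, not conrod or elmesque API. Nodes refer to their parent by index, so nothing is boxed, and clearing the tree keeps its capacity for the next frame.

```rust
/// Hypothetical primitive; a real description would also carry colour, text, etc.
#[derive(Debug, Clone, PartialEq)]
enum Shape {
    Group,
    Rect { x: f64, y: f64, w: f64, h: f64 },
}

struct Node {
    shape: Shape,
    parent: Option<usize>,
}

/// Flat tree: all nodes live in one Vec and point at their parent by index.
struct LayoutTree {
    nodes: Vec<Node>,
}

impl LayoutTree {
    fn new() -> Self {
        LayoutTree { nodes: Vec::new() }
    }

    /// Drops all nodes but keeps the allocated capacity for the next frame.
    fn clear(&mut self) {
        self.nodes.clear();
    }

    /// Adds a node under `parent` and returns its index.
    fn add(&mut self, parent: Option<usize>, shape: Shape) -> usize {
        self.nodes.push(Node { shape, parent });
        self.nodes.len() - 1
    }

    /// Indices of the direct children of `parent`.
    fn children(&self, parent: usize) -> impl Iterator<Item = usize> + '_ {
        self.nodes
            .iter()
            .enumerate()
            .filter(move |(_, n)| n.parent == Some(parent))
            .map(|(i, _)| i)
    }
}

/// Stand-in for a widget's draw step: a group containing one rectangle.
fn describe_button(tree: &mut LayoutTree, x: f64, y: f64) -> usize {
    let root = tree.add(None, Shape::Group);
    tree.add(Some(root), Shape::Rect { x, y, w: 60.0, h: 30.0 });
    root
}
```

A frame would call `tree.clear()` and then let each widget describe itself again; after the first frame the `Vec` no longer needs to grow, so describing the same widgets allocates nothing further.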

Hmmm

I think I've been heavily leaning towards option 2 for a while now - despite being quite a bit of work, it feels like a much more future proof option having something specific to conrod. Lemme know your thoughts or if you have any other ideas! I might have a crack at this tree thing tonight in a separate branch while it's fresh in my mind.

bvssvni commented 9 years ago

This profiling was done in debug mode, so it is not a reliable benchmark. The previous estimates of the time spent rendering a button are not representative of a release build. Conrod is a lot faster in release mode.

Improved optimizations in the Rust compiler might also have affected the results. I forgot to include the version used in this test.

In order to evaluate https://github.com/PistonDevelopers/conrod/pull/626 properly, we need to redo the estimates before Elmesque is removed from Conrod. This gives us a closer baseline for the level of performance improvement we can expect from switching to primitives.

Ran again for 2000 frames, but now in release, making sure that the overhead from Cargo is removed.

rustc 1.6.0-nightly (5b4986fa5 2015-11-08)

    cargo build --release --example hello_world
    time ./target/release/examples/hello_world

Code changes:

    // Counts rendered frames; the loop exits after 2000 render events.
    let mut frames = 0..2000;
    for mut e in window.bench_mode(true) {
        if e.render_args().is_some() {
            if frames.next().is_none() { break; }
        }
        ...
        if !capture_cursor {
            ui.handle_event(&e);
            e.draw_2d(|c, g| {
                use conrod::*;

                widget_ids!(REFRESH);

                Button::new()
                    .color(color::blue())
                    .top_left()
                    .dimensions(60.0, 30.0)
                    .label("refresh")
                    .react(|| {})
                    .set(REFRESH, &mut ui);

                for i in 0..20 {
                    Button::new()
                        .color(color::blue())
                        .down(0.0)
                        .dimensions(60.0, 30.0)
                        .label("refresh")
                        .react(|| {})
                        .set(REFRESH + 1 + i, &mut ui);
                }

                ui.draw(c, g);
            });
        }
    }
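The runs here are timed from the outside with `time`, which also counts process startup and shutdown. As an alternative sketch (not what was used in this thread), the loop could be timed in-process with `std::time::Instant`. `time_frames` and `seconds_per_frame` are hypothetical helpers, and the closure stands in for whatever renders one frame.

```rust
use std::time::{Duration, Instant};

/// Runs `render` once per frame and returns the total wall-clock time.
fn time_frames<F: FnMut()>(frames: u32, mut render: F) -> Duration {
    let start = Instant::now();
    for _ in 0..frames {
        render();
    }
    start.elapsed()
}

/// Average time per frame, in seconds.
fn seconds_per_frame(total: Duration, frames: u32) -> f64 {
    total.as_secs_f64() / f64::from(frames)
}
```

A total of 6 seconds over 2000 frames corresponds to 0.003 seconds per frame.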

Ran 3 times using Conrod 0.22.2, discarding the slowest:

0m6.009s
0m6.011s

Ran 3 times, discarding the slowest, using https://github.com/mitchmindtree/conrod/commit/861726a4c7b03b3af918047470fc9ed83b76c55c (before Elmesque was removed from Conrod):

0m6.400s
0m6.266s

We see that https://github.com/PistonDevelopers/conrod/pull/626 is in the same ballpark, but a little slower. Notice that Elmesque is not removed yet, and the PR is still work-in-progress, so it looks promising.
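To put a number on "a little slower", the kept runs can be averaged and compared. This is only a sketch with made-up function names; it compares total run times, so the background overhead that both builds share is included in the ratio.

```rust
/// Arithmetic mean of a non-empty slice of run times.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Fraction by which the candidate's mean exceeds the baseline's mean
/// (0.05 means 5% slower).
fn relative_slowdown(baseline: &[f64], candidate: &[f64]) -> f64 {
    (mean(candidate) - mean(baseline)) / mean(baseline)
}
```

With the times above, `relative_slowdown(&[6.009, 6.011], &[6.400, 6.266])` comes out at roughly 0.054, so the PR commit is about 5% slower in total run time.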

bvssvni commented 9 years ago

Made a spreadsheet that I will add to the piston-examples repo:

rustc 1.6.0-nightly (5b4986fa5 2015-11-08)

screen shot 2015-11-15 at 20 04 58

I am getting approximately 15 microseconds in release mode (two runs).

bvssvni commented 9 years ago

In debug mode I get 46 microseconds, which is about 3 times slower than release mode.

screen shot 2015-11-15 at 20 26 21

bvssvni commented 9 years ago

One weakness with the texture_swap estimate is that performance is sensitive to the size of the textures. I expect it to have approximately the same characteristics across hardware, such that you could calculate the worst case for a texture of a given size.

bvssvni commented 9 years ago

I generalized the spreadsheet for estimating O(N) stuff in Turbine. Here I measure buttons using Conrod 0.22.2 with rustc 1.6.0-nightly (5b4986fa5 2015-11-08):

screen shot 2015-11-15 at 21 52 38

I get about 69 microseconds per button. The curve bends slightly upward, which is probably why the accuracy of the prediction is only around 85%: the more buttons there are, the more time is spent per button.

Note that one button is ignored; it becomes part of the background overhead.
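The spreadsheet's formulas are not shown in this thread, so the following is only one possible way to get a per-item cost out of such measurements: an ordinary least-squares line, where the slope is the cost per extra button and the intercept is the background overhead (including the one ignored button). `linear_fit` is a hypothetical helper.

```rust
/// Least-squares line through `(x, y)` points.
/// Returns `(slope, intercept)`, or `None` if a line cannot be fitted.
fn linear_fit(points: &[(f64, f64)]) -> Option<(f64, f64)> {
    if points.len() < 2 {
        return None;
    }
    let n = points.len() as f64;
    let mean_x = points.iter().map(|p| p.0).sum::<f64>() / n;
    let mean_y = points.iter().map(|p| p.1).sum::<f64>() / n;
    // Sums of squared deviations in x and of cross deviations in x and y.
    let sxx: f64 = points.iter().map(|p| (p.0 - mean_x).powi(2)).sum();
    let sxy: f64 = points.iter().map(|p| (p.0 - mean_x) * (p.1 - mean_y)).sum();
    if sxx == 0.0 {
        return None; // all x values are identical
    }
    let slope = sxy / sxx;
    Some((slope, mean_y - slope * mean_x))
}
```

As a check against numbers that are in the thread, feeding it the earlier debug-mode totals as `(extra buttons, total)` pairs, `[(0.0, 5.600), (10.0, 13.141), (20.0, 20.701)]`, gives a slope of about 0.755 per button over 2000 frames, which is the same 0.000377 per button per frame computed by hand earlier.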

bvssvni commented 9 years ago

Here are buttons with Conrod https://github.com/mitchmindtree/conrod/commit/861726a4c7b03b3af918047470fc9ed83b76c55c (before Elmesque is removed) on rustc 1.6.0-nightly (5b4986fa5 2015-11-08):

screen shot 2015-11-15 at 22 10 39

As before, when I measured total time, this is a little slower. It also shows that Conrod spends more time per button when adding more buttons, in comparison to 0.22.2. An ideal O(N) algorithm would have accuracy of 100%, but this shows 79%.

This type of estimation could be useful not just for checking how fast something is, but also for seeing whether changes improve algorithmic complexity.
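One simple way to look for a bending curve, sketched here with hypothetical helpers, is to compare the marginal cost per item between consecutive measurements. This is not the spreadsheet's "accuracy" figure, whose formula is not given in this thread. Samples are assumed to be `(item count, total time)` pairs sorted by strictly increasing item count.

```rust
/// Cost per extra item between each pair of consecutive samples.
fn marginal_costs(samples: &[(f64, f64)]) -> Vec<f64> {
    samples
        .windows(2)
        .map(|w| (w[1].1 - w[0].1) / (w[1].0 - w[0].0))
        .collect()
}

/// Last marginal cost divided by the first. A value near 1.0 is what an
/// O(N) algorithm would show; clearly above 1.0 means the curve bends upward.
fn growth_ratio(samples: &[(f64, f64)]) -> Option<f64> {
    let costs = marginal_costs(samples);
    match (costs.first(), costs.last()) {
        (Some(&first), Some(&last)) if first != 0.0 => Some(last / first),
        _ => None,
    }
}
```

For the earlier debug-mode totals, `[(1.0, 5.600), (11.0, 13.141), (21.0, 20.701)]`, the ratio is very close to 1.0, while a quadratic series such as `[(1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]` gives a ratio well above 1.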

mitchmindtree commented 9 years ago

Will take a look at this more closely soon, but just thought I'd mention that elmesque hasn't actually been removed in that new PR just yet, and I would expect it to be a bit slower in its current state :) I'll let you know once I've actually removed it and expect things to be faster (y)

bvssvni commented 9 years ago

@mitchmindtree Yeah, I knew that. I'm doing this to test the method so we know what it says. I wrote "before Elmesque is removed" where it is relevant.

bvssvni commented 9 years ago

Measuring buttons in debug mode using Conrod 0.22.2 with rustc 1.6.0-nightly (5b4986fa5 2015-11-08):

screen shot 2015-11-15 at 23 37 50

About 416 microseconds per button, which is 6 times slower than release mode.

This shows something interesting: the algorithm becomes almost linear in debug mode. I think this is because the overhead caused by the design drowns in the noise and only becomes significant when the compiler generates optimized machine code. This may be an indicator that extra allocations don't matter much compared to optimization, which is a bit surprising.