Very slow Nuklear calls from demo code

LWJGL / lwjgl3

LWJGL is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL, Vulkan, bgfx), audio (OpenAL, Opus), parallel computing (OpenCL, CUDA) and XR (OpenVR, LibOVR, OpenXR) applications.

https://www.lwjgl.org

BSD 3-Clause "New" or "Revised" License

4.81k stars 640 forks source link

Very slow Nuklear calls from demo code #313

Closed dustContributor closed 7 years ago

dustContributor commented 7 years ago

Hi! So I was testing out nk_text_wrap since I'd like to implement a dialogue scrollable window thingy (like classic isometric RPGs).

My code for font handling is pretty much what the Nuklear GLFW demo is doing. Normally in my test scene with like a few cubes and things takes around 0.8ms per frame, when added a text wrap widget with a bunch of text, performance came to a "crawl" (as in, perfectly playable 22ms per frame, but really awful considering I'm coming from <1ms per frame).

My test text looks like "asd asdasdas dasdasd asdas" repeated over and over, 4k characters in total, which is roughly a couple paragraphs of text (ie, imagine a dialogue with a plot exposition NPC :P).

The top 3 calls that popped in the profiler (VisualVM sampler with highest resolution possible) were:

stbtt_GetCodepointHMetrics from here https://github.com/LWJGL/lwjgl3/blob/master/modules/core/src/test/java/org/lwjgl/demo/nuklear/GLFWDemo.java#L219
nnk_utf_decode from here https://github.com/LWJGL/lwjgl3/blob/master/modules/core/src/test/java/org/lwjgl/demo/nuklear/GLFWDemo.java#L223
nk_text_wrap from my view code (probably unavoidable).

First call was easy to circumvent, I just made an int -> int map, and cached the resulting 'advance' int like this:

int codepoint = unicode.get( 0 );
// This is a fastutil int2int map.
int advanceValue = codePointCache.get( codepoint );
// Integer.MIN_VALUE acts as 'missing' flag.
if ( advanceValue == Integer.MIN_VALUE )
{
  stbtt_GetCodepointHMetrics( fontInfo, codepoint, advance, null );
  codePointCache.put( codepoint, advanceValue = advance.get( 0 ) );
}
// Put it back where it should be.
advance.put( 0, advanceValue );

Quick, silly and dirty but it gets around the first call.

Without that call I'm down to 8ms per frame, which is nice but the profiler shows I'm still spending like 80% of the time in the second call, nnk_utf_decode.

Now I'm not sure what to do there nor what is that function supposed to do.

Also, say that I wanted to cache all stbtt_GetCodepointHMetrics results ahead of time instead of lazily as how I am doing right now, how I'd go about that? ie, where these "codepoints" from Nuklear are coming from?

I'd appreciate any input on this, not sure if it'd have a better match in Nuklear's repo.

Spasi commented 7 years ago

when added a text wrap widget with a bunch of text, performance came to a "crawl" (as in, perfectly playable 22ms per frame, but really awful considering I'm coming from <1ms per frame).

I knew Nuklear text rendering is not very efficient, because of the callback overhead in the JVM. But 22ms sounds excessive. Could you please post a small demo I can experiment with?

I'm still spending like 80% of the time in the second call, nnk_utf_decode. Now I'm not sure what to do there nor what is that function supposed to do.

This function decodes UTF-8 bytes in the text buffer and returns the next available unicode codepoint. I.e. it consumes one or more bytes (that may correspond to a Java char, or two if the codepoint is a surrogate pair). The GLFWDemo doesn't care about performance and calls it in a loop. This will obviously have a performance impact on long strings (e.g. there's the JNI overhead). I'll work on a Java implementation and post it here and maybe add it as a utility in LWJGL.

But again, 8ms for 4k characters sounds too much, so there may be another problem here. That's why I want to see your code.

say that I wanted to cache all stbtt_GetCodepointHMetrics results ahead of time instead of lazily as how I am doing right now, how I'd go about that? ie, where these "codepoints" from Nuklear are coming from?

Well, you know the text you're rendering, so you can cache everything in advance. Both the horizontal metrics (or other metrics you may need) and the texture quad data (returned by the NkUserFont::query callback). This should be straightforward, unless you need to support dynamic i18n text (say, in-game chat in chinese) where a dynamic caching scheme is more appropriate (for your font textures too).

The Nuklear codepoints are simply the unicode codepoints in your text. The Java String and Character classes have plenty of methods that help you deal with codepoints.

Spasi commented 7 years ago

You can replace nk_utf_decode with this:

private static int decodeUTF8(long text, IntBuffer unicodeCP) {
    int b0 = memGetByte(text);
    if (0 <= b0) {
        unicodeCP.put(0, b0);
        return 1;
    } else if ((b0 >> 5) == -2 && (b0 & 0x1E) != 0) {
        int b1 = memGetByte(text + 1);
        unicodeCP.put(0, (((b0 << 6) ^ b1) ^ (((byte)0xC0 << 6) ^ ((byte)0x80 << 0))));
        return 2;
    } else if ((b0 >> 4) == -2) {
        int b1 = memGetByte(text + 1);
        int b2 = memGetByte(text + 2);
        unicodeCP.put(0, ((b0 << 12) ^ (b1 << 6) ^ (b2 ^ (((byte)0xE0 << 12) ^ ((byte)0x80 << 6) ^ ((byte)0x80 << 0)))));
        return 3;
    } else if ((b0 >> 3) == -2) {
        int b1 = memGetByte(text + 1);
        int b2 = memGetByte(text + 2);
        int b3 = memGetByte(text + 3);
        unicodeCP.put(0, ((b0 << 18) ^ (b1 << 12) ^ (b2 << 6) ^ (b3 ^ ((byte)0xF0 << 18 ^ ((byte)0x80 << 12) ^ ((byte)0x80 << 6) ^ ((byte)0x80 << 0)))));
        return 4;
    } else {
        throw new RuntimeException("Malformed character sequence");
    }
}

It assumes the text is well-formed, so does no validation and drops the length argument. If nk_utf_decode is indeed the bottleneck, you should see a measurable difference.

dustContributor commented 7 years ago

All right, I grabbed the GLFWDemo.java file you have in the repo.

Made the following changes:

Removed vsync via glfwSwapInterval( 0 );
Removed Calculator.layout and Demo.layout calls so nothing else gets drawn.
Added both the codepoint cache (albeit with HashMap<Integer,Integer>, so with a few differences from the previous snippet) and your decodeUTF8 method.
Drawing this view class below:

public class TextTest {
  private final IntBuffer compression = MemoryUtil.memAllocInt( 1 ).put( 0, 20 );
    private final String text;
    {
      StringBuilder bld = new StringBuilder( 2048 );
      for ( int i = 100; i-- > 0; ) {
      bld.append( "asddas asd asd ssd asd ddsasd asdasd " + i + ". " );
      }
      text = bld.toString();
    }

    public void accept ( NkContext ctx, int width, int height ) {
      try ( MemoryStack stack = stackPush() ) {
        final NkRect rect = NkRect.mallocStack( stack );
        // Widget takes 1/6 of screen height, and 1/3 of the width, centered at the bottom.
        int windowWidth = width / 3 * 2;
        int windowHeight = height / 6;
        int xpos = width / 2 - windowWidth / 2;
        int ypos = height - windowHeight;

        if ( nk_begin( ctx, "Asd", nk_rect( xpos, ypos, windowWidth, windowHeight, rect ), 0 ) ) {
          nk_layout_row_dynamic( ctx, 25, 1 );
          nk_label( ctx, "Header", NK_TEXT_CENTERED );
          nk_layout_row_dynamic( ctx, 25 * 50, 1 );
          nk_text_wrap( ctx, text );
          nk_layout_row_dynamic( ctx, 25, 1 );
          nk_property_int( ctx, "Compression ", 0, compression, 100, 10, 1 );
        }
      nk_end( ctx );
    }
  }
}

Results are:

Without cache nor Java side decoding: 13ms to 14ms per frame.
With Java side decoding: 11.5ms to 12ms.
With codepoint cache: 5.5ms to 6ms per frame.
With both cache and Java side decoding: 3.3ms to 3.5ms.

In the last case, the nnk_text_wrap call is the one showing in the sampling profiler, taking 70% of the time. Second call is DynCallback.ndcbArgPointer with 7% of the time. It shows 4% GPU utilization so it doesn't seems Nuklear is generating excessive geometry or something (or at least if it does, it's cheap enough to upload it to the GPU).

For comparison, my project can render a few omni lights, a directional light, and a bunch of cubes, with deferred rendering, normalmapping and shadowmapping for the directional light in less than 1ms per frame.

If I render the default Demo and Calculator views, without caches (so pretty much like the default GLFWDemo), it takes 0.5ms to 0.8ms per frame. But it has very little text.

My code to measure frame time is like this:

double acc = 0.0d;
int frames = 0;

while ( !glfwWindowShouldClose(win) ) {
  long timerStart = System.nanoTime();
  ++frames;
 /* All rendering here. */
  long diff = (System.nanoTime() - timerStart);
  acc += (diff / 1000000.0d);
  if(acc > 1000.0d) {
    System.out.printf( "Millis per frame ----- %.3f", Double.valueOf( acc / frames ) );
    System.out.printf( "FPS: %.3f", Double.valueOf( frames ) );
    System.out.println();
    acc = 0;
    frames = 0;
  }
}

Spasi commented 7 years ago

Thanks, I have confirmed my suspicion that the width callback is being called many more times than necessary. The problem is the internal nk_text_clamp function that has quadtratic behavior. If you have a string like "foobar", the callback will be called once for each of:

f
fo
foo
foob
fooba
foobar

You can imagine how inefficient this becomes when applied to each line of your long text. A workaround is to cache the last calculated width, like so:

// Disclaimer: not tested thoroughly
.width(new NkTextWidthCallback() {
    private long lastText;
    private int lastByteOffset;
    private float lastTextWidth;

    @Override public float invoke(long handle, float h, long text, int len) {
        float textWidth = 0.0f;
        try (MemoryStack stack = stackPush()) {
            IntBuffer unicode = stack.mallocInt(1);

            int byteOffset;
            if (lastText == text && lastByteOffset <= len) {
                byteOffset = lastByteOffset;
                textWidth = lastTextWidth;
            } else {
                byteOffset = 0;
                lastText = text;
                lastByteOffset = 0;
                lastTextWidth = 0.0f;
            }

            int currentGlyphBytes = nnk_utf_decode(text, memAddress(unicode), len);
            if (currentGlyphBytes == 0) {
                return lastTextWidth;
            }

            IntBuffer advance = stack.mallocInt(1);
            while (byteOffset <= len && currentGlyphBytes != 0) {
                if (unicode.get(0) == NK_UTF_INVALID) {
                    break;
                }

                /* query currently drawn glyph information */
                stbtt_GetCodepointHMetrics(fontInfo, unicode.get(0), advance, null);
                textWidth += advance.get(0) * scale;

                /* offset next glyph */
                currentGlyphBytes = nnk_utf_decode(text + byteOffset, memAddress(unicode), len - byteOffset);
                byteOffset += currentGlyphBytes;
            }
            lastByteOffset = byteOffset;
        }
        lastTextWidth = textWidth;
        return textWidth;
    }
})

Performance is reasonable with the above code. Using a metrics cache and Java UTF8 decoding on top makes it even faster, but it's not as impressive as before.

It might be a good idea to report this to Nuklear. Not sure if performance can be improved without new API though.

dustContributor commented 7 years ago

I see, makes sense. @vurtun Hello? How do I lots of text in Nuklear? We're having issues here with text wrapping widget with long (4k+ characters) strings. Apparently nk_text_wrap eatus up the CPU with long strings due internal behavior. I believe what I am intending to do is fairly reasonable, think something like Pillars of Eternity or Tyranny dialogue boxes, they have lots of scrollable text since conversations with NPCs can be long. But if it takes half of my frame budget I'm in trouble :sob:

Hopefully he has notifications enabled :smile:

vurtun commented 7 years ago

Hey @dustContributor and @Spasi. Text is an absolute nightmare in nuklear or rather this type of GUI design, since everything needs to be recalculated and drawn each frame. That is especially noticeable with text since it involves font glyph lookup, text width calculation requiring a text width callback as well as UTF-8 rune decoding which are awful performance hogs.

One issue I saw was related to high vertex count caused by text in nuklear which seems to quite high and the other was general slowness of text width calculations. Biggest problem here is that nk_text_wrap was not designed for a lot of text.

I believe what I am intending to do is fairly reasonable, think something like Pillars of Eternity or Tyranny dialogue boxes,

My question here is are these dialog boxes persistent? If so I would recommend creating some custom text state for these dialog boxes to store metrics for each line that only need to be recalculated if the size changes. Then for drawing you could iterate over each line and draw by nk_draw_text. This would also allow nuklear to clip text that is currently not visible and therefore hopefully further improves performance and vertex count.

By the way the reason why text takes up so much memory is because each glyph takes up 4 vertexes plus 6 indicies so if you multiply that out for all text visible that is quite a lot.

dustContributor commented 7 years ago

My question here is are these dialog boxes persistent?

No idea what you mean by "persistent" but they're like this

You can scroll up to see the previous responses in the widget, you have differently colored text in the same lines, and the responses are selectable text basically. Not quite sure how to do the last one in Nuklear. Moreover, a feature they added in Tyranny was text that when you hovered over, it'd spawn a popup explaining what it was. Say, when someone said "Back in the [War of the Tides] we...", hovering over "War of the Tides" would spawn a pop up explaining what that meant. Dialogue box isn't closed until you stop talking with the NPC.

In short, what you're saying is that I should do the text wrapping widget myself and use Nuklear like more of a drawing API. I guess I can do that, it'll take me a while with scrolling and all.

By the way the reason why text takes up so much memory is because each glyph takes up 4 vertexes plus 6 indicies so if you multiply that out for all text visible that is quite a lot.

Eh, say that we have 8k characters visible (I dunno, player is reading an in-game book or something), that is 4 verts 8000 chars 4 bytes per vert + 6 indices 8000 chars 2 bytes per index. 220kB. My GTX680 from 5 years ago has... 2GB. I think it should be fine.

But that's secondary. I think the CPU time spent is way more important given it can shave off half of your expected rendering time.

httpdigest commented 7 years ago

I think the CPU time spent is way more important given it can shave off half of your expected rendering time.

For static content there is always the obvious solution of simply rendering once into a (compressed) texture and drawing a texture-mapped quad afterwards.

Actually, I would have never imagined STBTrueType to be something which should recalculate and rerender the glyphs each frame. I would always bake it into a texture and render that texture then.

vurtun commented 7 years ago

No idea what you mean by "persistent" but they're like this

Persistent as in frame persistent. Nuklear throws away almost everything every frame. As it looks to me your dialog lifetime is persistent.

You can scroll up to see the previous responses in the widget, you have differently colored text in the same lines, and the responses are selectable text basically. Not quite sure how to do the last one in Nuklear.

How I would do it is have a class/struct for the dialog. The dialog object would both hold the complete string as well as a list of text objects (each holding number of characters in line or in your dialog case for different colored text sections as well) that get calculated in the beginning or on resize. Then for UI I would use a nk_group and iterate over all text objects and draw them with nk_text. The scrollbar would be handled by nuklear while the selectable options could be done with nk_selectable.

dustContributor commented 7 years ago

I would have never imagined STBTrueType to be something which should recalculate and rerender the glyphs each frame. I would always bake it into a texture and render that texture then.

Yup, that's exactly what the demo code does, we're not talking about STB's font rendering. It's about the glyph width calculation Nuklear uses for computing text line widths, which decides what geometry vertices/indices Nuklear generates. Glyphs are already there in the GPU, not the issue there given it's relatively very few draw calls (3 for all the text I was showing in my tests, for example) and no state changes whatsoever besides binding the baked glyph texture once.

What we did with Spasi was to cache the width results (among other things), which gets rid of a lot of the run time of each glyph width call. The next issue is that it's just called a lot for long texts in the nk_text_wrap widget, so no matter how cheap we make those 'width' calls, it's going to drag frame time down. Something a stateful widget could improve on.

The text in its entirety could be baked into a texture, then simulate the scrolling via offsetting the UVs, but I don't think it's really necessary to go that far. Avoiding the 'width' calls should be enough for normal game text usage and then some, since in the grand scheme of things it's not that much geometry, and its drawn with very few calls/state changes anyway.

However, it could become a problem as you add things like various fonts in various sizes, different colors and/or effects, etc. ie, piling up draw calls and state changes. I don't have such test case yet though, but I'm not too worried. I doubt any game UI has that much text crazy-ness going on, and even if it did, I'm not planning on doing such thing anyway.

How I would do it is [...]

Ah I see, I'll check out nk_selectable then. I'm guessing I'll have one nk_text per colored text right? What would the purpose of nk_group be in this case? I was just thinking of drawing each nk_text line in a static row and add as many as I needed.

vurtun commented 7 years ago

What would the purpose of nk_group be in this case?

nk_group mainly for scrollbar handling. Of course you could use nk_begin and nk_end as well depending on how you structure your code.

dustContributor commented 7 years ago

Okay but what nk_group has to do with scrolling? I'm not finding any related doc in the sources that explains what they're used for.

vurtun commented 7 years ago

nk_group is for panels inside panels. It is mainly used as a layouting widget. You could also say it is a scrollable section inside a panel. For example the nodes inside the node editor are groups.

dustContributor commented 7 years ago

Alright, thanks a lot! I'll see what I can do. I guess I'll go ahead and close this since it isn't LWJGL related.