Aeva / m.grl

Midnight Graphics & Recreation Library
http://mgrl.midnightsisters.org
GNU Lesser General Public License v3.0

rendering infrastructure overhaul #232

Open Aeva opened 8 years ago

Aeva commented 8 years ago

abstract

M.GRL performs very poorly when even a modest number of dynamic objects are in play at once. But this is a fixable problem, and I now have a much better idea of what kind of rendering architecture is needed to support this than I did when I started down the rabbit hole of "adding WebGL" a few years ago.

The following tickets contain relevant exploratory research on how to resolve the shortcomings of the engine: #189 (consolidated VBOs), #193 (related stress test), #195 (using instancing extension instead), #197 (glsl compiler support for instancing abstraction).

The most striking part of pursuing the instancing extension was one of my "control group" tests: just flattening out the rendering graph to draw as many of one object as possible. No extensions, no VBO foolery, just ~no branching whatsoever~ removing expensive operations and any render-time solving. The results were good enough (though it seems I didn't feel the need to document them anywhere obvious) that it became evident that I needed to employ code generation for any execution pathway that is time sensitive. The rendering bottleneck is entirely CPU bound, so that should be made to run as fast as possible, and that means removing unnecessary work at draw time. edit: to be clear, branching is actually not that expensive, but the driver properties scheme I came up with at the beginning is.
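
As a sketch of what that codegen looks like (mock gl object and simplified call signatures — not real WebGL or M.GRL code), the flattening test amounts to resolving everything once and baking the results into straight-line source:

```javascript
// Hypothetical sketch: instead of resolving driver properties and
// branching on every frame, resolve them once and emit straight-line
// source, compiled with the Function constructor.
function compileDrawPath(drawables) {
    var lines = [];
    for (var i = 0; i < drawables.length; i += 1) {
        var d = drawables[i];
        // values are baked in as literals at compile time, so the
        // generated function does no lookups or dispatch at draw time
        lines.push("gl.bindBuffer(" + d.vbo + ");");
        lines.push("gl.drawArrays(0, " + d.count + ");");
    }
    return new Function("gl", lines.join("\n"));
}
```

Calling the result with a gl-like object then executes the flattened draw path with no per-object work left in it.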

Combined with the alternative API proposed in #230, it should also be possible to create static elements in a scene without creating graph nodes for them, which was the show-stopping bottleneck for the instancing demo, both in terms of startup time and extreme memory consumption.

the new rendering pipeline

The new rendering pipeline works like this:

  1. One creates a RenderNode with a specific shader program, and assigns it a GraphNode
  2. Instancing a JTA asset creates a DrawableIR object associated with the instance, which creates IR bindings and can be used to generate the shader-specific IR needed to draw the asset, and to prune redundant gl calls by being context aware. Signals are used to keep these up to date, as well as to determine whether a property is dynamic or static.
  3. JTA instances are added into the graph as normal. The RenderNode's render function will be regenerated whenever there is a significant change, such as a new object being added, or a property switching from static to dynamic.
  4. GraphNodes can be forced to be all static, by calling their "freeze" method. This will greatly speed up the rendering performance, but the node will never be usable as a dynamic node again.
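
A minimal mock of the static/dynamic split in steps 3 and 4 (invented names, not the engine's actual GraphNode API): freezing a node bakes its current values into the regenerated function as literals, so later writes to it are ignored:

```javascript
// Mock codegen: frozen nodes compile to literals, dynamic nodes keep a
// runtime lookup into the node list.
function generateRender(nodes) {
    var src = "var out = [];\n";
    nodes.forEach(function (node, i) {
        if (node.frozen) {
            // static: the value is a literal in the compiled function
            src += "out.push(" + node.x + ");\n";
        } else {
            // dynamic: still read from the node every call
            src += "out.push(nodes[" + i + "].x);\n";
        }
    });
    src += "return out;";
    return new Function("nodes", src);
}

var nodes = [{x: 1, frozen: false}, {x: 2, frozen: false}];
nodes[1].frozen = true;   // "freeze": the current value 2 is baked in
var render = generateRender(nodes);
nodes[0].x = 10;          // dynamic node: update is visible
nodes[1].x = 99;          // frozen node: update is ignored
// render(nodes) → [10, 2]
```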

There is also now room for the idea of instancing something as fully static right away, bypassing the GraphNode constructor completely.

For drawables that do not yet implement the __ir object, the old rendering pathway is called at the end, but objects in the new rendering pathway are omitted from sorting. It might be desired to force objects into the old pathway anyway, so that instancing them does not trigger a recompile event.

related areas for improvement

These should probably be forked out into other issues, but are worth mentioning here for now:

Shader program context switching can be made a ton faster by assuming that an asset attach is created with a specific shader in mind, and only has those bindings. This would allow us to remove a ton of extra work when the shader program needs to be regenerated. The whole business of connecting uniforms and enabling / disabling attributes could also be sped up via code generation.

It is possible that the compositing graph could also be similarly improved, since its behavior is pretty static once established.

fuzzy bits

~Code generation should also be used here. With static draw functions being generated per RenderNode (and thus per shader) now, this should be pretty trivial to implement.~

It looks like particles will work more or less fine as is. The new draw functions as implemented also allow for legacy drawing, at least until everything is ported over.

I'm at a loss for a way to both be able to do z-sorting and gain the benefit of redundant call pruning. But if that existed, then this codegen branch would probably be irrelevant.

Lacking an obvious ideal way, forcing all alpha blended objects to be rendered via the legacy path is probably an ok short term solution.

In the long term, the answer might be to make it so that transparency stuff can be done in a separate rendering pass and then easily composited together with the main rendering pass. This would have the added benefit of making it easier to swap out OIT algorithms and the like. The downside is that unified renderers are maybe more complex to write? But that might be unavoidable anyway.

It seems likely that the long term approach will benefit from an easy way for RenderNodes to customize the codegen process.

The solution here is probably something similar to treating the object's visibility as a dynamic property, but it results in putting its IR inside of an if-block. It is maybe slightly more complicated than that, because it should also be aware of any parent's visibility.
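
One way this could look (hypothetical helper, not engine code): the codegen wraps the node's IR in a guard that walks the parent chain, so hiding an ancestor hides the subtree without a recompile:

```javascript
// Wrap a node's compiled IR source in a visibility test covering the
// node and all of its ancestors.
function wrapVisibility(node, body_src) {
    var checks = [];
    for (var n = node; n; n = n.parent) {
        checks.push("nodes[" + n.id + "].visible");
    }
    return "if (" + checks.join(" && ") + ") {\n" + body_src + "\n}";
}

// tiny demo graph: root → child
var root = {id: 0, visible: true, parent: null};
var child = {id: 1, visible: true, parent: root};
var nodes = {0: root, 1: child};
var draw = new Function("nodes", "out",
                        wrapVisibility(child, "out.push('drew child');"));
```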

the IR

The term "IR" is a little fuzzy at the moment, but it is one of the following:

JSIR objects currently represent either function calls or assignment statements.
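
Guesswork at the shape rather than the engine's actual JSIR class, but roughly: an IR node is either a call or an assignment, and compiling an IR list is just printing it back out as javascript source:

```javascript
// Hypothetical JSIR node: kind is "call" or "assign", target is the
// callee or lhs, args are argument expressions (or the rhs) as strings.
function JSIR(kind, target, args) {
    this.kind = kind;
    this.target = target;
    this.args = args;
}
JSIR.prototype.print = function () {
    if (this.kind === "call") {
        return this.target + "(" + this.args.join(", ") + ");";
    }
    return this.target + " = " + this.args[0] + ";";
};

// an IR list compiles to a flat block of statements
var ir = [
    new JSIR("call", "gl.enable", ["gl.DEPTH_TEST"]),
    new JSIR("assign", "cache.viewport", ["[0, 0, 640, 480]"]),
];
var src = ir.map(function (op) { return op.print(); }).join("\n");
```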

development plan

This is going to be a long running task, and will be made easiest if I can break it down into small parts. Development will happen in the branch fast_graph. A loose outline of where to start follows:

epilogue

If performance is good and all of the demos still work, then fast_graph should be merged into master. The following things can then be made into new issues, building off of this work:

With the graph overhaul done, the following would be the next ideal place to work on:

Aeva commented 8 years ago

This'll probably keep me busy until September XD January D: sometime in 2017 D: D: D:

Aeva commented 8 years ago

Good progress this weekend - the fast_graph branch now has JTA instances using code generation for the draw paths. Without applying things like state sorting or having the entire draw path as a generated function, performance seems to be about the same. If anything, it feels like it gets up to speed faster, which is encouraging, though I don't have a good way to measure this. Average frame rate is about the same, which is good given that this is the unoptimized form.

This pathway is a lot stricter about binding contexts - if a uniform isn't marked as such, then it doesn't get an IR in the codegen path. This breaks most of the demos. I think once I have a regen mechanism in place, that same path could be used to add more uniform bindings as needed.

This is what I think needs to be done next:

Once this is done and the demos are no longer broken due to overly strict binding context behavior, I think the next step would be to start on creating a monolithic draw function.

Aeva commented 8 years ago

Ok, this seems to work really great now. The location_picking demo now runs at 60fps without using any fancy pseudo-static modes.

Aeva commented 8 years ago

Todo:

Other Regressions:

Other Bugs:

Performance Observations:

[1] adaptive bindings

The missing attribute errors are caused when the graph is initialized without activating the desired shader program first. Therefore, the compiled static draw methods assume that cache.prog.attrs[attr_name] exists when required, regardless of whether the shader program provides it.

It is currently an assumption in M.GRL that a given graph root can be rendered with multiple renderers, and it is desirable to me to keep this flexibility. It might be that the simplest solution to this problem is to make the static drawing function the domain of render nodes and not the graph root, and try to move towards treating the graph as a data structure.

Aeva commented 8 years ago

Some thoughts for where to go next:

I think it might make sense to put together some basic state sorting while keeping in mind that I should put together the picking pass stuff next.

  1. The static draw function should be moved from the graph root into the render nodes. This fixes the problem of needing different bindings for different shaders, and also eliminates the need to associate an entire graph with a specific shader program. Compilation would likely be triggered via signals. edit: this is done now
  2. The "ir list" that is generated in m.codegen.js (called by m.jta.js) and consumed by m.graph.js should be replaced with an object with some properties that we can use for sorting (vbo_id, ibo_id, current static sampler bindings, etc)
  3. The object should be able to compile to one of several "profiles". So far, this is probably just "generic" and "picking".
  4. Throw together some rudimentary state sorting. Probably sort for VBO, IBO, static samplers, and then static unis.
  5. Eliminate redundant static calls, ideally on the fly.
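
Step 4 might look something like this (hypothetical fields and a simplified key scheme): build a sort key out of the shared state each drawable needs, so identical setups land next to each other and the redundant rebinds between them can be pruned:

```javascript
// Sort drawables by the GL state they require: VBO, then IBO, then
// static sampler bindings. Adjacent drawables with identical keys can
// then share one setup block in the generated draw function.
function stateSort(drawables) {
    return drawables.slice().sort(function (a, b) {
        var ka = [a.vbo_id, a.ibo_id, a.samplers.join(",")].join("|");
        var kb = [b.vbo_id, b.ibo_id, b.samplers.join(",")].join("|");
        return ka < kb ? -1 : (ka > kb ? 1 : 0);
    });
}
```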

It would be good if we could eliminate some redundancy by doing the picking stuff in tandem with the generic draw pass. I think this would work ok, because the picking pass is just a uniform and the binds, so the more constrained sorting could just be re-used.

Aeva commented 8 years ago

Ok, the graph draw function is now generated by the RenderNode (which has a 1:1 relationship with a shader program) rather than by the SceneGraph (whose objects can be associated with many RenderNodes and shader programs).

I added a lispy 'quote' macro, so the entire static draw function is just generated in one go and replaces the RenderNode's 'render' method.

As mentioned in commit a51d55336080f8f09ddaaf1a84920fc396eb2b80, there are some places where I feel this is half-assed and could use revision:

Aeva commented 8 years ago

also note the following demos are still broken, but should be easier to resolve now:

Aeva commented 8 years ago

Currently the jta resources generate their IR once, at load time, with respect to the currently active shader, and never again. The main problem this results in is that the attributes and uniforms won't necessarily match what is needed to draw under the desired shader program.

The ideal case might instead be an IR object, which puts together the initial bindings, but is able to dynamically add new bindings as needed. Then, when it is time to compile, a method would be called to return the specific IR list needed to draw the object in the current shader.

There is now a 'DrawableIR' object that has a "generate" function that takes a shader program as an argument and outputs an IR list that can be compiled. This also does the same kind of binding tracking as before.
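
A speculative mock of that idea (not the actual DrawableIR implementation): bindings are collected up front, and "generate" filters them down to what the given shader program actually declares:

```javascript
// Mock DrawableIR: bindings map uniform names to source expressions.
// generate(prog) returns only the IR lines the program can consume.
function DrawableIR(bindings) {
    this.bindings = bindings;
}
DrawableIR.prototype.generate = function (prog) {
    var out = [];
    for (var name in this.bindings) {
        if (prog.uniforms.indexOf(name) !== -1) {
            out.push("prog.vars['" + name + "'] = " + this.bindings[name] + ";");
        }
    }
    return out;
};
```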

Some notes:

Aeva commented 8 years ago

Currently the location picking demo in master is faster (60fps average on the x200) than the one in fast_graph (40fps). However, the one in master has some hacks to make some drawables immutable, whereas the tiles in fast_graph are all dynamic (so they call and upload 4 driver functions each before drawing). ... Though when I disable that optimization, it still hits 60, which doesn't seem right o.O When I disable the optimizations (static draw node, manual cache invalidation) on the demo in master, it drops down to about 28fps. Woo :)

So using the location_picking demo to compare only dynamic rendering, fast branch is about 12fps faster than master on my x200 :)

Aeva commented 8 years ago

I added a "freeze" method to GraphNode instances, which causes them to be drawn entirely as static objects. The performance on location_picking shoots up to 60hz in testing with it, giving equivalent or better performance than static drawing, but with a much nicer API and memory usage footprint.

Dynamic drawing is still slow (but marginally faster!) for large numbers of objects.

Aeva commented 8 years ago

Current regressions in demos:

✶ In a newer web browser than the one I've been using, gl errors are thrown stating that the ranges are wrong for [gl draw elements?] calls. Unsure if that is actually the problem with the light demo, though, because fussing around with a picture-in-picture pass at least shows that the gbuffers are rendering correctly.

tasks:

Aeva commented 8 years ago

Basic outline for re-implementing object picking:

  1. PickingNode singleton
    • low resolution buffer
    • use the draw function generator from RenderNode
    • all nodes considered pickable, for now (except particles)
    • shader should limit what properties need to be set
    • picking index should work a little differently: statics should be given incremented IDs upon the picking pass being compiled. Non statics should be enumerated in gl_tick as currently, but not starting from 0
    • graph root ".set_picking_target()". First graph root instance gets it for free.
  2. new picking pass shader (mrt, floating point textures etc) see #192
  3. connect to existing API
  4. remove dead code from old picking system
  5. test, debug, etc
  6. burn-in mechanism? Probably not needed, especially if the end goal is to target asm.js. The rationale being that if mouseover is turned on, then this will effectively do the burn-in, and if it isn't, picking events will happen infrequently enough that it won't be needed.
  7. "classic" variation that doesn't use mrt or floating point buffers

Aeva commented 8 years ago

Got sidetracked exploring a simpler case study of codegen as an optimization for a blog post, which didn't play out as expected. Which is fine I guess, as it helped clarify how mgrl's NOGL renderer should work.

In the meantime, I've been puzzling over what would be best to do with alpha blending, and noted some things in the ticket's description above. Might continue with either that or picking next. The two have some similarities though, so it might be good for me to sleep on it a little more and figure out what infrastructure can support both best.

Aeva commented 8 years ago

Moved reimplementing the picking system out into issue #237 to keep notes tidy.

Aeva commented 7 years ago

Merged pick_nouveau into fast_graph, now that issue #237 is closed!!!! :O

Now that the picking system has been revised, and all of the demos reliant on it have been updated, the following needs to be done:

Current regressions in demos:

✶ In a newer web browser than the one I've been using, gl errors are thrown stating that the ranges are wrong for [gl draw elements?] calls. Unsure if that is actually the problem with the light demo, though, because fussing around with a picture-in-picture pass at least shows that the gbuffers are rendering correctly.

✶✶ The particle demo uses the "image instance" drawing path, so maybe a good starting point is to see if that is also similarly broken. update: yeah, this is likely the problem. Trying to instance the "error" image in another demo spits out this: Error: WebGL: drawElements: bound vertex attribute buffers do not have sufficient size for given indices from the bound element array

✶✶✶ This seems to be because the image uri references in the jta file still use ":" as a delimiter, instead of "_". Commit f377f81515f5d8dff9a4d969cee3bdda74c5b64c renames a bunch of image files, but doesn't update any JTA files, so this bug is probably present in master. Manually correcting the paths in the JTA file makes the problem go away, but to fix this correctly, either the blend file needs to be updated, or the exporter does. I'm not sure which, and either way this is probably the scope of a new issue, especially since it is most likely present in master too. update: it is definitely that the images were renamed, but also that the search paths were wrong in a few demos.

tasks:

Aeva commented 7 years ago

Was feeling a desire to work on features instead of fixing things =) so, the last few commits implement an API for "stamping" assets into a SceneGraph instance without actually creating individual GraphNodes for them. The new "static_drawing" demo demonstrates the API for this. For drawing ~10200 objects, this isn't running as fast as it could be. The problem is obvious when looking at the generated rendering code, where you get a few thousand lines of array declaration syntax... oops.

So, this could maybe be improved by instead generating and storing a single long float array at compile time, and referencing sections of that instead.

This could probably be accomplished by tracking the graph ID of the stamp, and storing the IR in an object instead of an array, allowing for the space needed to be allocated at compile time.

Sorting by the stamp's graph ID might not even be needed for this, though it would definitely be helpful later on if we were to also add the potential for automatic instancing with this, too.

There could be some kind of 'Heap' object, where when things get compiled, they request a lease in the heap, and the Heap object will return the starting index for their lease. This would allow for something like subarrays to be used here. At the end of the compilation process, the Heap object reserves the memory it needs. Also something something cache locality.

It could also use an ArrayBuffer for reserving memory, and typed arrays to access the data, instead of using subarray. I think this would allow the heap to serve multiple data types, but I'm worried that creating a new typed array for every access would just result in lots of object churn. Idk, hopefully compilers would recognize that all I really want here is a pointer and a type cast.
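
A sketch of the Heap idea under those assumptions (hypothetical API, not anything in the codebase): leases are tallied during compilation, the backing ArrayBuffer is allocated once at the end, and each lease is served as a subarray view into one shared Float32Array:

```javascript
// Hypothetical compile-time float heap: lease() hands out starting
// indices during compilation, finalize() makes the single allocation,
// and view() returns a subarray window into the shared store.
function Heap() {
    this.size = 0;      // floats requested so far
}
Heap.prototype.lease = function (count) {
    var start = this.size;
    this.size += count;
    return start;       // index is known at compile time
};
Heap.prototype.finalize = function () {
    // one allocation for everything; views are cheap slices of it
    this.buffer = new ArrayBuffer(this.size * 4);
    this.floats = new Float32Array(this.buffer);
};
Heap.prototype.view = function (start, count) {
    return this.floats.subarray(start, start + count);
};
```

Writes through any view land in the shared backing store, which is what would let the generated code pass contiguous sections straight to uniform uploads.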

I have no idea what a good solution is. This is probably the threshold where games should just use instancing.

Aeva commented 7 years ago

Ok, new branch "heapexperiment" tracks trying to speed up the static drawing demo. It doesn't work right now though, for reasons which are detailed in the last commit message. Considering forking this off into its own issue #238, but the gist is that the JS IR objects created for uniform uploads should be replaced with IR that just produces the actual GL call. From what I've been able to figure out, that will cause the 'new Float32Array' stuff in the compiled output to basically function as a type cast instead of a new object. God damn it, javascript >>

Aeva commented 7 years ago

Closed out issue #238, and merged the resulting branch back into fast_graph.

Aeva commented 7 years ago

Only thing that is overtly broken at this point is the lighting demo (#244). Once that is fixed, it's just a couple of loose ends before I'm comfortable merging this into master! O_O

Aeva commented 7 years ago

Did some profiling tonight, since I discovered google chrome's profiler doesn't have any problems with the generated functions, unlike firefox developer edition. Right now, one of the most expensive things is the "__regen_glsl_bindings" stuff, which is something I want to get rid of anyway.

The actual generated rendering functions occupy a very small percentage of the run time (about 5%) in the lighting demo!!! Most of the demos run at 60fps (picking causes slight slowdowns). Demos in master also run fairly well in chrome, maybe a few fps slower. The "static_drawing" demo runs at about 55fps (~40 in firefox) - the generated function ends up occupying the majority of the run time percentage, but at this point, the sheer bulk of gl calls (for 3721 objects) within it also shows up right next to it, so I can't feel too bad about that. The static drawing demo shows that this is definitely a point where there would be a lot to gain from instancing.

Aeva commented 7 years ago

Merged the instancing branch into fast_graph <3