Aeva / m.grl

Midnight Graphics & Recreation Library
http://mgrl.midnightsisters.org
GNU Lesser General Public License v3.0

rendering infrastructure overhaul #232

Open Aeva opened 8 years ago

Aeva commented 8 years ago

abstract

M.GRL performs very poorly when even a modest number of dynamic objects are in play at once. But this is a fixable problem, and I now have a much better idea of what kind of rendering architecture is needed to support this than I did when I started down the rabbit hole of "adding WebGL" a few years ago.

The following tickets contain relevant exploratory research on how to resolve the shortcomings of the engine: #189 (consolidated VBOs), #193 (related stress test), #195 (using instancing extension instead), #197 (glsl compiler support for instancing abstraction).

The most striking part of pursuing the instancing extension was one of my "control group" tests: just flattening out the rendering graph to draw as many of one object as possible. No extensions, no VBO foolery, just ~no branching whatsoever~ removing expensive operations and any render-time solving. The results were good enough (though it seems I didn't feel the need to document them anywhere obvious) that it became evident that I needed to employ code generation for any execution pathway that is time sensitive. The rendering bottleneck is entirely CPU bound, so that should be made to run as fast as possible, and that means removing unnecessary work at draw time. edit: to be clear, branching is actually not that expensive, but the driver properties scheme I came up with at the beginning is.
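
As a sketch of what that codegen looks like (mock gl object and simplified call signatures — not real WebGL or M.GRL code), the flattening test amounts to resolving everything once and baking the results into straight-line source:

```javascript
// Hypothetical sketch: instead of resolving driver properties and
// branching on every frame, resolve them once and emit straight-line
// source, compiled with the Function constructor.
function compileDrawPath(drawables) {
    var lines = [];
    for (var i = 0; i < drawables.length; i += 1) {
        var d = drawables[i];
        // values are baked in as literals at compile time, so the
        // generated function does no lookups or dispatch at draw time
        lines.push("gl.bindBuffer(" + d.vbo + ");");
        lines.push("gl.drawArrays(0, " + d.count + ");");
    }
    return new Function("gl", lines.join("\n"));
}
```

Calling the result with a gl-like object then executes the flattened draw path with no per-object work left in it.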

Combined with the alternative API proposed in #230, it should also be possible to create static elements in a scene without creating graph nodes for them, which was the show-stopping bottleneck for the instancing demo, both in terms of startup time and extreme memory consumption.

the new rendering pipeline

The new rendering pipeline works like this:

  1. One creates a RenderNode with a specific shader program, and assigns it a GraphNode
  2. Instancing a JTA asset creates a DrawableIR object associated with the instance, which creates IR bindings and can be used to generate the shader-specific IR needed to draw the asset, and to prune redundant gl calls by being context aware. Signals are used to keep these up to date, as well as to determine whether a property is dynamic or static.
  3. JTA instances are added into the graph as normal. The RenderNode's render function will be regenerated whenever there is a significant change, such as a new object being added, or a property switching from static to dynamic.
  4. GraphNodes can be forced to be all static, by calling their "freeze" method. This will greatly speed up the rendering performance, but the node will never be usable as a dynamic node again.
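
A minimal mock of the static/dynamic split in steps 3 and 4 (invented names, not the engine's actual GraphNode API): freezing a node bakes its current values into the regenerated function as literals, so later writes to it are ignored:

```javascript
// Mock codegen: frozen nodes compile to literals, dynamic nodes keep a
// runtime lookup into the node list.
function generateRender(nodes) {
    var src = "var out = [];\n";
    nodes.forEach(function (node, i) {
        if (node.frozen) {
            // static: the value is a literal in the compiled function
            src += "out.push(" + node.x + ");\n";
        } else {
            // dynamic: still read from the node every call
            src += "out.push(nodes[" + i + "].x);\n";
        }
    });
    src += "return out;";
    return new Function("nodes", src);
}

var nodes = [{x: 1, frozen: false}, {x: 2, frozen: false}];
nodes[1].frozen = true;   // "freeze": the current value 2 is baked in
var render = generateRender(nodes);
nodes[0].x = 10;          // dynamic node: update is visible
nodes[1].x = 99;          // frozen node: update is ignored
// render(nodes) → [10, 2]
```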

There is also now room for the idea of instancing something as fully static right away, bypassing the GraphNode constructor completely.

For drawables that do not yet implement the __ir object, the old rendering pathway is called at the end, but objects in the new rendering pathway are omitted from sorting. It might be desired to force objects into the old pathway anyway, so that instancing them does not trigger a recompile event.

related areas for improvement

These should probably be forked out into other issues, but are worth mentioning here for now:

Shader program context switching can be made a ton faster by assuming that an asset attach is created with a specific shader in mind, and only has those bindings. This would allow us to remove a ton of extra work when the shader program needs to be regenerated. The whole business of connecting uniforms and enabling / disabling attributes could also be sped up via code generation.

It is possible that the compositing graph could also be similarly improved, since its behavior is pretty static once established.

fuzzy bits

~Code generation should also be used here. With static draw functions being generated per RenderNode (and thus per shader) now, this should be pretty trivial to implement.~

It looks like particles will work more or less fine as is. The new draw functions as implemented also allow for legacy drawing, at least until everything is ported over.

I'm at a loss for a way to both be able to do z-sorting and gain the benefit of redundant call pruning. But if that existed, then this codegen branch would probably be irrelevant.

Lacking an obvious ideal way, forcing all alpha blended objects to be rendered via the legacy path is probably an ok short term solution.

In the long term, the answer might be to make it so that transparency stuff can be done in a separate rendering pass and then easily composited together with the main rendering pass. This would have the added benefit of making it easier to swap out OIT algorithms and the like. The downside is that unified renderers are maybe more complex to write? But that might be unavoidable anyway.

It seems likely that the long term approach will benefit from an easy way for RenderNodes to customize the codegen process.

The solution here is probably something similar to treating the object's visibility as a dynamic property, but it results in putting its IR inside of an if-block. It is maybe slightly more complicated than that, because it should also be aware of any parent's visibility.
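
One way this could look (hypothetical helper, not engine code): the codegen wraps the node's IR in a guard that walks the parent chain, so hiding an ancestor hides the subtree without a recompile:

```javascript
// Wrap a node's compiled IR source in a visibility test covering the
// node and all of its ancestors.
function wrapVisibility(node, body_src) {
    var checks = [];
    for (var n = node; n; n = n.parent) {
        checks.push("nodes[" + n.id + "].visible");
    }
    return "if (" + checks.join(" && ") + ") {\n" + body_src + "\n}";
}

// tiny demo graph: root → child
var root = {id: 0, visible: true, parent: null};
var child = {id: 1, visible: true, parent: root};
var nodes = {0: root, 1: child};
var draw = new Function("nodes", "out",
                        wrapVisibility(child, "out.push('drew child');"));
```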

the IR

The term "IR" is a little fuzzy at the moment, but it is one of the following:

JSIR objects currently represent either function calls or assignment statements.
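
Guesswork at the shape rather than the engine's actual JSIR class, but roughly: an IR node is either a call or an assignment, and compiling an IR list is just printing it back out as javascript source:

```javascript
// Hypothetical JSIR node: kind is "call" or "assign", target is the
// callee or lhs, args are argument expressions (or the rhs) as strings.
function JSIR(kind, target, args) {
    this.kind = kind;
    this.target = target;
    this.args = args;
}
JSIR.prototype.print = function () {
    if (this.kind === "call") {
        return this.target + "(" + this.args.join(", ") + ");";
    }
    return this.target + " = " + this.args[0] + ";";
};

// an IR list compiles to a flat block of statements
var ir = [
    new JSIR("call", "gl.enable", ["gl.DEPTH_TEST"]),
    new JSIR("assign", "cache.viewport", ["[0, 0, 640, 480]"]),
];
var src = ir.map(function (op) { return op.print(); }).join("\n");
```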

development plan

This is going to be a long running task, and will be made easiest if I can break it down into small parts. Development will happen in the branch fast_graph. A loose outline of where to start follows:

epilogue

If performance is good and all of the demos still work, then fast_graph should be merged into master. The following things can then be made into new issues, building off of this work:

With the graph overhaul done, the following would be the next ideal place to work on:

Aeva commented 8 years ago

This'll probably keep me busy until September XD January D: sometime in 2017 D: D: D:

Aeva commented 8 years ago

Good progress this weekend - the fast_graph branch now has JTA instances using code generation for the draw paths. Without applying things like state sorting or having the entire draw path as a generated function, performance seems to be about the same. If anything, it feels like it gets up to speed faster, which is encouraging, though I don't have a good way to measure this. Average frame rate is about the same, which is good given that this is the unoptimized form.

This pathway is a lot stricter about binding contexts - if a uniform isn't marked as such, then it doesn't get an IR in the codegen path. This breaks most of the demos. I think once I have a regen mechanism in place, that same path could be used to add more uniform bindings as needed.

This is what I think needs to be done next:

Once this is done and the demos are no longer broken due to overly strict binding context behavior, I think the next step would be to start on creating a monolithic draw function.

Aeva commented 8 years ago

Ok, this seems to work really great now. The location_picking demo now runs at 60fps without using any fancy pseudo-static modes.

Aeva commented 8 years ago

Todo:

Other Regressions:

Other Bugs:

Performance Observations:

[1] adaptive bindings

The missing attribute errors are caused when the graph is initialized without activating the desired shader program first. Therefore, the compiled static draw methods assume that cache.prog.attrs[attr_name] exists when required, regardless of whether the shader program provides it.

It is currently an assumption in M.GRL that a given graph root can be rendered with multiple renderers, and it is desirable to me to keep this flexibility. It might be that the simplest solution to this problem is to make the static drawing function the domain of render nodes and not the graph root, and try to move towards treating the graph as a data structure.

Aeva commented 8 years ago

Some thoughts for where to go next:

I think it might make sense to put together some basic state sorting while keeping in mind that I should put together the picking pass stuff next.

  1. The static draw function should be moved from the graph root into the render nodes. This fixes the problem of needing different bindings for different shaders, and also eliminates the need to associate an entire graph with a specific shader program. Compilation would likely be triggered via signals. edit: this is done now
  2. The "ir list" that is generated in m.codegen.js (called by m.jta.js) and consumed by m.graph.js should be replaced with an object with some properties that we can use for sorting (vbo_id, ibo_id, current static sampler bindings, etc)
  3. The object should be able to compile to one of several "profiles". So far, this is probably just "generic" and "picking".
  4. Throw together some rudimentary state sorting. Probably sort for VBO, IBO, static samplers, and then static unis.
  5. Eliminate redundant static calls, ideally on the fly.
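
Step 4 might look something like this (hypothetical fields and a simplified key scheme): build a sort key out of the shared state each drawable needs, so identical setups land next to each other and the redundant rebinds between them can be pruned:

```javascript
// Sort drawables by the GL state they require: VBO, then IBO, then
// static sampler bindings. Adjacent drawables with identical keys can
// then share one setup block in the generated draw function.
function stateSort(drawables) {
    return drawables.slice().sort(function (a, b) {
        var ka = [a.vbo_id, a.ibo_id, a.samplers.join(",")].join("|");
        var kb = [b.vbo_id, b.ibo_id, b.samplers.join(",")].join("|");
        return ka < kb ? -1 : (ka > kb ? 1 : 0);
    });
}
```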

It would be good if we could eliminate some redundancy by doing the picking stuff in tandem with the generic draw pass. I think this would work ok, because the picking pass is just a uniform and the binds, so the more constrained sorting could just be re-used.

Aeva commented 8 years ago

Ok, the graph draw function is now generated by the RenderNode (which has a 1:1 relationship with a shader program) rather than by the SceneGraph (whose objects can be associated with many RenderNodes and shader programs).

I added a lispy 'quote' macro, so the entire static draw function is just generated in one go and replaces the RenderNode's 'render' method.

As mentioned in commit a51d55336080f8f09ddaaf1a84920fc396eb2b80, there are some places where I feel this is half-assed and could use revision:

Aeva commented 8 years ago

also note the following demos are still broken, but should be easier to resolve now:

Aeva commented 8 years ago

Currently the jta resources generate their IR once, at load time, with respect to the currently active shader, and never again. The main problem this results in is that the attributes and uniforms won't necessarily match what is needed to draw under the desired shader program.

The ideal case might instead be an IR object, which puts together the initial bindings, but is able to dynamically add new bindings as needed. Then, when it is time to compile, a method would be called to return the specific IR list needed to draw the object in the current shader.

There is now a 'DrawableIR' object that has a "generate" function that takes a shader program as an argument and outputs an IR list that can be compiled. This also does the same kind of binding tracking as before.
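
A speculative mock of that idea (not the actual DrawableIR implementation): bindings are collected up front, and "generate" filters them down to what the given shader program actually declares:

```javascript
// Mock DrawableIR: bindings map uniform names to source expressions.
// generate(prog) returns only the IR lines the program can consume.
function DrawableIR(bindings) {
    this.bindings = bindings;
}
DrawableIR.prototype.generate = function (prog) {
    var out = [];
    for (var name in this.bindings) {
        if (prog.uniforms.indexOf(name) !== -1) {
            out.push("prog.vars['" + name + "'] = " + this.bindings[name] + ";");
        }
    }
    return out;
};
```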

Some notes:

Aeva commented 8 years ago

Currently the location picking demo in master is faster (60fps average on the x200) than the one in fast_graph (40fps). However, the one in master has some hacks to make some drawables immutable, whereas the tiles in fast_graph are all dynamic (so they call and upload 4 driver functions each before drawing). ... Though when I disable that optimization, it still hits 60, which doesn't seem right o.O When I disable the optimizations (static draw node, manual cache invalidation) on the demo in master, it drops down to about 28fps. Woo :)

So using the location_picking demo to compare only dynamic rendering, fast branch is about 12fps faster than master on my x200 :)

Aeva commented 8 years ago

I added a "freeze" method to GraphNode instances, which causes them to be drawn entirely as static objects. The performance on location_picking shoots up to 60hz in testing with it, giving equivalent or better performance than static drawing, but with a much nicer API and memory usage footprint.

Dynamic drawing is still slow (but marginally faster!) for large numbers of objects.

Aeva commented 8 years ago

Current regressions in demos:

✶ In a newer web browser than the one I've been using, gl errors are thrown stating that the ranges are wrong for [gl draw elements?] calls. Unsure if that is actually the problem with the light demo, though, because fussing around with a picture-in-picture pass at least shows that the gbuffers are rendering correctly.

tasks:

Aeva commented 8 years ago

Basic outline for re-implementing object picking:

  1. PickingNode singleton
    • low resolution buffer
    • use the draw function generator from RenderNode
    • all nodes considered pickable, for now (except particles)
    • shader should limit what properties need to be set
    • picking index should work a little differently: statics should be given incremented IDs upon the picking pass being compiled. Non statics should be enumerated in gl_tick as currently, but not starting from 0
    • graph root ".set_picking_target()". First graph root instance gets it for free.
  2. new picking pass shader (mrt, floating point textures etc) see #192
  3. connect to existing API
  4. remove dead code from old picking system
  5. test, debug, etc
  6. burn-in mechanism? Probably not needed, especially if the end goal is to target asm.js. The rationale being that if mouseover is turned on, then this will effectively do the burn-in, and if it isn't, picking events will happen infrequently enough that it won't be needed.
  7. "classic" variation that doesn't use mrt or floating point buffers

Aeva commented 8 years ago

Got sidetracked exploring a simpler case study of codegen as an optimization for a blog post, which didn't play out as expected. Which is fine I guess, as it helped clarify how mgrl's NOGL renderer should work.

In the meantime, I've been puzzling over what would be best to do with alpha blending, and noted some things in the ticket's description above. Might continue with either that or picking next. The two have some similarities though, so it might be good for me to sleep on it a little more and figure out what infrastructure can support both best.

Aeva commented 8 years ago

Moved reimplementing the picking system out into issue #237 to keep notes tidy.

Aeva commented 7 years ago

Merged pick_nouveau into fast_graph, now that issue #237 is closed!!!! :O

Now that the picking system has been revised, and all of the demos reliant on it have been updated, the following needs to be done:

Current regressions in demos:

✶ In a newer web browser than the one I've been using, gl errors are thrown stating that the ranges are wrong for [gl draw elements?] calls. Unsure if that is actually the problem with the light demo, though, because fussing around with a picture-in-picture pass at least shows that the gbuffers are rendering correctly.

✶✶ The particle demo uses the "image instance" drawing path, so maybe a good starting point is to see if that is also similarly broken. update: yeah, this is likely the problem. Trying to instance the "error" image in another demo spits out this: Error: WebGL: drawElements: bound vertex attribute buffers do not have sufficient size for given indices from the bound element array

✶✶✶ This seems to be because the image uri references in the jta file still use ":" as a delimiter, instead of "_". Commit f377f81515f5d8dff9a4d969cee3bdda74c5b64c renames a bunch of image files, but doesn't update any JTA files, so this bug is probably present in master. Manually correcting the paths in the JTA file makes the problem go away, but to fix this correctly, either the blend file needs to be updated, or the exporter does. I'm not sure which, and either way this is probably the scope of a new issue, especially since it is most likely present in master too. update: it is definitely that the images were renamed, but also that the search paths were wrong in a few demos.

tasks:

Aeva commented 7 years ago

Was feeling a desire to work on features instead of fixing things =) so, the last few commits implement an API for "stamping" assets into a SceneGraph instance without actually creating individual GraphNodes for them. The new "static_drawing" demo demonstrates the API for this. For drawing ~10200 objects, this isn't running as fast as it could be. The problem is obvious when looking at the generated rendering code, where you get a few thousand lines of array declaration syntax... oops.

So, this could maybe be improved by instead generating and storing a single long float array at compile time, and referencing sections of that instead.

This could probably be accomplished by tracking the graph ID of the stamp, and storing the IR in an object instead of an array, allowing for the space needed to be allocated at compile time.

Sorting by the stamp's graph ID might not even be needed for this, though it would definitely be helpful later on if we were to also add the potential for automatic instancing with this, too.

There could be some kind of 'Heap' object, where when things get compiled, they request a lease in the heap, and the Heap object will return the starting index for their lease. This would allow for something like subarrays to be used here. At the end of the compilation process, the Heap object reserves the memory it needs. Also something something cache locality.

It could also use an ArrayBuffer for reserving memory, and typed arrays to access the data, instead of using subarray. I think this would allow the heap to serve multiple data types, but I'm worried that creating a new typed array for every access would just result in lots of object churn. Idk, hopefully compilers would recognize that all I really want here is a pointer and a type cast.
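
A sketch of the Heap idea under those assumptions (hypothetical API, not anything in the codebase): leases are tallied during compilation, the backing ArrayBuffer is allocated once at the end, and each lease is served as a subarray view into one shared Float32Array:

```javascript
// Hypothetical compile-time float heap: lease() hands out starting
// indices during compilation, finalize() makes the single allocation,
// and view() returns a subarray window into the shared store.
function Heap() {
    this.size = 0;      // floats requested so far
}
Heap.prototype.lease = function (count) {
    var start = this.size;
    this.size += count;
    return start;       // index is known at compile time
};
Heap.prototype.finalize = function () {
    // one allocation for everything; views are cheap slices of it
    this.buffer = new ArrayBuffer(this.size * 4);
    this.floats = new Float32Array(this.buffer);
};
Heap.prototype.view = function (start, count) {
    return this.floats.subarray(start, start + count);
};
```

Writes through any view land in the shared backing store, which is what would let the generated code pass contiguous sections straight to uniform uploads.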

I have no idea what a good solution is. This is probably the threshold where games should just use instancing.

Aeva commented 7 years ago

Ok, new branch "heapexperiment" tracks trying to speed up the static drawing demo. It doesn't work right now though, for reasons which are detailed in the last commit message. Considering forking this off into its own issue #238, but the gist is that the JS IR objects created for uniform uploads should be replaced with IR that just produces the actual GL call. From what I've been able to figure out, that will cause the 'new Float32Array' stuff in the compiled output to basically function as a type cast instead of a new object. God damn it, javascript >>

Aeva commented 7 years ago

Closed out issue #238, and merged the resulting branch back into fast_graph.

Aeva commented 7 years ago

Only thing that is overtly broken at this point is the lighting demo (#244). Once that is fixed, it's just a couple of loose ends before I'm comfortable merging this into master! O_O

Aeva commented 7 years ago

Did some profiling tonight, since I discovered google chrome's profiler doesn't have any problems with the generated functions, unlike firefox developer edition. Right now, one of the most expensive things is the "__regen_glsl_bindings" stuff, which is something I want to get rid of anyway.

The actual generated rendering functions occupy a very small percentage of the run time (about 5%) in the lighting demo!!! Most of the demos run at 60fps (picking causes slight slowdowns). Demos in master also run fairly well in chrome, maybe a few fps slower. The "static_drawing" demo runs at about 55fps (~40 in firefox) - the generated function ends up occupying the majority of the run time percentage, but at this point, the sheer bulk of gl calls (for 3721 objects) within it also shows up right next to it, so I can't feel too bad about that. The static drawing demo shows that this is definitely a point where there would be a lot to gain from instancing.

Aeva commented 7 years ago

Merged the instancing branch into fast_graph <3