So - I had some thoughts about these comments.
I don't think compilation speed should be a governing factor in our design. If we can achieve ideal performance with static polymorphism and template-heavy code, we should do so.
Secondly - with a half-edge or quad-edge or similar 'poly mesh editing' structure, the whole point of it is to be dynamically manipulable on a per-element level in constant time. These structures are necessarily allocated per-element and manipulated via pointers. Once the mesh is done being edited or constructed, of course it makes sense to bake the half-edge or quad-edge into a vectorized representation - which is what I imagine the Hbr->Nsd transition entails - but otherwise, isn't manipulability and easy traversal the main goal of a mesh representation like that in Hbr?
Please pardon my abstract understanding of the code, I am just responding to Jerome's comments.
Chris
On Thu, Jun 28, 2012 at 10:12 AM, Manuel Kraemer wrote:
Some suggestions from jcgagnon:
First, it seems that all the code is in header files because you templated the precision of the vertex position; I think that makes things heavier for the compiler. If the positions were stored in some generic attribute container (like in Alembic), and were accessed through some interface instead of being embedded directly in the mesh structure, just like other attributes (UVs, normals, etc), it could help make the code cleaner. The position attribute could itself be templated by its precision, but a generic interface to access it would avoid having to template the whole mesh and all the algorithms based on that. That generic interface could include calls such as processing some linear combination of attributes. And position-related algorithms could still be templated based on precision, but at least it would be localized.
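As a hedged illustration of what such a generic, precision-agnostic interface could look like (class names hypothetical, not Alembic or OpenSubdiv API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not an existing API): the mesh and the algorithms see only
// this non-templated interface, so precision templating stays localized to the
// concrete attribute container.
class PositionInterface {
public:
    virtual ~PositionInterface() {}
    // "linear combination" entry point: dst = sum_i weights[i] * position(srcIndices[i])
    virtual void Combine(std::size_t dst, std::size_t const * srcIndices,
                         float const * weights, std::size_t count) = 0;
};

template <typename Real>  // float or double - only the container is templated
class Positions : public PositionInterface {
public:
    explicit Positions(std::size_t numVerts) : _xyz(numVerts * 3, Real(0)) {}

    void Combine(std::size_t dst, std::size_t const * srcIndices,
                 float const * weights, std::size_t count) {
        Real * d = &_xyz[dst * 3];
        d[0] = d[1] = d[2] = Real(0);
        for (std::size_t i = 0; i < count; ++i)
            for (int c = 0; c < 3; ++c)
                d[c] += Real(weights[i]) * _xyz[srcIndices[i] * 3 + c];
    }

private:
    std::vector<Real> _xyz; // positions stored as one flat xyz array
};
```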
Another thing is the choice, in hbr, to use pointers and instantiated classes (mesh, edge, vertex) directly as the stored structure. Although that might make things look more user-friendly, it makes everything take more memory and causes more memory fragmentation. In particular, you can no longer reallocate memory blocks, and you need to use that block allocator so your pointers remain valid (for which I would use more than 64 elements per block, which seems to be your default). To me, that is a bit like implementing std::vector by storing an array of <iterator, data> pairs; it's a lot of redundancy that doesn't need to be in memory at all times. Instead, I personally prefer to store data in compact data structures, and use temporary iterators (or accessors) to add, when needed, the user-friendly wrappers on top of it. For instance, the vertex data would be pure data, but the stack-allocated vertex accessor (or iterator) would wrap both the vertex index and the mesh pointer, and give all the required facilities (get adjacent faces, children faces, vertices, etc). And using indices instead of pointers makes the structure realloc-friendly, which is good since keeping the data in contiguous, big arrays avoids memory fragmentation - often the biggest performance bottleneck in polygon mesh data structures.
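For illustration, a rough sketch of the compact-container-plus-temporary-accessor idea described above (all names hypothetical, not Hbr or OpenSubdiv API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical compact mesh: flat, realloc-friendly arrays of indices.
struct CompactMesh {
    std::vector<std::int32_t> faceVerts;        // 4 vertex indices per (quad) face
    std::vector<std::int32_t> vertFaces;        // faces adjacent to each vertex, flattened
    std::vector<std::int32_t> vertFacesOffsets; // numVerts+1 offsets into vertFaces
};

// Temporary, stack-allocated accessor wrapping (mesh pointer, vertex index):
// the user-friendly queries live here, not in the stored data.
class VertexAccessor {
public:
    VertexAccessor(CompactMesh const * mesh, std::int32_t vert)
        : _mesh(mesh), _vert(vert) {}

    int GetNumAdjacentFaces() const {
        return _mesh->vertFacesOffsets[_vert + 1] - _mesh->vertFacesOffsets[_vert];
    }
    std::int32_t GetAdjacentFace(int i) const {
        return _mesh->vertFaces[_mesh->vertFacesOffsets[_vert] + i];
    }

private:
    CompactMesh const * _mesh; // only lives on the stack while traversing
    std::int32_t        _vert;
};
```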
I don't think compilation speed should be a governing factor in our design. If we can achieve ideal performance with static polymorphism and template-heavy code, we should do so.
Agreed - but the assumption about the reason for the templating isn't accurate, afaik: it's not that we are trying to isolate the precision, it's that we are trying to isolate the logic of the subdivision schemes from the actual data representation (hence the 'r' in the name: representation). This is a very powerful expression of the algorithm because it frees this code from a LOT of the complexity of having to deal with either interleaved or non-interleaved vertex data. Our naive implementation of Nsd (which will be replaced by Osd at Pixar) declared the vvVertex to have all-interleaved vertex data. On GPUs, the interleaving decision is driven by time invariance: "static" primvars are passed once in non-interleaved vectors (or texture buffers) while "animated" primvars are passed as interleaved VBO data.
The other observation is that Renderman also tends to not make any real distinction between "position" and "the rest", and considers all data based on interpolation scheme: "vertex", "varying" or "facevarying". The position of a point "P" is just another "primvar" channel that happens to be interpolated with "vertex" weights, usually along with 4 or 5 other data channels like a "reference pose P" and a multitude of "varying" per-vertex scalar fields. Osd sort of hard-wires the position & normal data into the vertex at the moment, but IMHO that's a design mistake and we will have to generalize this code eventually (we have some ideas floating around involving a "ComputePrimvarArray" set of classes, along with a "DrawPrimvarArray", as a way of merging the problem of interleaving the data with the problem of device memory interop into a single solution). I am hoping that we can get to this re-design in the next 2 weeks or so...
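To make the "P is just another primvar channel" idea concrete, here is a small hedged sketch (the PrimvarChannel struct and field names are hypothetical, not Osd API):

```cpp
#include <string>
#include <vector>

// Hypothetical: describe every data channel the same way, with "P" being just
// another entry that happens to be interpolated with "vertex" weights.
enum class Interpolation { Vertex, Varying, FaceVarying };

struct PrimvarChannel {
    std::string   name;          // e.g. "P", "Pref", "mask"
    Interpolation interpolation; // how the channel is weighted during subdivision
    int           width;         // floats per element (3 for P, 1 for a scalar field)
};

// A mesh would carry a flat list of channels instead of hard-wiring position/normal:
std::vector<PrimvarChannel> channels = {
    { "P",    Interpolation::Vertex,      3 },
    { "Pref", Interpolation::Vertex,      3 },
    { "mask", Interpolation::Varying,     1 },
    { "uv",   Interpolation::FaceVarying, 2 },
};
```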
Being able to separate the subdivision logic (which is already complex as hell) from all the data representation concerns I describe above, using this deliciously simple vertex class interface with only an "AddWithWeight" interpolation method, is a near stroke of genius IMHO (this implementation is from the PRman team, not mine, btw). But that's why I have preserved this templated design pattern all the way into the "Far" section of the code, although I will likely be backing off from the dual templating in many places as soon as I can refactor the code again. The compute dispatcher code is awkward and is causing many problems down the line in Osd, so hopefully we will be able to simplify some of this craziness.
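For readers new to the code, this is roughly the shape of vertex class the templated subdivision logic gets instantiated with. The name and data members below are illustrative; only the AddWithWeight interpolation call mentioned above matters to the scheme code (the Clear() helper here is just part of the sketch):

```cpp
// Minimal sketch: the subdivision scheme only ever accumulates weighted
// contributions, so it never needs to know what data the vertex carries.
struct MyVertex {
    float position[3];

    void Clear() {
        position[0] = position[1] = position[2] = 0.0f;
    }

    // Accumulate a weighted contribution from a neighboring vertex.
    void AddWithWeight(MyVertex const & src, float weight) {
        position[0] += weight * src.position[0];
        position[1] += weight * src.position[1];
        position[2] += weight * src.position[2];
    }
};
```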
Secondly - with a half-edge or quad-edge or similar 'poly mesh editing' structure, the whole point of it is to be dynamically manipulable on a per-element level in constant time. These structures are necessarily allocated per-element and manipulated via pointers. Once the mesh is done being edited or constructed, of course it makes sense to bake the half-edge or quad-edge into a vectorized representation - which is what I imagine the Hbr->Nsd transition entails - but otherwise, isn't manipulability and easy traversal the main goal of a mesh representation like that in Hbr?
Amusingly, this code is written for prman, which does absolutely no vertex editing of any kind... Having said that, using 32b int indices instead of 64b pointers could alleviate a lot of run-time memory overhead. I am not so concerned about fragmentation, thanks to the poly-malloc (which allocates blocks of 1024 units if memory serves, not 64 - but I will double check).
If there were a solution to this problem it would probably save a lot of memory for prman, but I am not sure how easy it would be to implement at this stage, nor whether 32b ints would be enough to address all the vertices... The critical thing to remember, maybe, is that we only edit or manipulate vertices at the coarse level (hand-waving at hierarchical edits for now). We have been approaching this from the angle of animation / articulation / sculpting / painting / display apps, not from the angle of a modeller like Modo, so being able to perform topological analysis at interactive rates on heavily subdivided geometry was never really part of the plan... Maybe we should change this narrow scope and expand our goals?
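To put rough numbers on the 32b-index vs 64b-pointer point above, a minimal sketch assuming a 64-bit build (the struct names are hypothetical, not Hbr's actual layout):

```cpp
#include <cstdint>

// Pointer-based half-edge record: three 8-byte pointers = 24 bytes on a 64-bit
// build, and the records must stay put, so reallocating their block is off-limits.
struct HalfEdgePtr {
    HalfEdgePtr * next;
    HalfEdgePtr * opposite;
    struct VertexPtr * origin;
};

// Index-based equivalent: three 4-byte indices = 12 bytes, and since indices stay
// valid across reallocation, the edges can live in one big contiguous array.
struct HalfEdgeIdx {
    std::int32_t next;
    std::int32_t opposite;
    std::int32_t origin;
};

static_assert(sizeof(HalfEdgePtr) == 24, "assumes 64-bit pointers");
static_assert(sizeof(HalfEdgeIdx) == 12, "half the footprint with 32-bit indices");
```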
Please pardon my abstract understanding of the code, I am just responding to Jerome's comments.
This email has been sitting in my mailbox for a few days now too... and I didn't want to drop a premature wall of negative feedback, so I am still chewing on these suggestions and trying to see if there is a way to reconcile both approaches, as there is a lot of merit to optimizing PRman's Hbr in the future indeed.
PS: sorry, I had to edit my original post pretty heavily...
It is interesting to see the contrast in data layout between the Hbr mesh of small face/vertex/edge objects and the big honkin' flat vectors for the CC tables used at runtime. Clearly with the CC tables we see the gains of coherent, indexed memory access, and I think we also clearly see the overhead of the Hbr structures - particularly with release one, when we subdivide uniformly in Hbr several levels deep. It takes longer than we'd like to compute the tables, but then the tables are nice and fast.
It would be lovely to have a different "front end" library to compute tables than Hbr if we can make it nice and fast. The requirements would be:

* Designed for multithreading. Hbr is tricky to subdivide with many threads - just one shared mutex?
* Compact and coherent memory access. I wonder if we could store topology information more like a hierarchical tree with "buckets" of topology, so multiple threads could each be iterating in one bucket. Currently Hbr (and all other Pixar subdiv libraries) requires chasing pointers all over the place.
* Incremental updates - at the least incremental refinement for feature adaptive subdivision, ultimately supporting incremental modification of the base mesh.
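As a rough sketch of the "buckets of topology" idea above (hypothetical types and functions, not an existing API): each worker thread refines its own contiguous bucket of faces, so threads never chase pointers into each other's working set:

```cpp
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical compact topology bucket: a contiguous slice of faces with their
// vertex indices, so one thread can refine it without touching other buckets.
struct TopologyBucket {
    std::vector<std::int32_t> faceVertIndices; // 4 indices per (quad) face
};

void RefineBucket(TopologyBucket const & bucket /*, output tables ... */) {
    // per-bucket subdivision work would go here
    (void)bucket;
}

void RefineAll(std::vector<TopologyBucket> const & buckets) {
    std::vector<std::thread> workers;
    workers.reserve(buckets.size());
    for (TopologyBucket const & b : buckets)
        workers.emplace_back(RefineBucket, std::cref(b));
    for (std::thread & w : workers)
        w.join();
}
```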
Manuel's description of the intent of the Hbr templating - "we are trying to isolate the logic of the subdivision schemes from the actual data representation" - is great. Hbr separates the vertex varying data (points and shading data) from the topology and subdivision logic. I wonder if we could take this separation even further with Jerome's approach of "I prefer to store data in compact data structures, and instead use temporary iterators (or accessors) to add, when needed, the user friendly wrappers on top of it". Could we store topology information + positional information in separate compact data structures and have an Hbr-like library that performs Catmull-Clark subdivision by instantiating lightweight wrapper (templated?) classes for iteration and access by lots of threads?
Sigh, lots of handwaving there. Would be sweet though.
Secondly - with a half-edge or quad-edge or similar 'poly mesh editing' structure, the whole point of it is to be dynamically manipulable on a per-element level in constant time. These structures are necessarily allocated per-element and manipulated via pointers
Well, I understand that you need to have per-element treatment, but I don't think that it imposes the usage of pointers or permanent self-sufficient element classes. It's a matter of data abstraction, iterator / accessor VS container. I've rewritten the mesh structure of a well-known DCC package in the past, and it supported a full set of modeling operations.
Agreed - but the assumption about the reason for the templating isn't accurate, afaik: it's not that we are trying to isolate the precision, it's that we are trying to isolate the logic of the subdivision schemes from the actual data representation (hence the 'r' in the name: representation). ...
Ok, thank you for this information. My comments were just 'first impressions' while I still don't have a deep understanding of the design / implementation, so thanks for clarifying!
And finally, even if my comments seem negative (I'd say 'constructive'), most of what I saw seemed really clean and well designed, really!
Thanks jcgagnon.
The situation here is much better than it was a month ago, after Sigmund changed the heap allocation patterns in Hbr. He got a two-order-of-magnitude (!) improvement there; we think many of the data structures are now being allocated in big blocks for coherent memory access, so we're jumping around in memory much less.
Still, using integer indices would help and we could get further improvements, but it's already way better than it was.
How's fatherhood treating you? Missed you at SIGGRAPH.
Nice, didn't expect you guys to jump on that so fast :-) Fatherhood is still waiting, but it's really due now! I hope things go well for you at SIGGRAPH, and let me know if there is anything online that I can see now from your work!
Jerome
Minor update: I did roll some of the optimizations right before SIGGRAPH, but there is still a fairly chunky pull-request dangling out there.
I first want to get Julian's eyes on it, as there may be some performance / memory repercussions for PRman, and I would also really like to have a solid regression suite for Hbr in place to test this new code before launching into open-heart surgery...
Keeping the discussion going:
JC: if you have a few cycles, I wouldn't mind some more 'constructive' comments ;) Pre-emptive disclaimer: Far needs a thorough cleanup pass...
Great! I'll put reviewing this in my 'todo' list for when I have a bit of time.
Obsoleting this (along with Hbr)
Jerome: please let us know what you think of the new Vtr code if you have some time to look at it :)
Hi Manuel,
Just had a look at the new Vtr code, and it's indeed much more compact as a data structure, and you are doing the topological feature analysis in batch (postponed), like I do in our Fabric meshes. I guess you got much faster build / traversal times with this new structure?
I see that your local adjacency information is limited to 8 bits (points per polygon, polygons adjacent to a vertex). To get around this limitation without adding more memory in the general case, I made our mesh structure more complex, using a bit to indicate whether the local index takes more space. But it makes things much more complex, so if you can live with your 256 limit it's certainly easier.
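A toy sketch of the escape-bit trick described above (purely illustrative - not how Vtr or Fabric actually encode local indices):

```cpp
#include <cstdint>
#include <vector>

// Store local indices in 8 bits, but reserve the top bit to say "the real value
// lives in a side table of wider indices".
class LocalIndexArray {
public:
    void Push(std::uint32_t value) {
        if (value < 128) {
            _packed.push_back(static_cast<std::uint8_t>(value));
        } else {
            // high bit set: low 7 bits index into the overflow table
            // (this toy version can only hold 128 overflow entries per array)
            _packed.push_back(static_cast<std::uint8_t>(0x80 | _overflow.size()));
            _overflow.push_back(value);
        }
    }

    std::uint32_t Get(std::size_t i) const {
        std::uint8_t p = _packed[i];
        return (p & 0x80) ? _overflow[p & 0x7F] : p;
    }

private:
    std::vector<std::uint8_t>  _packed;   // one byte per element in the common case
    std::vector<std::uint32_t> _overflow; // rare wide values
};
```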
Keep up the nice work!