deadsy / sdfx

A simple CAD package using signed distance functions
MIT License

add slice render method #49

Closed soypat closed 2 years ago

soypat commented 2 years ago

Octree rendering got almost a 2x speed boost. With a little creative freedom, I'd offer to rewrite the STL part of the render package, since it seems to have plenty of room for improvement in both speed and API design, such as following the io.Writer convention, minimizing heap usage, and more.

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -coverprofile=/tmp/vscode-go3I2Mcj/go-code-cover -bench . github.com/deadsy/sdfx/render

goos: linux
goarch: amd64
pkg: github.com/deadsy/sdfx/render
cpu: AMD Ryzen 5 3400G with Radeon Vega Graphics    
BenchmarkSaveSTL-8             2     538800719 ns/op    288716684 B/op    547006 allocs/op
BenchmarkStreamSTL-8           2     915093915 ns/op    67749084 B/op     550130 allocs/op
PASS
coverage: 25.1% of statements
ok      github.com/deadsy/sdfx/render   4.422s

deadsy commented 2 years ago

So the basic idea is... render to a slice of triangles (in memory) rather than write it via a channel to the file? That's a speedup for octree and regular marching cubes right? (at the cost of allocated memory)

A few observations...

Building a bolt in the benchmark: the screw form doesn't render well with octree marching cubes because the signed distance field is actually wrong at far distances. Probably doesn't matter for the purposes of the benchmark, but still...

Octree marching cubes is single-threaded. Getting some goroutines in there could give an xN speedup (N == number of cores).

What I really want to see work properly is the double contouring renderer. At the moment it's slow and the result is shitty. I'd be ok with it taking a little bit longer than marching cubes if the stl files were smaller and the quality of the mesh just as good.

soypat commented 2 years ago

I believe both would benefit from the same rewrite, since the use of channels is inefficient in this case. I do have an idea for speeding it up using multiple cores, but it requires considerable concurrency infrastructure. Channels should ideally receive a large batch of triangles to write to file as a []Triangle3, since the channel's send overhead becomes a bottleneck when sending individual triangles.

I think the best approach is to first have a simple, working, non-concurrent implementation of the renderer without fancy channels. Once there is a working single-threaded renderer, one can start thinking about reusing its functions in a multithreaded version and developing a MultiCoreRenderer3 interface or something of the sort.
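A minimal sketch of that layering, under stated assumptions: V3 and Triangle3 are stand-ins for the sdfx types, and renderFunc, fakeRender, and renderParallel are hypothetical names, not part of the render package. The point is only that a slice-returning single-threaded function can be reused unchanged by a multi-core wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

// V3 and Triangle3 are stand-ins for the sdfx types so the sketch is
// self-contained.
type V3 struct{ X, Y, Z float64 }
type Triangle3 [3]V3

// renderFunc is a hypothetical single-threaded renderer: it returns every
// triangle for the cell layers in [lo, hi).
type renderFunc func(lo, hi int) []Triangle3

// fakeRender stands in for marching cubes and emits one triangle per layer.
func fakeRender(lo, hi int) []Triangle3 {
	tris := make([]Triangle3, 0, hi-lo)
	for i := lo; i < hi; i++ {
		z := float64(i)
		tris = append(tris, Triangle3{{0, 0, z}, {1, 0, z}, {0, 1, z}})
	}
	return tris
}

// renderParallel splits [0, layers) into contiguous chunks, runs the
// single-threaded render on each chunk in its own goroutine, and
// concatenates the results in chunk order so the output is deterministic.
func renderParallel(render renderFunc, layers, workers int) []Triangle3 {
	if workers < 1 {
		workers = 1
	}
	parts := make([][]Triangle3, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * layers / workers
		hi := (w + 1) * layers / workers
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			parts[w] = render(lo, hi) // each goroutine writes only its own slot
		}(w, lo, hi)
	}
	wg.Wait()
	var out []Triangle3
	for _, p := range parts {
		out = append(out, p...)
	}
	return out
}

func main() {
	serial := fakeRender(0, 100)
	parallel := renderParallel(fakeRender, 100, 4)
	fmt.Println(len(serial), len(parallel))
}
```

No channels appear in either layer here; the caller just gets a slice back. Whether a real SDF evaluation splits this cleanly along one axis is an open question the sketch does not answer.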

If I am given your blessing I will tear down the concurrent implementation as it stands today and rewrite all of it to be single-threaded with the possibility of an even faster multi-core version tomorrow.

As for the double contouring renderer, I can give that a go as well. Be warned: I'm not that well versed in computational geometry, if that's where the problem lies.

soypat commented 2 years ago

I am also in need of solving https://github.com/deadsy/sdfx/issues/35, which was the primary reason I forked this repository... I may end up shaving more yaks than I planned.

deadsy commented 2 years ago

If I am given your blessing I will tear down the concurrent implementation as it stands today and rewrite all of it to be single-threaded with the possibility of an even faster multi-core version tomorrow.

The normal marching cubes (march3.go) is concurrent. The octree marching cubes (march3x.go) is single-threaded.

btw - the reason the slower march3.go is kept around is because of the aforementioned issues with rendering objects with approximated distance fields (screw threads mostly).

Other than that, the octree renderer is equivalent to uniform cube decomposition of space; it just does less work.

If performance is a significant concern then there's a big gain to be had in throwing more cores at the octree renderer, i.e. xN where N is how many cores you have. In this case a channel implementation for streaming the triangles out is advantageous because it takes care of concurrency issues for you.

// External code writes triangles to this channel.
// This goroutine reads the channel and writes triangles to the file.
c := make(chan *Triangle3)

It might be an idea to try some channel buffering experiments to see if there are any performance gains to be had from that. I suspect most of the gains you've seen with an in-memory slice could be gained by a bit more decoupling between the renderer and the file writer.
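One way such an experiment could be set up is sketched below. It is not taken from the sdfx code: Triangle3 is a stand-in type, and sendSingle and sendBatched are hypothetical helpers. It compares sending one triangle per channel operation (with different buffer sizes) against sending batches as []Triangle3. Timings will vary by machine, so none are claimed here.

```go
package main

import (
	"fmt"
	"time"
)

// V3 and Triangle3 are stand-ins for the sdfx types.
type V3 struct{ X, Y, Z float64 }
type Triangle3 [3]V3

// sendSingle pushes n triangles one at a time over a channel with the given
// buffer size and returns how many the consumer received.
func sendSingle(n, buffer int) int {
	c := make(chan *Triangle3, buffer)
	go func() {
		defer close(c) // the producer owns the channel and closes it
		for i := 0; i < n; i++ {
			c <- &Triangle3{}
		}
	}()
	count := 0
	for range c {
		count++
	}
	return count
}

// sendBatched pushes the same n triangles in slices of up to batch
// triangles, so the channel overhead is paid once per batch.
func sendBatched(n, batch int) (count, sends int) {
	if batch < 1 {
		batch = 1
	}
	c := make(chan []Triangle3)
	go func() {
		defer close(c)
		buf := make([]Triangle3, 0, batch)
		for i := 0; i < n; i++ {
			buf = append(buf, Triangle3{})
			if len(buf) == batch {
				c <- buf
				// Fresh slice: the receiver now owns the old one.
				buf = make([]Triangle3, 0, batch)
			}
		}
		if len(buf) > 0 {
			c <- buf // flush the final partial batch
		}
	}()
	for b := range c {
		count += len(b)
		sends++
	}
	return count, sends
}

func main() {
	const n = 200000
	for _, buffer := range []int{0, 64, 4096} {
		start := time.Now()
		got := sendSingle(n, buffer)
		fmt.Printf("single, buffer=%d: %d triangles in %v\n", buffer, got, time.Since(start))
	}
	start := time.Now()
	got, sends := sendBatched(n, 1024)
	fmt.Printf("batched, batch=1024: %d triangles in %d sends, %v\n", got, sends, time.Since(start))
}
```

For measurements worth quoting, the same two helpers would be better driven from testing.B benchmarks than from wall-clock timing in main.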

io.Writer

That's a byte buffer oriented interface. Marching cubes creates a stream of triangles, and how you choose to marshal those into bytes is somewhat arbitrary, e.g. stl, 3mf, ....

Now nothing stops you from building a converter (triangle input channel, write to io.Writer with a marshalling format of your own design), but that doesn't belong in the renderer.

E.g.

object + renderer -> stl writer (to file)
object + renderer -> 3mf writer (to file)
object + renderer -> converter (to io.Writer)

soypat commented 2 years ago

The normal marching cubes (march3.go) is concurrent. The octree marching cubes (march3x.go) is single-threaded.

I think you are mixing them up? MarchingCubesUniform is very much single threaded from what I'm seeing here:

func (m *MarchingCubesUniform) Render(s sdf.SDF3, meshCells int, output chan<- *Triangle3) {
    // work out the region we will sample
    bb0 := s.BoundingBox()
    bb0Size := bb0.Size()
    meshInc := bb0Size.MaxComponent() / float64(meshCells)
    bb1Size := bb0Size.DivScalar(meshInc)
    bb1Size = bb1Size.Ceil().AddScalar(1)
    bb1Size = bb1Size.MulScalar(meshInc)
    bb := sdf.NewBox3(bb0.Center(), bb1Size)
    for _, tri := range marchingCubes(s, bb, meshInc) { // this is a slice, not a channel
        output <- tri
    }
}

While MarchingCubesOctree makes use of a dcache3 type which does send triangles over a channel. It is my understanding

I wrote a single-threaded implementation for MarchingCubesOctree which was around twice as fast as the original version, which, from my understanding as outlined here, is multi-core? Maybe I'm not following you completely. Anyway, performance is a minor concern of mine; if it can be gained, so be it. My major concern is that the current implementation jumped the gun and tried to be concurrent before having a good single-threaded implementation. This makes for an unwieldy API for working with the STL render functions.

I want to write a real-time browser 3D renderer (the current sdf-ui implementation is slow and clunky) using three.js and Go WASM bindings to make it a pleasure to work with sdfx (this is for the CERN-organized hackathon/competition). As the render package exists today, I would much prefer to fork the repository and rewrite it from scratch to best fit my use case. This is because the current implementation gives little thought to how users would actually use it. There is also a concern that the API as it exists today is slow because of this premature concurrent optimization.

As a user, contributor and just random guy on the internet, I strongly suggest the render package be rethought from the ground up. Not only would it make it nicer to work with, it would also make the package as a whole much easier to contribute to! Having a single-threaded implementation is simple and easy to follow. New users could improve these functions and they'd also be improving the multi-core version, since ideally the multi-core renderer would also use these functions!

deadsy commented 2 years ago

I think you are mixing them up

No. Read the code. The dcache3 stuff had locks put on it in preparation for a multi-threaded octree renderer, but it's not currently necessary.

real-time browser 3d renderer

That's a different problem than the renderer deals with, i.e. 3D preview concerns itself with visible faces while STL generation has to concern itself with the whole object.

soypat commented 2 years ago

No. Read the code. The dcache3 stuff had locks put on it in preparation for a multi-threaded octree renderer, but it's not currently necessary.

I was wrong. I think I managed to find the parallelization in normal marching cubes in an init() function which starts up a workerpool on evalProcessCh.

That's a different problem than the renderer deals with.

I'm not concerning myself with low-level 3D preview. three.js receives a 3D object, which in this case could be a bunch of triangles, and does the face culling and whatnot itself. I just need the whole set of triangles, and three.js will provide a fluid 3D preview of the whole part.

That's a byte buffer oriented interface. Marching cubes creates a stream of triangles, and how you choose to marshal those into bytes is somewhat arbitrary, e.g. stl, 3mf, ....

For my application, triangles become bytes when I send them over HTTP. I guess I could use the Render3 interface, but more on why I'm not a fan of the Render function signature below.

It might be an idea to try some channel buffering experiments to see if there are any performance gains to be had from that. I suspect most of the gains you've seen with an in-memory slice could be gained by a bit more decoupling between the renderer and the file writer.

Yes, this is true. The single-triangle queue given by Render3 is a huge bottleneck. A better signature for this could be Render(sdf3 sdf.SDF3, meshCells int, output chan<- []Triangle3), where output receives batches of triangles... though this is questionable design for several reasons:

  1. It raises many questions about how to use Render: who closes the channel? Does the function block? Do I have to call it in a goroutine?
  2. To implement a renderer, one needs to bake in concurrency, which is hard to get right from the start. All rendering code then has a dual responsibility: computing geometry and also handling the language's concurrency features (the multi-core aspect of the computation), which makes the code harder to maintain in the long run.

I'm not sure what form a "good" Render interface would have. I'd really have to think long about it. It would be awesome if there was no channel handling on the user's side, but rather that happened internally.
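One possible shape, offered only as a sketch and not as the sdfx API, is a pull-style interface modeled on io.Reader: the caller supplies a buffer and the renderer fills it until it reports io.EOF. The names Renderer, ReadTriangles, layerRenderer, and RenderAll are all hypothetical, and Triangle3 is a stand-in type. Any goroutines or channels would live inside the implementation, out of the user's sight.

```go
package main

import (
	"fmt"
	"io"
)

// V3 and Triangle3 are stand-ins for the sdfx types.
type V3 struct{ X, Y, Z float64 }
type Triangle3 [3]V3

// Renderer is a hypothetical pull-style interface. ReadTriangles fills dst
// with up to len(dst) triangles and returns io.EOF once nothing is left.
type Renderer interface {
	ReadTriangles(dst []Triangle3) (n int, err error)
}

// layerRenderer is a toy implementation that emits one triangle per layer.
// A real renderer would evaluate the SDF here instead.
type layerRenderer struct{ next, layers int }

func (r *layerRenderer) ReadTriangles(dst []Triangle3) (int, error) {
	n := 0
	for n < len(dst) && r.next < r.layers {
		z := float64(r.next)
		dst[n] = Triangle3{{0, 0, z}, {1, 0, z}, {0, 1, z}}
		n++
		r.next++
	}
	if n == 0 && r.next >= r.layers {
		return 0, io.EOF
	}
	return n, nil
}

// RenderAll drains r through a reusable buffer of bufSize triangles and
// returns everything as one slice. The caller never touches a channel.
func RenderAll(r Renderer, bufSize int) ([]Triangle3, error) {
	if bufSize < 1 {
		bufSize = 1 // a zero-length buffer could never make progress
	}
	buf := make([]Triangle3, bufSize)
	var out []Triangle3
	for {
		n, err := r.ReadTriangles(buf)
		out = append(out, buf[:n]...)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}

func main() {
	tris, err := RenderAll(&layerRenderer{layers: 10}, 4)
	fmt.Println(len(tris), err)
}
```

With this shape the questions from the list above mostly disappear: the call blocks, there is no channel to close, and the buffer size controls batching and heap usage.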