
Add support to load/store the acceleration structure #137

Closed · Anteru closed this 7 years ago

Anteru commented 7 years ago

I have various very large meshes, and it would be nice if there was a way to store the acceleration structure and reload it instead of having to rebuild it every time. Obviously in the ideal case it would be possible to store them in some format which is forward/backwards compatible, but for 90% of the uses, even a machine-dependent format would be sufficient (i.e. just a binary dump).

cbenthin commented 7 years ago

Hi, could you elaborate on the particular use case where you think reloading the acceleration structure from memory/disk is faster than rebuilding it? The reason I'm asking is that Embree's BVH builders are really fast, and assuming you have a couple of CPU cores in your machine that can be used to speed up the BVH building phase, rebuilding the acceleration structure will actually be faster than loading a binary dump (according to our benchmarks). Also, it's not an "easy" feature to add, as we frequently change the internal data structures from one version to another and we would have to keep them backwards compatible etc. If the BVH builders are not fast enough for your purpose, it's probably easier to speed them up than to introduce a BVH store/load feature.

cbenthin commented 7 years ago

BTW: How big are your models and what kind of machine are you using (e.g. #CPU cores, frequency etc)?

Anteru commented 7 years ago

Low number of CPU cores (4 - some users have just two on notebooks), usually lots of memory (>24 GiB), and scenes between 20 and 500 million triangles. Especially with a slow CPU, it would be nice to cache the acceleration structure so there's no startup cost beyond reading it from the drive (which is ~200 MiB/second). Rendering being slow is ok; it's about optimizing time & memory to first pixel.

I'm fine if it's not backwards compatible, I don't update Embree that often.

cbenthin commented 7 years ago

4 cores is really not a lot. Do you use the high-quality mode? What are your BVH build times on your machine, and how much of the total startup time do they account for? If startup time is the primary issue, did you try setting the geometry flags to dynamic to invoke our fast (but lower-quality) builders? If yes, does it make a difference?
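
For readers following along, here is a minimal sketch of what that suggestion looks like with the Embree 2.x API of the time; the buffer layout and helper names are illustrative, not taken from this thread:

```cpp
// Sketch (Embree 2.x API): request the fast, lower-quality builders by
// marking the scene and geometry as dynamic instead of static.
#include <embree2/rtcore.h>
#include <cstddef>

void buildWithFastBuilders(const float* vertices, size_t numVertices,
                           const unsigned* indices, size_t numTriangles)
{
  RTCDevice device = rtcNewDevice(NULL);

  // RTC_SCENE_DYNAMIC / RTC_GEOMETRY_DYNAMIC select builders tuned for
  // fast (re)builds rather than maximum BVH quality.
  RTCScene scene = rtcDeviceNewScene(device, RTC_SCENE_DYNAMIC, RTC_INTERSECT1);
  unsigned geomID = rtcNewTriangleMesh(scene, RTC_GEOMETRY_DYNAMIC,
                                       numTriangles, numVertices);

  // Share the application's buffers: 4 floats per vertex (x, y, z, pad),
  // 3 indices per triangle.
  rtcSetBuffer(scene, geomID, RTC_VERTEX_BUFFER, vertices, 0, 4 * sizeof(float));
  rtcSetBuffer(scene, geomID, RTC_INDEX_BUFFER,  indices,  0, 3 * sizeof(unsigned));

  rtcCommit(scene);  // this is where the BVH build time goes

  // ... trace rays, then clean up ...
  rtcDeleteScene(scene);
  rtcDeleteDevice(device);
}
```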

Btw: you can also send an email to embree_support@intel.com to discuss the issue directly.

cbenthin commented 7 years ago

I did some quick benchmarks using just 4 threads on my workstation and the crown model (4.8M triangles). For our standard/default BVH builders I get a build time of 988ms, which corresponds to 4.9 Mprims/s. Expressed as the rate at which the builder writes out BVH data to memory, that corresponds to 315 MB/s. So I think even with 4 threads you are likely to rebuild faster than you can load from disk.

Anteru commented 7 years ago

Just benchmarked the 350M polygon scene: the rtcCommit takes 1 minute 8 seconds, without paging (it just barely fits into the 24 GiB on the test machine). The BVH shouldn't be more than a couple of GiB in size, so I'd expect to be able to load it off disk (especially with any kind of modern SSD) faster than recomputing it on load.

cbenthin commented 7 years ago

I get basically the same for the Boeing model (350M tris): around 5 Mprims/s with just 4 threads, which corresponds to a write speed of 305 MB/s (21 GB total size including pre-gathered triangle data). Did you try our fast builders too? Would that be an option?

In general our BVH builders scale pretty well with increased CPU core count, so I'm skeptical whether it makes sense to introduce such a feature, in particular as our standard customer machine configuration has more than 4 cores. My dual-socket workstation builds the BVH for the Boeing model in less than 10 seconds.

There's also the option to directly use our triangle-pair feature (quads), as many models actually consist of quads/triangle pairs. In the ideal case that would halve your input triangle count and halve the BVH build time.
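
A hedged sketch of that quad path, using the rtcNewQuadMesh entry point that Embree 2.x provides; it assumes the application already has quad connectivity available:

```cpp
// Sketch (Embree 2.x API): submit quads directly so Embree stores one
// primitive per quad (triangle pair) instead of two separate triangles.
unsigned addQuadMesh(RTCScene scene,
                     const float* vertices, size_t numVertices,
                     const unsigned* quadIndices, size_t numQuads)
{
  unsigned geomID = rtcNewQuadMesh(scene, RTC_GEOMETRY_STATIC,
                                   numQuads, numVertices);

  // 4 floats per vertex (x, y, z, pad), 4 indices per quad.
  rtcSetBuffer(scene, geomID, RTC_VERTEX_BUFFER, vertices,    0, 4 * sizeof(float));
  rtcSetBuffer(scene, geomID, RTC_INDEX_BUFFER,  quadIndices, 0, 4 * sizeof(unsigned));
  return geomID;
}
```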

Another question: in general, are you using the 4-core machine for rendering these big models too, or only for building the BVH while doing the rendering e.g. on the GPU? If it's the latter, you could use our external BVH build API, which allows you to do whatever you want with the BVH tree.

Please shoot me an email directly and we can try to find a workaround for your setup. Thanks.

kayru commented 7 years ago

Hi,

We are also interested in a BVH binary dump, ideally memory-mappable. All the usual caveats wrt portability and compatibility between versions are absolutely fine for our use case. You might be able to dig out an old email thread (from ~a year ago) about this :)

Since PPL support was added, it became possible for us to use the multi-threaded builder in our environment, so a pre-baked BVH became a lot less important. My gut feeling is that loading the BVH instead of building it will still be a performance win for us in some cases, since we can do other work on the CPU while reading the data asynchronously.

Yuriy

cbenthin commented 7 years ago

Hi Yuriy,

Memory mapping large binary files that will be randomly accessed during rendering is actually quite tricky because of page-fault handling contention in the OS kernel. Depending on the OS and kernel version, access to a page (or even the entire mapped memory region) is serialized in the kernel on the very first access. This can introduce quite an overhead, in particular with dozens/hundreds of threads hammering on the region. It would only be worthwhile if the memory footprint accessed during rendering were small compared to the entire file size.
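
This is an OS-level point rather than an Embree one, but for completeness, a minimal Linux-only sketch of the usual workaround: prefault the mapping once up front (here via MAP_POPULATE) so the rendering threads never take first-touch page faults on it:

```cpp
// Sketch (Linux): prefault a memory-mapped BVH dump up front so rendering
// threads never hit first-touch page faults on the mapping.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

void* mapPrefaulted(const char* path, size_t* sizeOut)
{
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;

  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

  // MAP_POPULATE asks the kernel to populate the page tables at mmap time,
  // paying the fault cost once instead of under multi-threaded contention.
  void* ptr = mmap(nullptr, (size_t)st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
  close(fd);  // the mapping keeps its own reference to the file

  if (ptr == MAP_FAILED) return nullptr;
  *sizeOut = (size_t)st.st_size;
  return ptr;
}
```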

My more general question is: how much faster would the BVH build need to be to make the binary dump option irrelevant (on machines with a low core count)? Would 1.5x be enough, or 2x, or even more?

For the bigger models, let's say 10-300+ million triangles, it is actually relatively easy to get a ~1.5x speed-up if one is willing to sacrifice BVH quality a bit and can subdivide the scene into multiple objects. In this case the two-level BVH builder is significantly faster than building a single BVH over the entire scene. However, one needs the latest Embree 2.16.x for this. I can provide more details if anyone is interested.
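
For anyone curious, one way to express that object split with the Embree 2.x API is via instances of per-object scenes; the Mesh struct and its fields below are placeholders for whatever the application uses, and this is only a sketch of the two-level idea, not necessarily the exact configuration meant above:

```cpp
// Sketch (Embree 2.x API): per-object BVHs plus a small top-level BVH over
// instances of those objects.
#include <embree2/rtcore.h>
#include <vector>

struct Mesh {                     // application-side placeholder
  const float*    vertices;       // 4 floats per vertex (x, y, z, pad)
  const unsigned* indices;        // 3 indices per triangle
  size_t          numVertices, numTriangles;
  const float*    transform;      // 3x4 affine matrix, column-major (12 floats)
};

RTCScene buildTwoLevel(RTCDevice device, const std::vector<Mesh>& objects)
{
  RTCScene top = rtcDeviceNewScene(device, RTC_SCENE_DYNAMIC, RTC_INTERSECT1);

  for (const Mesh& m : objects)
  {
    // Per-object scene, built with the fast builders.
    RTCScene obj = rtcDeviceNewScene(device, RTC_SCENE_DYNAMIC, RTC_INTERSECT1);
    unsigned geomID = rtcNewTriangleMesh(obj, RTC_GEOMETRY_DYNAMIC,
                                         m.numTriangles, m.numVertices);
    rtcSetBuffer(obj, geomID, RTC_VERTEX_BUFFER, m.vertices, 0, 4 * sizeof(float));
    rtcSetBuffer(obj, geomID, RTC_INDEX_BUFFER,  m.indices,  0, 3 * sizeof(unsigned));
    rtcCommit(obj);

    // Hook the object into the top-level scene as an instance.
    unsigned instID = rtcNewInstance(top, obj);
    rtcSetTransform(top, instID, RTC_MATRIX_COLUMN_MAJOR, m.transform);
  }

  rtcCommit(top);  // builds only the small top-level BVH over the objects
  return top;
}
```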

Btw: Is your typical machine setup also a 4-core machine or do you use machines with a higher core count?

Thanks.

Anteru commented 7 years ago

The other part of the story is the extra memory used to build the tree. If that can be taken out of the equation, machines which are not powerful enough to build the BVH can still display data generated elsewhere. For me the main issue is that I have lots of time during import/data preparation, but little control over the viewing environment, so the less memory/CPU time I need there, the better. And unfortunately a lot of people are using notebooks which don't come with dual-socket EPYC :( I'll shoot you an email with the details once I have done some more measurements.

cbenthin commented 7 years ago

Did you try the scene COMPACT flag? This mode will not use any temporary data during BVH build. Sven added this very useful feature a couple of versions ago.
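
For reference, in the Embree 2.x API this is the RTC_SCENE_COMPACT scene flag; a short sketch of where it goes, with the rest of the setup elided:

```cpp
// Sketch (Embree 2.x API): request the compact, memory-saving acceleration
// structure layout via the RTC_SCENE_COMPACT scene flag.
RTCDevice device = rtcNewDevice(NULL);
RTCScene scene = rtcDeviceNewScene(device,
    (RTCSceneFlags)(RTC_SCENE_STATIC | RTC_SCENE_COMPACT), RTC_INTERSECT1);
// ... add geometry and rtcCommit(scene) as usual ...
```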

Granted, 2-4 core notebook systems might not be our typical customer setup so we haven't had this "BVH build on low-core system" issue on our radar. Therefore, I really would like to know more details on your setup and use case. Looking forward to your email.

cbenthin commented 7 years ago

Based on some offline discussion with Anteru, I would say the initial BVH build performance using a two-level BVH (+ fast BVH builders per object + triangle pairs instead of triangles) is sufficiently high even on a low-end 4-core CPU that loading a pre-stored BVH is very unlikely to improve the initial build time further. However, we'll probably look into reducing the BVH's memory consumption even further in a future version of Embree. I'll close this issue for now.