RagnarokResearchLab / RagLite

Standalone client/server application for locally simulating a persistent world
Mozilla Public License 2.0

Add a caching mechanism for terrain geometry to reduce repeated loading times #282

Open rdw-software opened 9 months ago

rdw-software commented 9 months ago

Same as #281 but for GND ground mesh sections.

Needs prototyping to see if the speedup is significant before planning any other work related to this matter.


Goals:

Roadmap:

rdw-software commented 9 months ago

Taking a quick look at the worst case and what I'd consider "average" (regular city or dungeon) maps...

TL;DR: The effect is noticeable, but there's a large variance, so this needs fine-tuning to avoid performance degradation.


Timings with no preallocation (i.e., using the built-in exponential resize mechanism of LuaJIT):

# schg_dun01.gnd
[RagnarokGND] Finished generating terrain geometry for 11 ground mesh section(s) in 906.99 ms

# lighthalzen.gnd
[RagnarokGND] Finished generating terrain geometry for 53 ground mesh section(s) in 360.76 ms

# prontera.gnd
[RagnarokGND] Finished generating terrain geometry for 12 ground mesh section(s) in 217.37 ms

# pay_dun00.gnd
[RagnarokGRF] Blocking read for 6.02 ms; decompressed 437 KB in 30.94 ms

And now with a 16k preallocation (likely extremely suboptimal):

# schg_dun01.gnd
[RagnarokGND] Finished generating terrain geometry for 11 ground mesh section(s) in 621.54 ms

# lighthalzen.gnd
[RagnarokGND] Finished generating terrain geometry for 53 ground mesh section(s) in 458.79 ms

# prontera.gnd
[RagnarokGND] Finished generating terrain geometry for 12 ground mesh section(s) in 27.05 ms

# pay_dun00.gnd
[RagnarokGND] Finished generating terrain geometry for 4 ground mesh section(s) in 135.46 ms

Trying 8k then:

# schg_dun01.gnd
[RagnarokGND] Finished generating terrain geometry for 11 ground mesh section(s) in 609.38 ms

# lighthalzen.gnd
[RagnarokGND] Finished generating terrain geometry for 53 ground mesh section(s) in 448.29 ms

# prontera.gnd
[RagnarokGND] Finished generating terrain geometry for 12 ground mesh section(s) in 29.84 ms

# pay_dun00.gnd
[RagnarokGND] Finished generating terrain geometry for 4 ground mesh section(s) in 124.88 ms

Note that this is a sample size of one (per map), so gathering more statistics via scripting would be a good idea.
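A minimal sketch of what such a script could look like, written in Python purely for illustration (the project itself is Lua). The `benchmark` helper and the `fake_geometry_pass` workload are hypothetical stand-ins, not part of RagLite; the point is only to repeat a measurement and report the spread instead of a single sample.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(task: Callable[[], object], runs: int = 10) -> Dict[str, float]:
    """Run `task` repeatedly and summarize wall-clock time in milliseconds."""
    if runs < 2:
        raise ValueError("need at least two runs to estimate the spread")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "mean_ms": statistics.mean(samples),
        "stdev_ms": statistics.stdev(samples),
    }


def fake_geometry_pass() -> int:
    # Hypothetical stand-in for the real terrain generation step.
    return sum(i * i for i in range(50_000))
```

Calling `benchmark(fake_geometry_pass, runs=20)` returns the minimum, median, mean and standard deviation. With variance as large as in the numbers above, the median and minimum are probably more informative than the mean.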

rdw-software commented 9 months ago

I can't reproduce the original results; adding a simple buffer size cache and using table.new made no difference. Strange...
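For reference, the idea behind a buffer size cache can be sketched as follows. This is a Python illustration under assumptions, not the code that was tested: the `BufferSizeCache` class and its JSON file are hypothetical, and in LuaJIT the preallocation itself would be a `table.new(size, 0)` call rather than a list multiplication.

```python
import json
from pathlib import Path
from typing import Dict, List


class BufferSizeCache:
    """Remembers how many elements each map needed on a previous run."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.sizes: Dict[str, int] = {}
        if path.exists():
            self.sizes = json.loads(path.read_text())

    def preallocate(self, map_name: str, fallback: int = 0) -> List[float]:
        # Known maps get an exactly sized buffer; unknown maps get the fallback.
        return [0.0] * self.sizes.get(map_name, fallback)

    def remember(self, map_name: str, used: int) -> None:
        # Persist the size that was actually needed so the next run can reuse it.
        self.sizes[map_name] = used
        self.path.write_text(json.dumps(self.sizes))
```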

Also tried caching the geometry, for now as JSON and binary (typed arrays, basically) only. Unsurprisingly this is a lot faster.


With a dumb JSON cache (250 ms worst case vs. 650 ms without caching), which is of course not practical:

evo -p client.lua schg_dun01
26%  DecodeFileEntries
19%  DecodeFileName
12%  FetchResourceByID
10%  discardTransparentPixels
 8%  LoadTerrainGeometry
 4%  CopyImageBytesToGPU
 3%  normalize
 3%  CreateInstance
2,21s user 0,46s system 96% cpu 2,765 total

Default (no caching whatsoever):

evo -p client.lua schg_dun01
19%  DecodeFileEntries
18%  DecodeFileName
11%  FetchResourceByID
 9%  discardTransparentPixels
 8%  GenerateSurfaceGeometry
 5%  nextPowerOfTwo
 3%  Add
 3%  GenerateGroundVertices
 3%  CreateInstance
2,76s user 0,40s system 97% cpu 3,247 total

With a simple binary geometry cache (worst case becomes 25(!) ms):

evo -p client.lua schg_dun01
26%  DecodeFileEntries
24%  DecodeFileName
16%  FetchResourceByID
13%  discardTransparentPixels
 4%  CreateInstance
 3%  CopyImageBytesToGPU
1,95s user 0,46s system 96% cpu 2,505 total

This level of speedup seems to also be observable for regular and smaller-sized maps:

evo -p client.lua prontera
# Cache miss
1,90s user 0,34s system 95% cpu 2,344 total
# Cache hit
1,55s user 0,29s system 95% cpu 1,928 total

evo -p client.lua pay_dun00
# Cache miss
1,66s user 0,32s system 94% cpu 2,089 total
# Cache hit
1,45s user 0,28s system 95% cpu 1,822 total

Evidently, compiling the geometry once and just dumping it to disk effectively eliminates the loading time (for terrain only).
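To put the end-to-end totals above into perspective, the following Python snippet computes how much wall-clock time the binary cache saved per map. The numbers are the `total` values from the `time` output (decimal commas converted to points). Each is a single run with the profiler enabled, so treat the percentages as rough.

```python
# (no cache or cache miss, binary cache hit) total wall-clock seconds per map
totals = {
    "schg_dun01": (3.247, 2.505),
    "prontera": (2.344, 1.928),
    "pay_dun00": (2.089, 1.822),
}


def relative_saving(before_s: float, after_s: float) -> float:
    """Fraction of the total run time removed by the cache."""
    return (before_s - after_s) / before_s


for name, (before_s, after_s) in totals.items():
    saved_ms = (before_s - after_s) * 1000.0
    print(f"{name}: {saved_ms:.0f} ms saved ({relative_saving(before_s, after_s):.1%})")
```

The savings come out at roughly 13 to 23 percent of the whole client startup, with the largest map benefiting the most.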


Conclusions:

Based on this experiment, implementing a binary cache seems like a worthwhile feature if it's kept stupid simple.
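A "stupid simple" binary cache could look roughly like the sketch below. It is written in Python for illustration only; the `load_geometry` function, the flat float32 layout and the file naming are assumptions, not the format used in the experiment. The raw dump uses the machine's native byte order, so the file is not portable between platforms.

```python
from array import array
from pathlib import Path
from typing import Callable


def load_geometry(cache_file: Path, generate: Callable[[], array]) -> array:
    """Return cached vertex data, generating and storing it on a cache miss."""
    if cache_file.exists():
        vertices = array("f")
        vertices.frombytes(cache_file.read_bytes())  # cache hit: raw copy only
        return vertices
    vertices = generate()  # cache miss: run the expensive step once
    cache_file.write_bytes(vertices.tobytes())
    return vertices
```

The cache hit path does no parsing at all, which is presumably why it is so much faster than the JSON variant.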

rdw-software commented 9 months ago

One other thing worth considering: What if the binary cache is compressed as well? POC using the miniz C API:

# Cache miss (with compression step)
evo -p client.lua schg_dun01 
3,30s user 0,52s system 97% cpu 3,926 total

# Cache hit (with decompression step)
evo -p client.lua schg_dun01 
2,09s user 0,44s system 96% cpu 2,621 total

# Cache miss (with compression step)
evo -p client.lua pay_dun00
1,67s user 0,26s system 95% cpu 2,033 total

# Cache hit (with decompression step)
evo -p client.lua pay_dun00
1,45s user 0,31s system 95% cpu 1,851 total

# Cache miss (with compression step)
evo -p client.lua prontera
2,15s user 0,34s system 96% cpu 2,581 total

# Cache hit (with decompression step)
evo -p client.lua prontera
1,58s user 0,31s system 95% cpu 1,973 total

Observations:

It seems that compression is not worth it. Could it be that the C API bindings add too much overhead and FFI would be faster? Disk usage of the cache is obviously much lower if compressed, on the order of 3 MB vs 20+ MB uncompressed, but that's irrelevant here. It shouldn't be too difficult to check how long the compression takes in the runtime, but after that I'm out of ideas (for now).
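Measuring the compression and decompression steps in isolation can be sketched like this. Python's `zlib` module stands in for miniz here (both implement Deflate), so the absolute numbers will not transfer to the runtime; `time_roundtrip` is a hypothetical helper name.

```python
import time
import zlib
from typing import Tuple


def time_roundtrip(payload: bytes, level: int = 6) -> Tuple[int, float, float]:
    """Return (compressed size, compress ms, decompress ms) for one payload."""
    start = time.perf_counter()
    packed = zlib.compress(payload, level)
    compress_ms = (time.perf_counter() - start) * 1000.0

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    decompress_ms = (time.perf_counter() - start) * 1000.0

    if unpacked != payload:
        raise ValueError("round trip changed the data")
    return len(packed), compress_ms, decompress_ms
```

Feeding it the uncompressed cache file of a map shows directly whether the decompression time alone exceeds what reading the larger file from disk would cost.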

rdw-software commented 9 months ago

In the miniz C API, there is one extra copy. However, avoiding it via FFI bindings won't change the end result:

-- schg_dun01
Uncompression and buffer resize time: 98401 microseconds = 98 ms
lua_pushlstring time: 9802 microseconds = 10 ms

-- pay_dun00
Uncompression and buffer resize time: 8463 microseconds = 8 ms
lua_pushlstring time: 1333 microseconds = 1 ms

-- prontera
Uncompression and buffer resize time: 29433 microseconds = 30 ms
lua_pushlstring time: 2663 microseconds = 3 ms

Looks like, in the very best case, both methods would roughly break even, the only difference being disk usage (meh).
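As a quick sanity check on the measurements above, the share of time spent in the extra `lua_pushlstring` copy can be computed as follows. This assumes the two timers cover disjoint, additive phases, which the log output does not state explicitly.

```python
# (decompression + resize, lua_pushlstring) in microseconds, from the logs above
timings = {
    "schg_dun01": (98401, 9802),
    "pay_dun00": (8463, 1333),
    "prontera": (29433, 2663),
}


def copy_share(decompress_us: int, copy_us: int) -> float:
    """Fraction of the combined time spent in the lua_pushlstring copy."""
    return copy_us / (decompress_us + copy_us)


for name, (decompress_us, copy_us) in timings.items():
    print(f"{name}: {copy_share(decompress_us, copy_us):.1%}")
```

Under that assumption the copy accounts for roughly 8 to 14 percent of the combined time, so removing it via FFI would leave the decompression cost itself largely untouched.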

rdw-software commented 9 months ago

One other point to keep in mind: Merely decompressing GND files from GRF can take 250 ms, so if lightmaps and terrain can be loaded from their compiled format, there isn't any reason to extract the file at all, which would reduce loading times even more. However, this only works if all the data is compiled, including water planes and texture paths, possibly even the grid dimensions.

Not that this is difficult to do, but it adds complexity, as there are effectively two supported binary formats, in multiple versions.
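One way to keep the versioning part manageable is a small header in front of the compiled data, so that a stale or truncated cache is simply rejected and rebuilt. The sketch below is a Python illustration; the `GEOC` magic, the field layout and the function names are all hypothetical and not an existing RagLite format.

```python
import struct
from typing import Optional

MAGIC = b"GEOC"      # hypothetical tag, not a real RagLite format
CACHE_VERSION = 1    # bump whenever the compiled layout changes
HEADER = struct.Struct("<4sHI")  # magic, cache version, payload length


def pack_cache(compiled: bytes) -> bytes:
    """Prefix the compiled payload with the header."""
    return HEADER.pack(MAGIC, CACHE_VERSION, len(compiled)) + compiled


def unpack_cache(blob: bytes) -> Optional[bytes]:
    """Return the compiled payload, or None if the cache must be rebuilt."""
    if len(blob) < HEADER.size:
        return None
    magic, version, length = HEADER.unpack_from(blob)
    if magic != MAGIC or version != CACHE_VERSION:
        return None
    payload = blob[HEADER.size:]
    if len(payload) != length:
        return None  # truncated or corrupted file
    return payload
```

Treating every mismatch as a cache miss means older cache versions never need to be read, only regenerated from the original GND file.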