Zylann / godot_voxel

Voxel module for Godot Engine
MIT License

Implement FastNoiseSIMD #111

Closed TokisanGames closed 2 years ago

TokisanGames commented 4 years ago

@Zylann Can you help me out with the LOD calculation and maybe other ways to optimize getting a whole set of data from FastNoiseSIMD?

Looking up one voxel at a time works fine but is 3x slower than using regular FastNoise. This image is value noise and uses the height bias. In the subsequent images, height bias has been removed for now.

image

Looking up a set works fine on LOD0, but the higher LODs don't line up; not even close. Here I've tried to be clever about applying the LOD to the scaling, but that doesn't work.

image

Commenting out the scaling line and just looking up sets without any thought of LOD also doesn't work.

image

Here's how the author recommends using it: https://github.com/Auburns/FastNoiseSIMD/wiki

Here are the noise functions available: https://github.com/tinmanjuggernaut/godot/blob/add_fastnoise/modules/noise/fast_noise_simd.h https://github.com/tinmanjuggernaut/godot/blob/add_fastnoise/thirdparty/noise/FastNoiseSIMD.h

If you want to build it yourself, use this Godot branch with this Voxel Tools branch.

VoxelStreamNoise + FastNoise works. VoxelStreamNoise + FastNoiseSIMD does not, so use the separate VoxelStreamFastNoiseSIMD for now.

Also note that FastNoiseSIMD can be multithreaded. Combining SIMD with multithreaded lookups allows getting very large blocks of noise up to 40x faster than regular FastNoise (see the benchmarks section, testing 8x1024x1024).

Maybe when the noise stream is initiated, it just grabs larger blocks of noise, then the emerge functions look up from the buffer.

Zylann commented 4 years ago

https://github.com/tinmanjuggernaut/godot_voxel/blob/dbd0f718707ad3ee31eec553641c3ee99d808693/streams/voxel_stream_fast_noise_simd.cpp#L148

My first thought was that you didn't increase the stride based on LOD. The function only looks for samples spaced 1 unit apart, but they should be spaced (1 << lod) units apart. You might have done that by scaling, but I suspect it ended up scaling from (0,0,0), so even if you space samples by the proper length, the origin of the block relative to your scaled noise won't be in the proper position. Another possibility is that you need the inverse of the scale, depending on how FastNoise interprets it.
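A minimal sketch of the stride idea, not the code from the linked branch. The helper name `sample_positions_1d` is hypothetical; it only lists the world-space coordinates a block would sample along one axis, spaced `1 << lod` apart and starting at the block's own origin rather than at a scaled (0,0,0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: world-space sample coordinates along one axis for a
// block that starts at `origin` (in LOD0 voxels) and holds `size` samples.
std::vector<int> sample_positions_1d(int origin, int size, int lod) {
    assert(size >= 0 && lod >= 0 && lod < 31);
    const int stride = 1 << lod; // 1 at LOD0, 2 at LOD1, 4 at LOD2, ...
    std::vector<int> positions;
    positions.reserve(static_cast<std::size_t>(size));
    for (int i = 0; i < size; ++i) {
        // Only the offset is multiplied by the stride; the origin is not.
        positions.push_back(origin + i * stride);
    }
    return positions;
}
```

For a block at origin 32 on LOD2, this yields 32, 36, 40, 44 and so on, so neighbouring blocks of the same LOD share the same sampling grid.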

Also I'd suggest you pool the noise set using a vector so it doesn't have to allocate it each time.

Also, if FastNoise can be told to generate the 3D area in ZXY order, you could accelerate the loop to be a single one copying to raw channel rather than a triple loop with set_voxel(). It might not bring much but it would bypass unnecessary boundary checks and index calculations, assuming your iteration lines up.

Note for future: I started investigating binary-search sampling from Transvoxel to reduce terracing on non-edited areas, which involves single-value random fetches from the generator, so in that case SIMD won't be usable. But if you can guarantee the SIMD results will line up with the non-SIMD ones, it should be ok.

TokisanGames commented 4 years ago

I've worked out three methods using FastNoiseSIMD:

  1. Singular voxel lookup ==> 3x slower than FastNoise singular voxel lookup
  2. Increasingly larger sets based on LOD size, then use a stride to only pull 1/(1<<lod) of the data out of the sets ==> Extremely slow. This also allocates huge blocks of noise (LOD5 is 512x512x512) at once which ends up being 100x slower than FastNoise and the Godot process takes up 1GB RAM!
  3. Using a scaling factor to get only 16x16x16 sized sets with noise scaled by the library. This wasn't working, but I worked with the author who figured it out.

I needed to do this. The last parameter is the scale: GetNoiseSet(origin.x>>lod, origin.y>>lod, origin.z>>lod, size.x, size.y, size.z, 1<<lod);
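To see why this call lines up, here is a small check written under an assumption the thread implies but does not spell out: that the library turns each set index into a position by multiplying `(start + i)` by the scale parameter. Under that assumption, and when the block origin is a multiple of `1 << lod`, shifting the origin down and scaling back up lands on the same positions as striding from the unshifted origin. `scaled_position` and `strided_position` are hypothetical names.

```cpp
#include <cassert>

// Assumed library behaviour: position = (start + i) * scale.
// Note: right-shifting a negative origin is implementation-defined before C++20.
int scaled_position(int origin, int i, int lod) {
    assert(lod >= 0 && lod < 31);
    const int start = origin >> lod; // value passed as the start coordinate
    const int scale = 1 << lod;      // value passed as the last parameter
    return (start + i) * scale;
}

// Reference: stride by (1 << lod) from the unshifted origin.
int strided_position(int origin, int i, int lod) {
    assert(lod >= 0 && lod < 31);
    return origin + i * (1 << lod);
}
```

For origin 512 at LOD3, sample 5 maps to 552 either way. With a misaligned origin (for example 4 at LOD3) the two functions disagree, so the equivalence relies on block origins being aligned to the LOD stride.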

I've pushed my updates so my branch is now working.

However, I haven't quite figured out the bias to get the open air on top yet based on noise sets, without broken meshes.

> Also I'd suggest you pool the noise set using a vector so it doesn't have to allocate it each time.

Do you mean preallocating a buffer of get_size() and give that to FN to populate so it doesn't have to allocate and free memory every time? Ok.
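One way to read the suggestion, sketched with a hypothetical `NoisePool` class rather than actual FastNoiseSIMD calls. The vector grows only when a larger block is requested, so repeated 16x16x16 requests reuse the same allocation. The real library may require aligned memory, which a plain `std::vector` does not guarantee; that detail is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pooled buffer: owns one float array and reuses it per block.
class NoisePool {
public:
    // Returns a pointer to at least `count` floats. Allocates only when the
    // request is larger than anything seen before.
    float *acquire(std::size_t count) {
        assert(count > 0);
        if (_buffer.size() < count) {
            _buffer.resize(count);
            ++_allocations;
        }
        return _buffer.data();
    }

    // Number of times the buffer had to grow.
    int allocations() const { return _allocations; }

private:
    std::vector<float> _buffer;
    int _allocations = 0;
};
```

If blocks are generated on several threads, each thread would need its own pool (or the pool would need locking), since the returned pointer is shared state.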

> Also, if FastNoise can be told to generate the 3D area in ZXY order, you could accelerate the loop to be a single one copying to raw channel rather than a triple loop with set_voxel().

I think I can swizzle the axes.

> Note for future: I started investigating binary-search sampling from Transvoxel to reduce terracing on non-edited areas, which involves single-value random fetches from the generator, so in that case SIMD won't be usable. But if you can guarantee the SIMD results will line up with the non-SIMD ones, it should be ok.

FastNoiseSIMD and FastNoise produce different results with the same settings.

FastNoiseSIMD with a singular lookup (a set of 1,1,1) iterated over 16^3 produces the same results as a set of 16^3. But the singular lookups are 3x slower. Are the singular lookups only used occasionally? If they are used for large sections, we might as well use FastNoise.

Note that FastNoise isn't any faster than OpenSimplexNoise. It only provides alternative shapes like cellular. SIMD is faster, but very challenging to use.

Should I stop trying to implement SIMD?

Zylann commented 4 years ago

> Are the singular lookups only used occasionally?

For now my plan was to first generate the area at the current LOD so SIMD can be used in this pass (which is what happens currently), and then refine the results with random-access lookups (which can optionally happen), but only near the isosurface where there is a zero-crossing. So it should be fairly occasional overall, but still relatively frequent in blocks containing a surface. In such a situation, noise shaping (as seen in VoxelStreamNoise) is a big winner. I discussed that with the creator of OpenSimplexNoise and it turns out he actually did that too^^
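A rough sketch of this kind of refinement, not the module's implementation: given two samples on either side of the isosurface, bisection narrows down the zero-crossing using single-value lookups, which is why this pass cannot rely on block-sized SIMD fetches. `find_zero_crossing` and the `sdf` callback are hypothetical.

```cpp
#include <cassert>
#include <functional>

// Bisection along one axis between a and b, where sdf(a) and sdf(b) have
// opposite signs. Each iteration costs one random-access lookup.
float find_zero_crossing(const std::function<float(float)> &sdf,
                         float a, float b, int iterations) {
    float fa = sdf(a);
    const float fb = sdf(b);
    assert((fa < 0.f) != (fb < 0.f)); // a zero-crossing must lie in between
    (void)fb;
    for (int i = 0; i < iterations; ++i) {
        const float mid = 0.5f * (a + b);
        const float fm = sdf(mid);
        if ((fm < 0.f) == (fa < 0.f)) {
            a = mid; // same sign as a: the crossing lies between mid and b
            fa = fm;
        } else {
            b = mid; // sign changed: the crossing lies between a and mid
        }
    }
    return 0.5f * (a + b);
}
```

With a flat plane such as `sdf(y) = y - 3.3` sampled between 0 and 8, about 20 iterations are enough to land well within a hundredth of a unit of 3.3.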

TokisanGames commented 4 years ago

I finally have FastNoiseSIMD working properly with LODs, height ranges, bias, and iso_scale! The caveat is that the height range must be a multiple of 512 (or probably size.y * (1 << max_lod)), otherwise the LODs won't line up at the top and bottom of the height range.
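The 512 figure is consistent with a 16-voxel block at LOD5 (16 << 5). A small helper, with the hypothetical name `snap_height_range`, that rounds a requested range up to the next multiple of `block_size << max_lod`, assuming that really is the required granularity:

```cpp
#include <cassert>

// Rounds `requested` up to the next multiple of block_size << max_lod.
int snap_height_range(int requested, int block_size, int max_lod) {
    assert(requested > 0 && block_size > 0 && max_lod >= 0 && max_lod < 24);
    const int step = block_size << max_lod; // 16 << 5 == 512
    return ((requested + step - 1) / step) * step;
}
```

For example, a requested range of 300 becomes 512 and 513 becomes 1024 with 16-voxel blocks and a maximum LOD of 5.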

FastNoiseSIMD loads the terrain about 1 second faster than FastNoise, which is on par with OSN (using Perlin Fractal).

Godot and Voxel Tools branches updated.

@Zylann Why did you choose period to set your iso_scale? const float iso_scale = noise.get_period() * 0.1;

Also why are height_start and height_range floats? You are rounding them anyway, why not just require ints?

Zylann commented 4 years ago

> Why did you choose period to set your iso_scale?

I'm not sure how I came to this or how valid it is, but as far as I remember, it was to preserve slopes so the variation speed of the SDF remained the same:

image

Because as I said on Discord, the faster the SDF varies, the blockier the result would be.

> why are height_start and height_range floats? You are rounding them anyway, why not just require ints?

I am rounding them when I calculate the upper and lower bounds in order to discard areas out of range, but otherwise it is not rounded for calculations in between.

TokisanGames commented 4 years ago

Hmm, FastNoise doesn't have a period on any of the algorithms. But 10-20 seems to work well for all (on LOD0).

Still to do: swizzle axes to copy the raw voxel data channels. Then finalize the Godot PRs now that a use case is demonstrated.

Though I wonder if FastNoiseSIMD is too fringe for merging into the core. The issues with it are:

Maybe I should just make it a plugin or incorporate it into voxel tools. What do you think?

TokisanGames commented 4 years ago

> Also, if FastNoise can be told to generate the 3D area in ZXY order, you could accelerate the loop to be a single one copying to raw channel rather than a triple loop with set_voxel(). It might not bring much but it would bypass unnecessary boundary checks and index calculations, assuming your iteration lines up.

It's returned as a flat array in XYZ order, but it's just 3D noise so no one would notice if it's rotated. Or the inputs could be swizzled.
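For illustration, here is what reordering the flat array could look like. Both layouts are assumptions made for this sketch: the noise set is taken to have z varying fastest (x outermost), and the destination to have y varying fastest (z outermost, the "ZXY" order mentioned earlier). `xyz_to_zxy` is a hypothetical helper; swizzling the inputs to the noise call instead would avoid this copy entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a flat array iterated as x, y, z (z fastest) into one indexed
// with y fastest, then x, then z.
std::vector<float> xyz_to_zxy(const std::vector<float> &src,
                              int sx, int sy, int sz) {
    assert(sx > 0 && sy > 0 && sz > 0);
    assert(src.size() == static_cast<std::size_t>(sx) * sy * sz);
    const std::size_t usx = static_cast<std::size_t>(sx);
    const std::size_t usy = static_cast<std::size_t>(sy);
    std::vector<float> dst(src.size());
    std::size_t src_i = 0;
    for (int x = 0; x < sx; ++x) {
        for (int y = 0; y < sy; ++y) {
            for (int z = 0; z < sz; ++z) {
                // Destination index: y + size_y * (x + size_x * z).
                const std::size_t dst_i = y + usy * (x + usx * z);
                dst[dst_i] = src[src_i++];
            }
        }
    }
    return dst;
}
```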

However since bias is calculated by Y every voxel at least one loop is needed to adjust the noise value, before sending the whole array over. Also they are floats, so probably needs to be converted to ints in another array. I looked through VoxelBuffer and didn't see anything exposed to be able to pass over a whole array or get the raw pointer. What functions would I use to get the data into the raw channel?
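The per-voxel adjustment could be done in one pass over the flat array before it is handed over. This sketch assumes a simple linear bias that goes from -1 at `height_start` to +1 at `height_start + height_range`, and a y-fastest layout, so the bias can be computed once per y value; the exact formula in the module may differ. `apply_height_bias` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// noise: flat array with y varying fastest, size_y samples per column.
// origin_y and stride give the world-space y of each sample.
void apply_height_bias(std::vector<float> &noise, int size_y, int origin_y,
                       int stride, float height_start, float height_range,
                       float iso_scale) {
    assert(size_y > 0 && height_range > 0.f);
    const std::size_t usy = static_cast<std::size_t>(size_y);
    assert(noise.size() % usy == 0);
    // Precompute one bias per y so the main loop is a single linear pass.
    std::vector<float> bias(usy);
    for (int y = 0; y < size_y; ++y) {
        const float world_y = static_cast<float>(origin_y + y * stride);
        const float t = (world_y - height_start) / height_range;
        bias[static_cast<std::size_t>(y)] = 2.f * t - 1.f; // assumed formula
    }
    for (std::size_t i = 0; i < noise.size(); ++i) {
        noise[i] = (noise[i] + bias[i % usy]) * iso_scale;
    }
}
```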

Zylann commented 4 years ago

> Also they are floats, so probably needs to be converted to ints in another array

Why convert to int?

> What functions would I use to get the data into the raw channel?

get_channel_raw is the one to use, which gives a wrapped pointer to an array of bytes. Something to do before calling this function is to make sure the buffer contains data (i.e. is not optimized as "uniform"), and here you will indeed have to convert to integer yourself, and also do it properly for the bit depth you are targeting.
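As a sketch of the conversion step only: mapping a float SDF value in [-1, 1] to an unsigned integer of a given bit depth. The mapping used here (clamp, shift to [0, 1], scale to the maximum code) is an assumption made for illustration and may not match the encoding the module expects; `quantize_sdf_u8` and `quantize_sdf_u16` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Clamp to [-1, 1] and remap to [0, 1].
static float normalize_sdf(float v) {
    assert(!std::isnan(v));
    return std::min(std::max(v, -1.f), 1.f) * 0.5f + 0.5f;
}

// 8-bit depth: codes 0..255.
std::uint8_t quantize_sdf_u8(float v) {
    return static_cast<std::uint8_t>(std::lround(normalize_sdf(v) * 255.f));
}

// 16-bit depth: codes 0..65535.
std::uint16_t quantize_sdf_u16(float v) {
    return static_cast<std::uint16_t>(std::lround(normalize_sdf(v) * 65535.f));
}
```

Values outside [-1, 1] saturate at the ends of the range, so the iso_scale applied beforehand decides how much of the distance field survives quantization.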

TokisanGames commented 4 years ago

I'm not having success with get_channel_raw. If I put this into VoxelGenerateFastNoiseSIMD::generate_block then I find that every block is uniform. No block returns a buffer channel with any depth or size.

// Query the channel state from inside generate_block.
ArraySlice<uint8_t> data;
bool uniform = buffer.is_uniform(_channel);
bool raw = buffer.get_channel_raw(_channel, data);
// chsize is printed with %zu, assuming data.size() returns size_t.
printf("Buffer (%d,%d,%d): uniform: %d, raw: %d, depth: %d, chsize: %zu\n",
    origin_in_voxels.x, origin_in_voxels.y, origin_in_voxels.z,
    uniform, raw,
    buffer.get_channel_depth(_channel), data.size()
);

TokisanGames commented 4 years ago

Generating landscapes takes the following times to complete generate_block() using set_voxel_f:

VoxelGeneratorNoise/OpenSimplexNoise: 348-1560 usec
VGN/FastNoise Value: 200-300 usec
VGN/FNSIMD Value: 640-1027 usec (singular)
VGFNSIMD Value: 45-114 usec (set)

Zylann commented 4 years ago

@tinmanjuggernaut SIMD looks promising, although I have no clue how to possibly make it work in the upcoming VoxelGeneratorGraph (even though that one comes with range analysis). Maybe still precomputing the noise set before the singular lookups.

TokisanGames commented 4 years ago

What about this comment https://github.com/Zylann/godot_voxel/issues/111#issuecomment-595732224 ?

Zylann commented 4 years ago

@tinmanjuggernaut all buffers passed to the generate function are uniform because they start filled with air. If you can confirm that what you are about to generate won't be uniform, then you can use the decompress_channel function to explicitly allocate the channel. After that, you can call compress_uniform_channels at the end to recompress the data if it turned out to be uniform.
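The order of calls can be illustrated on a tiny stand-in class. `MockChannel` is hypothetical and only mimics the idea of a channel that is either a single uniform value or an allocated array; the method names mirror the ones mentioned in this comment, but the signatures are not taken from the module.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class MockChannel {
public:
    explicit MockChannel(std::size_t volume) : _volume(volume) {
        assert(volume > 0);
    }

    bool is_uniform() const { return _data.empty(); }

    // Allocate the full array, filled with the current uniform value.
    void decompress_channel() {
        if (is_uniform()) {
            _data.assign(_volume, _uniform_value);
        }
    }

    // Raw access is only valid once the channel is allocated.
    std::vector<std::uint8_t> &raw() {
        assert(!is_uniform());
        return _data;
    }

    // Free the array again if every voxel ended up with the same value.
    void compress_uniform_channels() {
        if (is_uniform()) {
            return;
        }
        const std::uint8_t first = _data.front();
        const bool all_same = std::all_of(_data.begin(), _data.end(),
                [first](std::uint8_t v) { return v == first; });
        if (all_same) {
            _uniform_value = first;
            _data.clear();
            _data.shrink_to_fit();
        }
    }

private:
    std::size_t _volume;
    std::uint8_t _uniform_value = 0; // stands in for "air"
    std::vector<std::uint8_t> _data; // empty while uniform
};
```

A generator following this pattern would call `decompress_channel()` first, write through `raw()`, and finish with `compress_uniform_channels()` so that blocks which turned out uniform do not keep their allocation.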

Zylann commented 3 years ago

Update: FastNoise2 (which does SIMD) is currently in the codebase and has a quick test Godot class but cannot compile yet because Godot 3 is limited to C++14. It will wait for the switch to Godot 4 so C++17 can be used.

Zylann commented 2 years ago

FastNoise2 is now integrated in the master branch (because it uses Godot 4 which targets C++17).