Cubitect / cubiomes

C library that mimics the Minecraft biome generation.
MIT License
603 stars 105 forks source link

Small inaccuracies in biome generation at 1:4 scale #61

Open Badel2 opened 3 years ago

Badel2 commented 3 years ago

Hi, I have a repo where I compare the biomes generated by cubiomes with the biomes that are stored in the region files of a generated world:

https://github.com/badel2/cubiomes-test-generation

I tried it with a world generated using snapshot 1.18-rc3, and there are some small differences at some biome borders, usually only 1 block. Is there some option to enable high accuracy? My code is here.

Test cases: coordinates are at 1:4 scale, expected is the one read from the world files, got is the one from cubiomes: Seed 1234 Biome mismatch at (-95, -16, -56), expected 4 got 7 Biome mismatch at (-79, -16, -53), expected 7 got 0 Biome mismatch at (-136, -16, 13), expected 4 got 1

Some images, the incomplete image is the one from the world files:

seed_1234_cubiomes seed_1234_fastanvil

Diff, basically the 4 pixels that stand out:

seed_1234_cubiomes_fastanvil_diff

Cubitect commented 3 years ago

Oh I know, this is caused by MC-241546 and I don't think it is possible (or at least practical) to simulate the biome that gets stored in the save files by the current version, in particular since different parts of the Minecraft code disagree anyway.

Badel2 commented 3 years ago

Wait, so it's non-deterministic?

Cubitect commented 3 years ago

Well there is not the one biome mapping it 1.18-rc3. If you want get the biome generation that ends up in the save file, you can try to use sampleBiomeNoise() directly and provide a previous (unknown) sampling result in the dat argument and then generate the biomes for a whole chunk by iterating in the correct manner. I tried to do that for the stronghold generation though and from issue #60 I am assuming that it didn't entirely work correctly (I haven't investigated yet).

Cubitect commented 3 years ago

I have added a function genBiomeNoiseChunkSection() which should generate the biomes in the same order as the MC chunk generator. I would be curious if this produces accurate results, compared to the level file. Your test code looks useful :) While testing, I actually tried to make a rudimentary extraction of the biomes from a level file but had some trouble and didn't follow though with it in the end.

Badel2 commented 3 years ago

Thank you! I implemented a genBiomesAccurate using that function, and it does fix the cases I mentioned earlier. But I tried it with other seeds, and it seems there are still some differences, for example: (I used 1.18-rc4 this time)

seed -4100855569562546563 Biome mismatch at (640, -16, -16), expected 4 got 1

However I only see 2 pixel errors in the generated image, and with the old code there were about 25. So it's an improvement.

Also I tried to generate the same world 4 times and the biomes seem to match, so I guess that the biomes that are stored in the level file are deterministic, but not 100% sure yet because generating worlds takes very long. So I will try to gather more data.

Cubitect commented 3 years ago

The biomes for the chunk are generated all in one go, so the previous sampling positions are (almost) deterministic. To get an even better results you can try to also generate the chunk section below the one you are after to make sure the sampler is in the correct state. The chunk sections are generated from the bottom of the world upwards.

The prime candidate for non-deterministic biomes in the level storage is just the first (0,-16,0) biome coordinate in each chunk, since this will depend on whichever chunk was generated before it.

The issue with this bug is more that other parts, such as the population do their own biome generation which can disagree with the level storage.

Cubitect commented 3 years ago

So maybe something like this (completely untested):

int chunk_biomes[4][4][4];
for (int cz = (r.z) >> 2; cz < (r.z + r.sz) >> 2; cz++) {
    for (int cx = (r.x) >> 2; cx < (r.x + r.sx) >> 2; cx++) {
        uint64_t dat = 0;
        int cy = (r.y) >> 2;
        if (cy-1 >= -4) {
            // Get a hopefully accurate biome sampler state for 'dat'
            genBiomeNoiseChunkSection(&g->bn, chunk_biomes, cx, cy-1, cz, &dat);
        } // else { non-deterministic start of chunk }

        // Iterate over the chunk sections in vertically ascending order
        for (; cy < (r.y + r.sy) >> 2; cy++) {
            // Generate accurate biomes for this chunk section
            genBiomeNoiseChunkSection(&g->bn, chunk_biomes, cx, cy, cz, &dat);

            // Copy biomes to output buffer
            for (int by = 0; by < 4; by++) {
                for (int bz = 0; bz < 4; bz++) {
                    for (int bx = 0; bx < 4; bx++) {
                        int bb = chunk_biomes[bx][by][bz];

                        int x = (cx << 2) + bx;
                        int y = (cy << 2) + by;
                        int z = (cz << 2) + bz;
                        size_t idx = (x - r.x) + ((z - r.z) * r.sx) + ((y - r.y) * r.sx * r.sz);
                        cache[idx] = bb;
                    }
                }
            }
        }
    }
}
Badel2 commented 3 years ago

You seem to be right! I checked the world with seed -4100855569562546563 again, with the same code, and if I skip the y-level=-16 all the other biomes match. So I can simply skip that level and check the rest. And thanks for the code example!

TODO list for myself:

Badel2 commented 2 years ago

So quick update, I generated 4 worlds more, in 3 of them all the biomes match, but the last one has small differences. I used the final 1.18 release this time.

Seed -7088473816240932309 Biome mismatch at (-30, -7, -120), expected 175 got 45 Biome mismatch at (38, -5, -104), expected 175 got 45 Biome mismatch at (38, -5, -103), expected 175 got 45 Biome mismatch at (38, -5, -102), expected 175 got 45 (and I guess some more)

I also tried to generate the world with seed -7088473816240932309 4 times and the biomes in the level files were always the same, I can't seem to find a clear example of non-determinism.

Some images, at y-level=-5.

Cubiomes

map_cubiomes

World files

map_zip

Diff

map_diff

Cubitect commented 2 years ago

Thanks for the effort. I will have a look if I find something, but it might take a little. These types of issues are a pain to track down in the debugger..

Cubitect commented 2 years ago

After much testing and thought I have come to the conclusion that the biomes for the anvil storage will indeed be deterministic, since the last sampling result from any previous chunk will always be a bad initial state for the first sampling in the new chunk and thus will effectively reset.

The remaining discrepancies seem to be caused by floating point precision. Minecraft uses floats for the most part, but switches to doubles occasionally. my implementation does not mimic these switches exactly.

Badel2 commented 2 years ago

the last sampling result from any previous chunk will always be a bad initial state for the first sampling in the new chunk and thus will effectively reset.

So can that be emulated as a special dat value? Or it is something more complicated?

I have also been doing some research, trying to understand the p2overworld function. Correct me if I'm wrong, but that function converts a "climate" into a "biome id" using the "distance" to some biome parameters, which are encoded in the biome tree. The non-determinism comes from the way the biome tree is traversed, because it depends on some cache or something. But this only happens in case of tie: when there are two or more biomes at the exact same "distance". With that in mind, I can implement a function that returns the distance to the second closest biome, and if that distance is the same as the distance to the first biome, this is a "non-deterministic" point and I can just ignore it in the biome comparison. I did a quick test and all the examples from this thread are ties in biome distance.

However you just said that this non-determinism does not apply to biomes from the world files, so the remaining errors must be because the distance is incorrectly calculated, right? Like there is a tie in biome distance but that tie does not exist in the official implementation. The causes could be rounding errors or floating point shenanigans such as a + b + c being not equivalent to a + c + b. I don't know the details of floating point arithmetic in the JVM, but it would be interesting to check if the behavior changes depending on the CPU architecture or something.

So I don't know to which extent could the floating point issues be fixed, but if they cannot, then you could get a performance boost by compiling with -ffast-math (make sure to add tests because that can break a lot of code).

Cubitect commented 2 years ago

Pretty much. The noise biome generation can be separated into two steps:

First a bunch of noise generators are used to yield a set of integers which are meant to represent the 'climate'. This is the part where floating point inaccuracies can manifest. Floating point arithmetic does use the IEEE-754 standard, which dictates exactly how rounding is dealt with so you won't find any changes between JVMs or CPUs (however -ffast-math enables optimizations that violate this standard). So in principle if I do the exact same arithmetic in the same order, the result will be the same. However, this would restrict any optimization that can be made and it's sometimes difficult to follow the Minecraft code exactly, as nowadays it makes a lot of use of lambda functions, which are almost impossible to debug and can suddenly execute a completely different part of the code in some context that is not apparent from the code. So I'm currently not particularly keen to emulate all the instructions exactly to be honest.

(p2overworld:) The second step is to get a biome for the set of noise values that we got. The theory here is to define a bunch of fixed noise points that correspond to certain biomes. For the generated noise values, we then choose the fixed noise point that is closest and return the corresponding biome. The problem is that there are around 8000 of these fixed noise points and it would take far too long to check all of them to find the closest (or second closest for that matter). Furthermore, common techniques to speed up a nearest neighbor search, such as a Barnes-Huts tree don't really work here because the points are 6-dimensional (the required segmentation of space would approach the number of data points, which would nullify any performance benefits). Instead Mojang just opted to sort the points into a tree that segments the points into groups with power-of-10 sizes. This is arbitrary and will not all ways find the noise point which is closest, but limits the number of distances that have to be calculated, while still yielding something that is close to the true closest point.

However, this tree search was still a performance bottleneck, so Mojang decided to save the previously closest node (i.e. fixed noise parameter) and abort the tree traversal if the saved result seems to be a closer or equal distance to any node in the tree. Unfortunately, since the tree does not always yield the true closest node, if the saved result is better or equal to the tree result then it may be different. This part is not a just a rounding issue and I expect there to be situations where the tree does not even find the second or third closest point.

Cubitect commented 2 years ago

Oh, and regarding the deterministic 'reset' from chunk to chunk; almost all the fixed noise points have two copies for the depth parameter, one at depth=0 and one at depth=10000. In contrast to all the other noise values, the depth has a component that relates directly to the world height. So when the biome generation for one chunk ends at the top of the world height, then the saved noise point will be a large 'distance' away from the new noise value, sampled near the bottom of the world. So the saved noise point will never get re-used in this case.