becheran / fast-hilbert

Fast Hilbert space-filling curve transformation using a LUT
MIT License
38 stars 3 forks source link

Get rid of `state` to improve performance (…failed?) #4

Open DoubleHyphen opened 3 years ago

DoubleHyphen commented 3 years ago

I thought I'd try to figure out a way to get rid of the state parametre in both the LUT's index and the results it outputs. I succeeded in achieving that, but to my astonishment it ended up consistently ~50% slower! I tried optimising things further here and there, but no consistent result came up. Here is the main.rs file that explains how I did all those things; in the lib.rs file you'll see the actual methods I used.

That said…

You said in your benchmarks that fast_hilbert is around 2-2.5 times faster than other comparable methods. My own benchmarks show the improvement to be on the order of 8-10 times faster! I strongly suspect that our benchmarks lead to different results. Thus, if this seems interesting to you, feel free to experiment as well and see; otherwise, just close the issue.

PS: The main reason I ultimately decided to show you the work I did is because I think that this will help illuminate the way to generalise to other dimensions.

becheran commented 3 years ago

Interesting.

Did you use the extra two bits to increase the LUT size before you benchmarked? The performance drop can be explained by the bit operation overhead (which is low, but not zero). A LUT lookup on a modern CPU with a hot cache is very faster. But that said, my implementation might be slower on a system without a cache such as a low cost microcontoller. On the other hand on such a system you might better also want to avoid a 256 Byte LUT anyways...

Actually I ran the benchmark again on a full 256 * 256 2D space and the performance gain was more in the range of 5 times faster. See the updated readme. But again, it also depends on the system you are benchmarking on. And the 2.5 times was before your pr which did also increase the performance because counting zeros is faster than the iteration over the LUT and saves time for smaller numbers.

DoubleHyphen commented 3 years ago

I figured out how to make it go faster. However, if I make it calculate the new state imperatively, it merely matches your performance; to make it go faster (by about a third) I had to also extract the state into a new LUT. Please find an example implementation for [u32; 2] -> u64 below:


const STATE_LUT: [u8; 256] = [0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 1, 0, 0, 3, 0, 1, 1, 2, 0, 1, 1, 2, 3, 2, 2, 1, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 2, 3, 3, 0, 3, 2, 2, 1, 3, 2, 2, 1, 0, 1, 1, 2, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 3, 2, 2, 1, 2, 3, 3, 0, 2, 3, 3, 0, 1, 0, 0, 3, 0, 1, 1, 2, 1, 0, 0, 3, 1, 0, 0, 3, 2, 3, 3, 0];

const LUT_4_BPC: [u8; 256] = [
    000, 001, 014, 015, 016, 019, 020, 021, 234, 235, 236, 239, 240, 241, 254, 255, 
    003, 002, 013, 012, 017, 018, 023, 022, 233, 232, 237, 238, 243, 242, 253, 252, 
    004, 007, 008, 011, 030, 029, 024, 025, 230, 231, 226, 225, 244, 247, 248, 251, 
    005, 006, 009, 010, 031, 028, 027, 026, 229, 228, 227, 224, 245, 246, 249, 250, 
    058, 057, 054, 053, 032, 035, 036, 037, 218, 219, 220, 223, 202, 201, 198, 197, 
    059, 056, 055, 052, 033, 034, 039, 038, 217, 216, 221, 222, 203, 200, 199, 196, 
    060, 061, 050, 051, 046, 045, 040, 041, 214, 215, 210, 209, 204, 205, 194, 195, 
    063, 062, 049, 048, 047, 044, 043, 042, 213, 212, 211, 208, 207, 206, 193, 192, 
    064, 067, 068, 069, 122, 123, 124, 127, 128, 131, 132, 133, 186, 187, 188, 191, 
    065, 066, 071, 070, 121, 120, 125, 126, 129, 130, 135, 134, 185, 184, 189, 190, 
    078, 077, 072, 073, 118, 119, 114, 113, 142, 141, 136, 137, 182, 183, 178, 177, 
    079, 076, 075, 074, 117, 116, 115, 112, 143, 140, 139, 138, 181, 180, 179, 176, 
    080, 081, 094, 095, 096, 097, 110, 111, 144, 145, 158, 159, 160, 161, 174, 175, 
    083, 082, 093, 092, 099, 098, 109, 108, 147, 146, 157, 156, 163, 162, 173, 172, 
    084, 087, 088, 091, 100, 103, 104, 107, 148, 151, 152, 155, 164, 167, 168, 171, 
    085, 086, 089, 090, 101, 102, 105, 106, 149, 150, 153, 154, 165, 166, 169, 170, 
];
    fn hilbert_index(&self) -> u64 {
        fn separate_nybbles(x: u32) -> u64 {
            let mut x = x as u64;
            x |= x<<16;
            x &= 0x0000_FFFF_0000_FFFF;
            x |= x<<8;
            x &= 0x00FF_00FF_00FF_00FF;
            x |= x<<4;
            x &= 0x0F0F_0F0F_0F0F_0F0F;
            x
        }
        let bloated_1 = separate_nybbles(self[0]);
        let bloated_2 = separate_nybbles(self[1]);
        let key_0 = bloated_2 << 4 | bloated_1;
// Now, each byte of key_0 has one nybble of x and one nybble of y.
        let key_1 = bloated_1 << 4 | bloated_2;
        let key_2 = !key_0;
        let key_3 = !key_1;
        let combineds = [key_0, key_1, key_2, key_3];
        let mut state = 0u8;
        let mut result = [0u8; 8];
        for i in 0..8 {
            let index = combineds[state as usize].to_be_bytes()[i];
            let new_thing = LUT_4_BPC[index as usize];
            result[i] = new_thing;
            state ^= STATE_LUT[new_thing as usize];
        }
        u64::from_be_bytes(result)
    }

Since we have the state LUT already, and since it's indexed by the Hilbert index's keys rather than X and Y, I think it would probably be possible to also use the same LUT to also hep accelerate the inverse transformation. Any ideas?

becheran commented 3 years ago

I am not 100% sure how you crated the state LUT. But you still have 6 "free" bits in the state LUT since you only used 2 so far for the state. So one pragmatic approach could be to compute the inverser LUT and add the result in the existing LUT as well. Does this make sense?