DotBots / DotBot-firmware

Firmware applications used to control DotBots and SailBots
http://www.dotbots.org
BSD 3-Clause "New" or "Revised" License
5 stars 8 forks source link

Optimizing LH2 code #271

Closed SaidAlvarado closed 7 months ago

SaidAlvarado commented 8 months ago

LH2 Code Benchmark:

The main bottle necks are select_polynomial(), and reverse_count_p(). Their worst case scenarios stall the CPU for too long. If we can optimize them to run faster, or prioritize speed over packet loss, we might be able to get the system running real time.

SaidAlvarado commented 8 months ago

Select_Polynomial()

Change *start_val

most of the the bit errors occur early in the spi sequence, if you change *start_val, to a higher value than 0, the function usually finds the polynomial in a single iteration. At the expense of more polynomial errors.

The following test used 4 lighthouses, at 1~2 meter distance

units are in miliseconds Start val == 0 mean 7.4ms, std: 10.4ms, min: 3.0, max: 84.7, poly errors: 31/2928 errors, non-single-iteration-find: 1144/2928 Start val == 1 mean 5.5ms, std: 8.4ms, min: 2.9, max: 81.5, poly errors: 21/2920 errors, non-single-iteration-find: 686/2920 Start val == 2 mean 4.4ms, std: 6.5ms, min: 2.9, max: 78.5, poly errors: 29/5153 errors, non-single-iteration-find: 809/5153 Start val == 3 mean 4.7ms, std: 9.5ms, min: 2.8, max: 75.7, poly errors: 32/2122 errors, non-single-iteration-find: 226/2122 Start val == 4 mean 3.7ms, std: 5.6ms, min: 2.9, max: 77.3, poly errors: 14/3060 errors, non-single-iteration-find: 189/3060 Start val == 5 mean 3.7ms, std: 7.7ms, min: 2.7, max: 70.1, poly errors: 33/3000 errors, non-single-iteration-find: 126/3000 Start val == 6 mean 3.4ms, std: 6.1ms, min: 2.6, max: 67.3, poly errors: 24/3054 errors, non-single-iteration-find: 126/3054 Start val == 7 mean 2.8ms, std: 3.7ms, min: 2.6, max: 64.7, poly errors: 7/2419 errors, non-single-iteration-find: 27/2419 Start val == 8 mean 2.9ms, std: 4.6ms, min: 2.5, max: 62, poly errors: 11/2936 errors, non-single-iteration-find: 30/2936 mean 3.3ms, std: 6.5ms, min: 2.5, max: 62.1, poly errors: 20/2769 errors, non-single-iteration-find: 48/2769 Number of iterations of the non-error long computations: 2 iterations: 7 3 iterations: 2 4 iterations: 2 7 iterations: 1 9 iterations: 3 10 iterations 12 iteration: 12 (Check that this is actually a correct polynomial and not a miss found one. Then find out which configuration is this one.) Start val == 9 mean 3.1ms, std: 5.9ms, min: 2.4, max: 59.6, poly errors: 24/2759 errors, non-single-iteration-find: 37/22759

SaidAlvarado commented 8 months ago

Limit iterations

This more aggressively prioritizes the center of the captured data, which is more likely to match, and diminishes how many calculations are needed before an error polynomial is detected

mean 2.7ms, std: 1.2ms, min: 2.5, max: 11.9, poly errors: 45/3407 errors, non-single-iteration-find: 79/3407, results discarded by aggressive prunning: 4

Most non-single-iterations-find went from 5ms to 8.7ms New error poly time: 11.8ms

SaidAlvarado commented 8 months ago

Bug: Start_val=9 and bits_N_comp=39

For some reason, this particular configuration of values. causes a weird behavior were all the weights are computed the same, but only in that particular iteration. image

This is the reason most non-single-iterations-find went from 5ms to 8.7ms

I'll fix it by making sure this particular combination never happens, the tested cases will be:

start_val bit_N_for_comp
8 39
9 38
8 29
9 28
8 19
9 18

Results:

mean 2.5ms, std: 0.7ms, min: 2.5, max: 9.6, poly errors: 33/3714 errors, non-single-iteration-find: 119/3714, results discarded by aggressive prunning: 2

SaidAlvarado commented 8 months ago

_reverse_count_p()

There are two main components to this function:

  1. LSFR Update: 8.26us
  2. Checkpoint Check: 3.88us

The worst case scenario of the reverse_count_p() function is roughly 22.6ms

Improvement 1: LSFR update popcount

The LSFR requires a cumulative XOR operation over 17bits, this requires a for loop, replacing this with an inbuild_popcount ARM instruction makes it much faster.

Current implementation 6.3us

        b17         = buffer & 0x00000001;               // save the "newest" bit of the buffer
        buffer      = (buffer & (0x0001FFFE)) >> 1;      // shift the buffer right, backwards in time
        masked_buff = (buffer) & (_polynomials[index]);  // mask the buffer w/ the selected polynomial
        for (ii = 0; ii < 17; ii++) {
            result = result ^ (((masked_buff) >> ii) & (0x00000001));  // cumulative sum of buffer&poly
        }
        result = result ^ b17;
        buffer = buffer | (result << 16);  // update buffer w/ result
        result = 0;                        // reset result
        count++;

New implementation 320ns

        b17         = buffer & 0x00000001;               // save the "newest" bit of the buffer
        buffer      = (buffer & (0x0001FFFE)) >> 1;      // shift the buffer right, backwards in time
        masked_buff = (buffer) & (_polynomials[index]);  // mask the buffer w/ the selected polynomial
        buffer = buffer | (((__builtin_popcount(masked_buff) ^ b17) & 0x00000001) << 16);
        count++;

Improvement 2: Copy checkpoints to local variables to improve access speed from Flash to RAM

Check point checking went from 4us to 2us

    // Copy const variables (Flash) into local variables (RAM) to speed up execution.
    uint32_t _end_buffers_local[NUM_LSFR_COUNT_CHECKPOINTS];
    uint32_t polynomials_local = _polynomials[index];

    for (size_t i = 0; i < NUM_LSFR_COUNT_CHECKPOINTS; i++)
    {
        _end_buffers_local[i] = _end_buffers[index][i];
    }

Improvement 3: Check checkpoint done with Hashtable

The hash used is the lower 11 bits of the checkpoints. The table is built like this:

void _fill_hash_table(uint8_t * hash_table){

    // Iterate over all the checkpoints and save the HASH_TABLE_BITS 11 bits as a a index for the hashtable
    for (size_t poly = 0; poly < LH2_4_BASESTATION_COUNT*2; poly++)
    {
        for (size_t checkpoint =  1; checkpoint < NUM_LSFR_COUNT_CHECKPOINTS; checkpoint++)
        {
            hash_table[_end_buffers[poly][checkpoint] & HASH_TABLE_MASK] = checkpoint;
        }
    } 
}

And then the table is read like this:

        hash_index = _end_buffers_hashtable[buffer & HASH_TABLE_MASK];
        if (buffer == _end_buffers_local[hash_index]){
            count = count + 8192*hash_index - 1;
            buffer = _end_buffers_local[0];
        }

Worst-case scenario:

Improvement 4: Save last checkpoint

First attempt: Tiny hashtable

Worst-case scenario:

Improvement 5: Increase number of checkpoints

Worst-case scenario: (tested without the dynamic checkpoints)

Dynamic checkpoints ON scenario:

SaidAlvarado commented 8 months ago

Other Random Upgrades

Baseline:

  1. Demodulate light: 1.38ms
  2. Determine Polynomial 2.57ms

The worst case scenario of the reverse_count_p() function is roughly 22.6ms

Improvement 1: Demudulate_light() 64 bits instead of 128

THe SPI buffer size is 64, no need to process 128 bits as it's currently set up to do.

Broke the code, need to look more into this

Improvement 2: Change hamming weight to _pop_count()

Perhaps the inbuilt is faster than fil's implementation.

Nope, Determine_polynomial went from 2.57 to 2.51

Improvement 3: CHange Poly_check to use your faster LFSR implementation

Perhaps the inbuilt is faster than fil's implementation.

Yes, determine polynomial went waaaaaay down.

  1. Determine Polynomial mean: 333us, max: 1.25ms, min:323us
SaidAlvarado commented 7 months ago

Merged into the branch created for issue #268