SaidAlvarado commented 8 months ago

LH2 Code Benchmark:

SPI ISR: 11.45 us +- 15 ns
- _add_to_spi_ring_buffer(): 10.55us +- 10.5 ns
  - (original code was 8.3us)
- db_timer_hf_now(): 711 ns +-9.9ns
- db_lh2_4_process_raw_data(): 15.04 ms +- 23.63 (min: 4.7, max: 95.3) (74/1185 error vs. good poly) ( the poly errors cause the 95ms worst case.)
- demodulate_light(): 1.37ms +- 29us
- select_polynomial(): 17.0ms +- 27.1ms (min: 3.12, max: 90.0) (78/793 errors)
- process_location():
- reverse_count_p() 44.59ms +- 29.26ms (min: 1.6 max: 98.27)

The main bottle necks are select_polynomial(), and reverse_count_p(). Their worst case scenarios stall the CPU for too long. If we can optimize them to run faster, or prioritize speed over packet loss, we might be able to get the system running real time.

SaidAlvarado commented 8 months ago

Select_Polynomial()

Change *start_val

most of the the bit errors occur early in the spi sequence, if you change *start_val, to a higher value than 0, the function usually finds the polynomial in a single iteration. At the expense of more polynomial errors.

The following test used 4 lighthouses, at 1~2 meter distance

units are in miliseconds Start val == 0 mean 7.4ms, std: 10.4ms, min: 3.0, max: 84.7, poly errors: 31/2928 errors, non-single-iteration-find: 1144/2928 Start val == 1 mean 5.5ms, std: 8.4ms, min: 2.9, max: 81.5, poly errors: 21/2920 errors, non-single-iteration-find: 686/2920 Start val == 2 mean 4.4ms, std: 6.5ms, min: 2.9, max: 78.5, poly errors: 29/5153 errors, non-single-iteration-find: 809/5153 Start val == 3 mean 4.7ms, std: 9.5ms, min: 2.8, max: 75.7, poly errors: 32/2122 errors, non-single-iteration-find: 226/2122 Start val == 4 mean 3.7ms, std: 5.6ms, min: 2.9, max: 77.3, poly errors: 14/3060 errors, non-single-iteration-find: 189/3060 Start val == 5 mean 3.7ms, std: 7.7ms, min: 2.7, max: 70.1, poly errors: 33/3000 errors, non-single-iteration-find: 126/3000 Start val == 6 mean 3.4ms, std: 6.1ms, min: 2.6, max: 67.3, poly errors: 24/3054 errors, non-single-iteration-find: 126/3054 Start val == 7 mean 2.8ms, std: 3.7ms, min: 2.6, max: 64.7, poly errors: 7/2419 errors, non-single-iteration-find: 27/2419 Start val == 8 mean 2.9ms, std: 4.6ms, min: 2.5, max: 62, poly errors: 11/2936 errors, non-single-iteration-find: 30/2936 mean 3.3ms, std: 6.5ms, min: 2.5, max: 62.1, poly errors: 20/2769 errors, non-single-iteration-find: 48/2769 Number of iterations of the non-error long computations: 2 iterations: 7 3 iterations: 2 4 iterations: 2 7 iterations: 1 9 iterations: 3 10 iterations 12 iteration: 12 (Check that this is actually a correct polynomial and not a miss found one. Then find out which configuration is this one.) Start val == 9 mean 3.1ms, std: 5.9ms, min: 2.4, max: 59.6, poly errors: 24/2759 errors, non-single-iteration-find: 37/22759

SaidAlvarado commented 8 months ago

Limit iterations

Start_val: 8 to 9
Threshold: 4 to 2 (threshold 1 causes miss calculations that would be better to throw away.)
bit_N_for_comp: 39 to 19 (going down 10 by 10)

This more aggressively prioritizes the center of the captured data, which is more likely to match, and diminishes how many calculations are needed before an error polynomial is detected

mean 2.7ms, std: 1.2ms, min: 2.5, max: 11.9, poly errors: 45/3407 errors, non-single-iteration-find: 79/3407, results discarded by aggressive prunning: 4

Most non-single-iterations-find went from 5ms to 8.7ms New error poly time: 11.8ms

SaidAlvarado commented 8 months ago

Bug: Start_val=9 and bits_N_comp=39

For some reason, this particular configuration of values. causes a weird behavior were all the weights are computed the same, but only in that particular iteration.

This is the reason most non-single-iterations-find went from 5ms to 8.7ms

I'll fix it by making sure this particular combination never happens, the tested cases will be:

start_val	bit_N_for_comp
8	39
9	38
8	29
9	28
8	19
9	18

Results:

mean 2.5ms, std: 0.7ms, min: 2.5, max: 9.6, poly errors: 33/3714 errors, non-single-iteration-find: 119/3714, results discarded by aggressive prunning: 2

SaidAlvarado commented 8 months ago

_reverse_count_p()

There are two main components to this function:

LSFR Update: 8.26us
Checkpoint Check: 3.88us

The worst case scenario of the reverse_count_p() function is roughly 22.6ms

Improvement 1: LSFR update popcount

The LSFR requires a cumulative XOR operation over 17bits, this requires a for loop, replacing this with an inbuild_popcount ARM instruction makes it much faster.

Current implementation 6.3us

        b17         = buffer & 0x00000001;               // save the "newest" bit of the buffer
        buffer      = (buffer & (0x0001FFFE)) >> 1;      // shift the buffer right, backwards in time
        masked_buff = (buffer) & (_polynomials[index]);  // mask the buffer w/ the selected polynomial
        for (ii = 0; ii < 17; ii++) {
            result = result ^ (((masked_buff) >> ii) & (0x00000001));  // cumulative sum of buffer&poly
        }
        result = result ^ b17;
        buffer = buffer | (result << 16);  // update buffer w/ result
        result = 0;                        // reset result
        count++;

New implementation 320ns

        b17         = buffer & 0x00000001;               // save the "newest" bit of the buffer
        buffer      = (buffer & (0x0001FFFE)) >> 1;      // shift the buffer right, backwards in time
        masked_buff = (buffer) & (_polynomials[index]);  // mask the buffer w/ the selected polynomial
        buffer = buffer | (((__builtin_popcount(masked_buff) ^ b17) & 0x00000001) << 16);
        count++;

Improvement 2: Copy checkpoints to local variables to improve access speed from Flash to RAM

Check point checking went from 4us to 2us

    // Copy const variables (Flash) into local variables (RAM) to speed up execution.
    uint32_t _end_buffers_local[NUM_LSFR_COUNT_CHECKPOINTS];
    uint32_t polynomials_local = _polynomials[index];

    for (size_t i = 0; i < NUM_LSFR_COUNT_CHECKPOINTS; i++)
    {
        _end_buffers_local[i] = _end_buffers[index][i];
    }

Improvement 3: Check checkpoint done with Hashtable

The hash used is the lower 11 bits of the checkpoints. The table is built like this:

void _fill_hash_table(uint8_t * hash_table){

    // Iterate over all the checkpoints and save the HASH_TABLE_BITS 11 bits as a a index for the hashtable
    for (size_t poly = 0; poly < LH2_4_BASESTATION_COUNT*2; poly++)
    {
        for (size_t checkpoint =  1; checkpoint < NUM_LSFR_COUNT_CHECKPOINTS; checkpoint++)
        {
            hash_table[_end_buffers[poly][checkpoint] & HASH_TABLE_MASK] = checkpoint;
        }
    } 
}

And then the table is read like this:

        hash_index = _end_buffers_hashtable[buffer & HASH_TABLE_MASK];
        if (buffer == _end_buffers_local[hash_index]){
            count = count + 8192*hash_index - 1;
            buffer = _end_buffers_local[0];
        }

Worst-case scenario:

reverse_count_p(): Went from 98.6ms to 10.6ms
Checkpoint check: 4.02us to 420ns

Improvement 4: Save last checkpoint

First attempt: Tiny hashtable

Worst-case scenario:

reverse_count_p(): 9.25ms
Checkpoint check: 1.4us
For some reason this checkpoint system only triggers it's benefits 5/47 times.
Second attempt: Save one checkpoint per sweep.

Worst-case scenario:
reverse_count_p(): ave: 23us, max: 466us, min:12us (if there is a checkpoint failure, it still takes 10ms to count the worst case scenario)
Checkpoint check: 1.4us Checkpoint system can be inefective ifdue to lag, the LH constantly skips the same sweep, biasing the running average too much towards one side.

Improvement 5: Increase number of checkpoints

Worst-case scenario: (tested without the dynamic checkpoints)

reverse_count_p(): 2.9ms
Checkpoint check: 1.7us

Dynamic checkpoints ON scenario:

reverse_count_p(): mean: 791us, min: 34.5us, 4.134ms

SaidAlvarado commented 8 months ago

Other Random Upgrades

Baseline:

Demodulate light: 1.38ms
Determine Polynomial 2.57ms

The worst case scenario of the reverse_count_p() function is roughly 22.6ms

Improvement 1: Demudulate_light() 64 bits instead of 128

THe SPI buffer size is 64, no need to process 128 bits as it's currently set up to do.

Broke the code, need to look more into this

Improvement 2: Change hamming weight to _pop_count()

Perhaps the inbuilt is faster than fil's implementation.

Nope, Determine_polynomial went from 2.57 to 2.51

Improvement 3: CHange Poly_check to use your faster LFSR implementation

Perhaps the inbuilt is faster than fil's implementation.

Yes, determine polynomial went waaaaaay down.

Determine Polynomial mean: 333us, max: 1.25ms, min:323us

SaidAlvarado commented 7 months ago

Merged into the branch created for issue #268

DotBots / DotBot-firmware

Optimizing LH2 code #271

LH2 Code Benchmark:

Select_Polynomial()

Change *start_val

Limit iterations

Bug: Start_val=9 and bits_N_comp=39

Results:

_reverse_count_p()

Improvement 1: LSFR update popcount

Improvement 2: Copy checkpoints to local variables to improve access speed from Flash to RAM

Improvement 3: Check checkpoint done with Hashtable

Improvement 4: Save last checkpoint

First attempt: Tiny hashtable

Second attempt: Save one checkpoint per sweep.

Improvement 5: Increase number of checkpoints

Other Random Upgrades

Improvement 1: Demudulate_light() 64 bits instead of 128

Improvement 2: Change hamming weight to _pop_count()

Improvement 3: CHange Poly_check to use your faster LFSR implementation