Potential Improvement? Fixed sized stack / global allocation.

Andersama commented 1 year ago

The same features could be made to work with a fixed sized allocation, the size could be given with a templated parameter, no dynamic allocation required.

template<size_t _size>
class FixedRunningAverage {
//...

RobTillaart commented 1 year ago

Thanks for the idea, Might improve the footprint of the library, and maybe performance? Have you done some testing and compare dynamic allocation vs template?

Please provide a PR as I have little time to investigate.

Andersama commented 1 year ago

I don't think I have the tools to measure performance. However...I do have a mockup I used that's not feature complete in the same way your library is.

template<uint16_t _size>
struct FixedRunningAverage {
protected:
  uint16_t _count;
  uint16_t _index;
  //uint16_t _partial;
  float    _sum;
  float    _array[_size];
  //float    _min;
  //float    _max;
public:
  void clear() {
    _count = 0;
    _index = 0;
    _sum = 0.0;
    for (size_t i = 0; i < _size; i++) {
      _array[i] = NAN;
    }
  }

  void addValue(float value) {
    _sum -= _array[_index];
    _array[_index] = value;
    _sum += value; //_sum += _array[_index];
    _index++;

    if (_index == _size) _index = 0;  // faster than %

    //  handle min max
    //if (_count == 0) _min = _max = value;
    //else if (value < _min) _min = value;
    //else if (value > _max) _max = value;

    //  update count as last otherwise if ( _count == 0) above will fail
    _count += (_count < _size);
    //if (_count < _partial) _count++;
  }

  float getAverage()
  {
    if (_count == 0)
    {
      return NAN;
    }
    //  OPTIMIZE local variable for sum.
    float _new_sum = 0;
    for (uint16_t i = 0; i < _count; i++)
    {
      _new_sum += _array[i];
    }
    _sum = _new_sum;
    return _new_sum / _count;   // multiplication is faster ==> extra admin
  }

  //  the larger the size of the internal buffer 
  //  the greater the gain wrt getAverage()
  float getFastAverage() const
  {
    if (_count == 0)
    {
      return NAN;
    }

    return _sum / _count;   //  multiplication is faster ==> extra admin
  }

  void fillValue(float value, uint16_t count) {
    if (count >= _size) {
      _sum = value * _size;
      for (size_t i = 0; i < _size; i++) {
        _array[i] = value;
      }
      _count = _size;
      return;
    }

    for (size_t i = 0; i < count; i++) {
      addValue(value);
    }
  }
};

RobTillaart commented 1 year ago

Thanks for the code snippet, I can confirm it compiles (works). However as FixedRunningAverage is not functional identical comparison is ambiguous at best.

I don't think I have the tools to measure performance.

You could use ra_performance.ino which does a performance test for the library. (I used it to see if the template version compiled, and after stripping a lot it did)

Andersama commented 1 year ago

I did modify it a bit, was using to smooth temperature readings on my end and I didn't have a need for the min/max tracking. Although I might end up using that soon, so I'm sure in a little bit I'll end up writing what should be a match to the library as a whole anyway.

I did tweak:

  void fillValue(float value, uint16_t count) {
    if (count >= _size) {
      _sum = value * _size;
      for (size_t i = 0; i < _size; i++) {
        _array[i] = value;
      }
      _count = _size;
      return;
    }

    for (size_t i = 0; i < count; i++) {
      addValue(value);
    }
  }

looking at it I realize I missed the clear() call....so that would break things. I was using fillValue thinking it was something like appendValue or maybe appendValues.

RobTillaart commented 1 year ago

might be good to add a line

#define RUNNINGAVERAGE_LIB_VERSION    (F("0.4.3 template version"))

Andersama commented 1 year ago

Ok, I found your ra_performance.ino example, so I modified it like this: (so far I don't have other functions to test), but...it seems to verify there is a potential performance improvement.

Note: I suspect in this case the fixed allocation allows the compiler to completely eliminate all the work of the for loop, so there likely needs to be some non-compile time constant value as an input.

void setup() {
  Serial.begin(115200);
  Serial.print("\n Running Average test for Arduino: ");
  Serial.println(RUNNINGAVERAGE_LIB_VERSION);

  start = micros();
  FixedRunningAverage<16> fixed_16;
  FixedRunningAverage<32> fixed_32;
  FixedRunningAverage<64> fixed_64;
  FixedRunningAverage<128> fixed_128;
  FixedRunningAverage<256> fixed_256;
  stop = micros();
  Serial.print("5 fixed constructors\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  RunningAverage dynamic_16(16);
  RunningAverage dynamic_32(32);
  RunningAverage dynamic_64(64);
  RunningAverage dynamic_128(128);
  RunningAverage dynamic_256(256);
  stop = micros();
  Serial.print("5 dynamic constructors\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_16.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_16.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_32.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_32.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_64.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_64.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

    start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_128.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_128.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_256.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_256.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_16.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_16.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_32.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_32.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_64.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_64.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

    start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_128.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_128.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_256.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_256.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);
}

Results:

5 fixed constructors    4
5 dynamic constructors  320
fixed_16.addValue(100.0f);      4
fixed_32.addValue(100.0f);      4
fixed_64.addValue(100.0f);      4
fixed_128.addValue(100.0f);     4
fixed_256.addValue(100.0f);     4
dynamic_16.addValue(100.0f);    8236
dynamic_32.addValue(100.0f);    8160
dynamic_64.addValue(100.0f);    8296
dynamic_128.addValue(100.0f);   8272
dynamic_256.addValue(100.0f);   868

I'm not sure what's going on with dynamic_256 either.

RobTillaart commented 1 year ago

Try call addValue(var) And declare volatile float var = 100;

That should prevent compiler optimizations.

RobTillaart commented 1 year ago

Good work btw,👍

Andersama commented 1 year ago

I just tried random(), but I got stranger results after making the fixed template more like the original I definitely made more changes I forgot about.

Something very strange is happening with dynamic_256, it's always finishing far faster than the smaller ones.

Just to double check, it seems allocations potentially 128 or larger have this bizarre issue, what's the memory limit on the Arduino? I think the allocator's failing.

Newer results (w/ volatile) make a bit more sense:

fixed_16.addValue(non_constant);        288
fixed_32.addValue(non_constant);        284
fixed_64.addValue(non_constant);        280
fixed_128.addValue(non_constant);       284
fixed_256.addValue(non_constant);       284 
dynamic_16.addValue(non_constant);      8576
dynamic_32.addValue(non_constant);      8544
dynamic_64.addValue(non_constant);      8680
dynamic_128.addValue(non_constant);     8660
dynamic_256.addValue(non_constant);     1364
dynamic_512.addValue(non_constant);     1364

Andersama commented 1 year ago

I've found that inside clear() that a standard forwards loop appears to be faster:

  for (uint32_t i = 0; i < _size; i++) //508 microseconds
  {
    _array[i] = 0.0f;
  }
  /*
  for (uint16_t i = _size; i > 0; ) //520 microseconds
  {
    _array[--i] = 0.0; // keeps addValue simpler
  }
  */

RobTillaart commented 1 year ago

What board are you testing with? There is a diff uint32 vs uint16 .... Might be the cause of the performance diff

Andersama commented 1 year ago

Currently testing on an arduino uno rev 3 board.

RobTillaart commented 1 year ago

@Andersama Any progress to report, otherwise I close this item. It still is interesting, however low prio for me.

Andersama commented 1 year ago

Oh no worries about closing, I think there was a difference between the index sizes, I'll have to find my arduinos again.

RobTillaart / RunningAverage

Potential Improvement? Fixed sized stack / global allocation. #23