Closed Andersama closed 1 year ago
Thanks for the idea. It might improve the footprint of the library, and maybe performance? Have you done any testing to compare dynamic allocation against the template version?
Please provide a PR as I have little time to investigate.
I don't think I have the tools to measure performance. However, I do have a mockup I used; it's not feature complete the way your library is.
#include <math.h>    // NAN
#include <stddef.h>  // size_t
#include <stdint.h>  // uint16_t

// Fixed-size running average: the buffer length is a template parameter,
// so no dynamic allocation is needed.
// There is no constructor, so call clear() before first use.
template<uint16_t _size>
struct FixedRunningAverage {
protected:
  uint16_t _count;
  uint16_t _index;
  //uint16_t _partial;
  float _sum;
  float _array[_size];
  //float _min;
  //float _max;
public:
  void clear() {
    _count = 0;
    _index = 0;
    _sum = 0.0f;
    for (size_t i = 0; i < _size; i++) {
      // Must be 0, not NAN: addValue() subtracts the old slot from _sum,
      // and a NAN there would leave _sum as NAN from then on.
      _array[i] = 0.0f;
    }
  }
  void addValue(float value) {
    _sum -= _array[_index];
    _array[_index] = value;
    _sum += value; //_sum += _array[_index];
    _index++;
    if (_index == _size) _index = 0; // faster than %
    // handle min max
    //if (_count == 0) _min = _max = value;
    //else if (value < _min) _min = value;
    //else if (value > _max) _max = value;
    // update count as last otherwise if ( _count == 0) above will fail
    _count += (_count < _size);
    //if (_count < _partial) _count++;
  }
  // Recomputes the sum from the buffer, which also corrects accumulated
  // rounding error in _sum.
  float getAverage()
  {
    if (_count == 0)
    {
      return NAN;
    }
    // OPTIMIZE local variable for sum.
    float _new_sum = 0;
    for (uint16_t i = 0; i < _count; i++)
    {
      _new_sum += _array[i];
    }
    _sum = _new_sum;
    return _new_sum / _count; // multiplication is faster ==> extra admin
  }
  // the larger the size of the internal buffer
  // the greater the gain wrt getAverage()
  float getFastAverage() const
  {
    if (_count == 0)
    {
      return NAN;
    }
    return _sum / _count; // multiplication is faster ==> extra admin
  }
  // Note: this does not call clear() first, so for count < _size it
  // appends to whatever the buffer already holds.
  void fillValue(float value, uint16_t count) {
    if (count >= _size) {
      _sum = value * _size;
      for (size_t i = 0; i < _size; i++) {
        _array[i] = value;
      }
      _count = _size;
      return;
    }
    for (size_t i = 0; i < count; i++) {
      addValue(value);
    }
  }
};
Thanks for the code snippet, I can confirm it compiles (works). However, as FixedRunningAverage is not functionally identical, any comparison is ambiguous at best.
I don't think I have the tools to measure performance.
You could use ra_performance.ino which does a performance test for the library. (I used it to see if the template version compiled, and after stripping a lot it did)
I did modify it a bit. I was using it to smooth temperature readings on my end and I didn't need the min/max tracking. I might end up using that soon, though, so I'm sure I'll shortly end up writing what should be a match for the library as a whole anyway.
I did tweak:
void fillValue(float value, uint16_t count) {
  if (count >= _size) {
    _sum = value * _size;
    for (size_t i = 0; i < _size; i++) {
      _array[i] = value;
    }
    _count = _size;
    return;
  }
  for (size_t i = 0; i < count; i++) {
    addValue(value);
  }
}
Looking at it, I realize I missed the clear() call, so that would break things. I was using fillValue thinking it was something like appendValue or maybe appendValues.
It might be good to add a line:
#define RUNNINGAVERAGE_LIB_VERSION (F("0.4.3 template version"))
OK, I found your ra_performance.ino example and modified it as shown here. So far I don't have other functions to test, but it seems to confirm there is a potential performance improvement.
Note: I suspect that in this case the fixed allocation allows the compiler to completely eliminate all the work of the for loop, so the input likely needs to be a value that is not a compile-time constant.
void setup() {
  Serial.begin(115200);
  Serial.print("\n Running Average test for Arduino: ");
  Serial.println(RUNNINGAVERAGE_LIB_VERSION);

  start = micros();
  FixedRunningAverage<16> fixed_16;
  FixedRunningAverage<32> fixed_32;
  FixedRunningAverage<64> fixed_64;
  FixedRunningAverage<128> fixed_128;
  FixedRunningAverage<256> fixed_256;
  stop = micros();
  Serial.print("5 fixed constructors\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  RunningAverage dynamic_16(16);
  RunningAverage dynamic_32(32);
  RunningAverage dynamic_64(64);
  RunningAverage dynamic_128(128);
  RunningAverage dynamic_256(256);
  stop = micros();
  Serial.print("5 dynamic constructors\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_16.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_16.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_32.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_32.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_64.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_64.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_128.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_128.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    fixed_256.addValue(100.0f);
  stop = micros();
  Serial.print("fixed_256.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_16.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_16.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_32.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_32.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_64.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_64.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_128.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_128.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);

  start = micros();
  for(uint16_t i=0; i < 256; i++)
    dynamic_256.addValue(100.0f);
  stop = micros();
  Serial.print("dynamic_256.addValue(100.0f);\t");
  Serial.println(stop - start);
  total += (stop - start);
}
Results (elapsed time in microseconds, as reported by micros()):
5 fixed constructors 4
5 dynamic constructors 320
fixed_16.addValue(100.0f); 4
fixed_32.addValue(100.0f); 4
fixed_64.addValue(100.0f); 4
fixed_128.addValue(100.0f); 4
fixed_256.addValue(100.0f); 4
dynamic_16.addValue(100.0f); 8236
dynamic_32.addValue(100.0f); 8160
dynamic_64.addValue(100.0f); 8296
dynamic_128.addValue(100.0f); 8272
dynamic_256.addValue(100.0f); 868
I'm not sure what's going on with dynamic_256 either.
Try calling addValue(var) and declaring volatile float var = 100;
That should prevent the compiler from optimizing the calls away.
Good work btw,👍
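The volatile suggestion above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `TinyAverage` and `fillFromVolatile` are hypothetical names made up for this example, and only the name `non_constant` is borrowed from the later benchmark output in this thread.

```cpp
#include <stdint.h>

// Hypothetical stand-in for the fixed-size buffer discussed in this thread.
template <uint16_t N>
struct TinyAverage {
  float values[N] = {};  // zero-initialised so the first subtraction is safe
  float sum = 0.0f;
  uint16_t index = 0;
  uint16_t count = 0;

  void addValue(float v) {
    sum -= values[index];   // drop the sample being overwritten
    values[index] = v;
    sum += v;
    if (++index == N) index = 0;
    if (count < N) ++count;
  }

  float average() const { return count ? sum / count : 0.0f; }
};

// Every read of a volatile object counts as an observable access, so the
// compiler has to perform one real load per iteration instead of treating
// the input as a compile-time constant.
volatile float non_constant = 100.0f;

template <uint16_t N>
float fillFromVolatile(TinyAverage<N>& ra, uint16_t iterations) {
  for (uint16_t i = 0; i < iterations; i++) {
    ra.addValue(non_constant);  // one volatile load per call
  }
  return ra.average();
}
```

In a timing test, the `micros()` calls would wrap the call to `fillFromVolatile`, and printing the returned average afterwards gives the compiler one more reason to keep the work.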
I just tried random(), but I got stranger results after making the fixed template more like the original. I definitely made more changes that I forgot about.
Something very strange is happening with dynamic_256, it's always finishing far faster than the smaller ones.
Just to double-check: it seems allocations of potentially 128 or larger have this bizarre issue. What's the memory limit on the Arduino? I think the allocator's failing.
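As a rough check on the allocator hypothesis, the sample storage alone can be added up at compile time. The figures below are assumptions rather than measurements from this thread: an Uno-class board (ATmega328P) with 2048 bytes of SRAM, and a 4-byte float as on AVR. The function names are hypothetical.

```cpp
#include <stddef.h>

// Assumed limits: 2 KB of SRAM and a 4-byte float, as on an AVR-based Uno.
constexpr size_t kSramBytes = 2048;
constexpr size_t kFloatBytes = 4;

// Bytes of sample storage for one buffer of the given length.
constexpr size_t bufferBytes(size_t elements) {
  return elements * kFloatBytes;
}

// One full set of test buffers: 16, 32, 64, 128 and 256 samples.
constexpr size_t oneSetBytes() {
  return bufferBytes(16) + bufferBytes(32) + bufferBytes(64) +
         bufferBytes(128) + bufferBytes(256);
}

// The benchmark keeps a fixed set and a dynamic set alive at the same time.
constexpr size_t bothSetsBytes() { return 2 * oneSetBytes(); }

static_assert(oneSetBytes() == 1984, "496 floats at 4 bytes each");
static_assert(bothSetsBytes() > kSramBytes, "both sets exceed 2 KB");
```

Under these assumptions a single set already needs 1984 bytes, and both sets together need 3968, before counting the stack, globals, or serial buffers. That is consistent with the larger heap requests failing, though it does not prove it.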
Newer results (w/ volatile) make a bit more sense:
fixed_16.addValue(non_constant); 288
fixed_32.addValue(non_constant); 284
fixed_64.addValue(non_constant); 280
fixed_128.addValue(non_constant); 284
fixed_256.addValue(non_constant); 284
dynamic_16.addValue(non_constant); 8576
dynamic_32.addValue(non_constant); 8544
dynamic_64.addValue(non_constant); 8680
dynamic_128.addValue(non_constant); 8660
dynamic_256.addValue(non_constant); 1364
dynamic_512.addValue(non_constant); 1364
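One way to tell a failed allocation apart from a genuine speedup would be to have the buffer report whether it actually received memory. The class below is a hypothetical sketch for illustration only; whether RunningAverage exposes a comparable check is not stated in this thread.

```cpp
#include <stdint.h>
#include <stdlib.h>

// Hypothetical heap-backed buffer that records whether malloc() succeeded.
class CheckedBuffer {
 public:
  explicit CheckedBuffer(uint16_t size)
      : _size(size),
        _data(static_cast<float*>(malloc(size * sizeof(float)))) {
    if (_data == nullptr) _size = 0;  // allocation failed: behave as empty
  }
  ~CheckedBuffer() { free(_data); }  // free(nullptr) is a no-op

  // Non-copyable, so the buffer is never freed twice.
  CheckedBuffer(const CheckedBuffer&) = delete;
  CheckedBuffer& operator=(const CheckedBuffer&) = delete;

  bool isValid() const { return _data != nullptr; }
  uint16_t size() const { return _size; }

 private:
  uint16_t _size;
  float* _data;
};
```

Printing `isValid()` next to each timing line would show whether a suspiciously fast result came from an object that never got its buffer.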
I've found that inside clear() a standard forward loop appears to be faster:
for (uint32_t i = 0; i < _size; i++) //508 microseconds
{
  _array[i] = 0.0f;
}
/*
for (uint16_t i = _size; i > 0; ) //520 microseconds
{
  _array[--i] = 0.0; // keeps addValue simpler
}
*/
What board are you testing with? The two loops use different index types, uint32 vs uint16, which might be the cause of the performance difference.
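To separate the effect of loop direction from the effect of index width, both variants could use the same 16-bit index. A minimal sketch, with hypothetical free functions standing in for the two versions of clear():

```cpp
#include <stdint.h>

// Forward loop with a 16-bit index.
void clearForward(float* array, uint16_t size) {
  for (uint16_t i = 0; i < size; i++) {
    array[i] = 0.0f;
  }
}

// Backward loop with the same 16-bit index, so only the direction differs.
void clearBackward(float* array, uint16_t size) {
  for (uint16_t i = size; i > 0; ) {
    array[--i] = 0.0f;
  }
}

// Helper for checking that both variants produce the same result.
bool allZero(const float* array, uint16_t size) {
  for (uint16_t i = 0; i < size; i++) {
    if (array[i] != 0.0f) return false;
  }
  return true;
}
```

Timing these two against each other, and then repeating the test with a `uint32_t` index in both, would show which factor accounts for the gap reported above.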
Currently testing on an Arduino Uno Rev 3 board.
@Andersama Any progress to report? Otherwise I will close this item. It is still interesting, but low priority for me.
Oh, no worries about closing. I think there was a difference between the index sizes; I'll have to find my Arduinos again.
The same features could be made to work with a fixed-size allocation: the size could be given as a template parameter, with no dynamic allocation required.