flrs / HampelFilter

Arduino library for identifying outliers with a Hampel filter
Other
28 stars 5 forks source link

How does the scaling factor work ? #1

Closed JacoFourie closed 6 years ago

JacoFourie commented 6 years ago

Hi. I am playing with this lib to see if it will do what I need. But I don't understand the scaling factor. I need the find outliers that spike past lets say 5. if I have the temperature coming from a probe (28.5 , 28.9 , 29 , 29.6 , 38.8 , 29,9) then 38.8 will be the outlier. So I need the range of around 5 as the temp can not jump 5 degrees from one reeding to the next. What should the factor be then as I am getting small values shown as outliers when I use 5 or even 50.

JacoFourie commented 6 years ago

No matter what I put in that factor the same value will say outlier the one reading and then next it will say it is OK. So I don't understand this lib at all or it is not working.

flrs commented 6 years ago

Hi @JacoFourie,

I am sorry you are having such trouble. I can answer you more extensively later, but here is an example in Matlab of what should be going on inside the library. I am giving you an example for a window size of four points.

This is what happens when you check 38.8 against the buffer:

>> vals = [28.5, 28.9, 29, 29.6];
>> median_vals = median(vals)

median_vals =

  28.949999999999999

>> deviations = abs(vals-median_vals)

deviations =

   0.449999999999999   0.050000000000001   0.050000000000001   0.650000000000002

>> median_deviations = median(deviations)

median_deviations =

   0.250000000000000

>> scaling_factor = 3;
>> threshold_check_value = median_deviations*scaling_factor

threshold_check_value =

   0.750000000000000

>> abs(38.8-median_vals)

ans =

   9.849999999999998

>> abs(38.8-median_vals) > threshold_check_value

ans =

  logical

   1

>> 'since deviation is bigger than threshold_check_value, this point is an outlier'

Now, assuming you have added 38.8 to the buffer, this is what should happen:

>> vals = [28.9, 29, 29.6, 38.8];
>> median_vals = median(vals)

median_vals =

  29.300000000000001

>> deviations = abs(vals-median_vals)

deviations =

   0.400000000000002   0.300000000000001   0.300000000000001   9.499999999999996

>> median_deviations = median(deviations)

median_deviations =

   0.350000000000001

>> scaling_factor = 3;
>> threshold_check_value = median_deviations*scaling_factor

threshold_check_value =

   1.050000000000004

>> abs(29.9-median_vals)

ans =

   0.599999999999998

>> abs(29.9-median_vals) > threshold_check_value

ans =

  logical

   0

>> 'since deviation is bigger than threshold_check_value, this point is not an outlier'
JacoFourie commented 6 years ago

Hi. Thanks for the reply. Let me first understand how to use the lib. What I am doing is the following. I will add the temperature raw value to the buffer. There is a bug BTW. The median will be 0 until you have 4 elements in the buffer. I see that is fixed the the RunningMedia lib.

Then once I have have a value coming back in readMedian then before I add the value to the buffer with write I will do a check to see if it is an outlier. I don't want to add large spikes to my median. I am using RunningMedian now but because the sensor glitches from time to time it will screw the chart and data.

I want to detect outliers and if I have one read the data from the sensor again. I have it working now using another method but this seems to be a better solution if it works the way I want it to.

Here you can see my feed and those large spikes on the temp needs to be dropped out. Change the time scale at the top to see into the history.

https://demo.thingsboard.io/dashboards/fba6c7c0-f75d-11e7-abe9-1d8d2edf4f93?publicId=62079880-f6e8-11e7-abe9-1d8d2edf4f93

JacoFourie commented 6 years ago

OK so when I tested the lib I just added all the values to the buffer as they come in from the sensor and did a outlier check and expected all the values to not be outliers as they are all so close to each other. But some of them was shown as outliers. So I am confused as to why it would be.

JacoFourie commented 6 years ago

OK like now I am running the code with a buffer of 11 and a factor of 3. I have values like 27.74 27.72 27.71 27.70 27.66

and then 27.64 27.62 is seen as a outlier

and then from that point onward all is seen as outliers as I am not adding them to the buffer.

Only spikes should be seen as outliers. Do I then not understand what this lib does ?

JacoFourie commented 6 years ago

This is the code snippit


        Serial.print("Raw Temp value : ");
        Serial.println(temp_c); 

        //If we have a median then check for outliers before we add it to the buffer
        if (tempBuffer.readMedian() > 0){

            if(tempBuffer.checkIfOutlier(temp_c) == true){
                    Serial.print(temp_c);
                    Serial.println(" : IS a outlier");   
                    //Find a new value from the sensor and add it                 
            }else{
                    Serial.print(temp_c);
                    Serial.println(" : is NOT a outlier");
                    //Add the value to the buffer
                    tempBuffer.write(temp_c);
            }

        }else {

          //Add the value to the buffer
          tempBuffer.write(temp_c);
        }
flrs commented 6 years ago

Quick question: What is the default value (default_val) that you initialize the buffer with? The initialization command should be structured like this: HampelFilter(float i_default_val, uint8_t i_window_size, float i_scaling_factor)

What You Should Do

Let me explain how the filter works.

That might help you to understand my reasoning:

The buffer is a "sliding window", that is filled with a fixed amount of points in chronological order. From all points in the buffer, the filter will find the median. In comparison to the average, the median is not influenced by outliers, so it is more stable. You can imagine that if you take the median and your buffer is big enough, the point that is taken as the median is probably not an outlier. Checking the median will allow the filter to follow gradual changes in the "true" (=undistorted) value of the signal.

In a next step, the filter calculates all absolute values of differences between that median and all other points in the buffer. The filter then takes the median value of these differences, this is called MAD (median absolute difference). The goal here is to find out how much the values in the buffer are typically "spread out". This is important for the filter to adapt to kind of noisy but steady signals.

Now the filter can help you decide if a point is an outlier as follows: The filter calculates the difference of the value of the point to the median value of the buffer. Then it checks how that difference compares to the MAD. With the scaling parameter, you decide what the threshold is to decide if your point is an outlier. The filter will say that a point is an outlier if its distance to the median of the buffer is greater than the scaling factor times the MAD.

There you have it: A filter for outliers that adapts to the general noisiness of the signal and follows the general trend of the signal.

Why You Should Do It

Increase Buffer Size

Your buffer has a size of 10 and you add a value every second. Imagine that you have a very clean and slow changing signal and you are taking 10 samples into your buffer of essentially the same value. Any value that is different (for example when the measured temperature suddenly rises half a degree every 20 seconds) will naturally be identified as an outlier. But if you would increase your buffer size (= record over a longer time at the same sample rate), for example to capture 40 seconds worth of data, such small changes would be expected and new values would not be identified as outliers.

Pick a Larger Number for the Scaling Factor

Picking the right scaling factor (and even buffer size) depends a lot on the characteristics of your data. It works best by experiment. However, it seems that in your case the filter simply does not expect such a "big" difference from the MAD. You can fix that by using a bigger scaling factor. However, buffer size and scaling factor are interrelated, so you need to play around with them.

Include Outliers in the Buffer

This is the magic trick. Remember that all operations regarding the buffer in the filter use the median function? Then, it does not matter if there are outliers in the buffer, as long as the number of outliers in comparison to the number of "real" values in the buffer is relatively small. Here is the advantage that you get with including outliers: If your signal is noisy, but slowly changes its real value, then the filter will still be able to follow that trend. This is an essential function, because imagine that after a whole lot of outliers your temperature has actually changed and you have not added the outliers to the buffer. The filter will always identify new measurements as outliers, since it has no chance of seeing that something has actually changed. If it had the outliers included, there is at least still a chance that there is a median value in the filter that still represents the overall trend.

Notes About the Filter

Please note that technically, including outliers does not count as a real "Hampel" filter. In a real "Hampel" filter, outliers are replaced by the median value of the buffer. Since I did not have many outliers in data I worked with in the past, I went without that functionality. Below is a plot of data I worked with and used the filter in. It did a good job!

image

Let me know if that clarifies your issue.

JacoFourie commented 6 years ago

OK so I made the buffer 51 and set the start value to 27. The sensor is sending values back

28.20 28.21 28.22

It will say that 28.21 and 29.20 is an outlier but not 28.22. I have set the factor to 300. It should only show bigger values as outliers not 0.01 differences.

flrs commented 6 years ago
  1. Do you mean "It will say that 28.21 and 28.20 is an outlier but not 28.22"?

  2. Also, does that happen right after initializing the buffer? (At that point all except 1,2 or 3 values, respectively, in the buffer are 27).

JacoFourie commented 6 years ago

OK so here is the code and the output,

HampelFilter tempBuffer = HampelFilter(28.50 , 51 , 300);

    Serial.print("Raw Temp value : ");
    Serial.println(temp_c); 

    //Add the value to the buffer
    tempBuffer.write(temp_c);

    //If we have a median then check for outliers before we add it to the buffer
    if (tempBuffer.readMedian() > 0){

        if(tempBuffer.checkIfOutlier(temp_c) == true){
                Serial.print(temp_c);
                Serial.println(" : IS a outlier");   
                //Find a new value from the sensor and add it                 
        }else{
                Serial.print(temp_c);
                Serial.println(" : is NOT a outlier");
                //Add the value to the buffer
                tempBuffer.write(temp_c);
        }                     
    }

''' and here is the output from the start.

1

and here you can see it says the same value is an outlier and not an outlier. 2

Pity it seems this lib will not do what I need it to do as I can not figure out how to get it to increase the margin or tolerance before it will say the value is wrong.

JacoFourie commented 6 years ago

I need 33.5 and 23.5 to be outliers. Not 28.51 and 28.49.

flrs commented 6 years ago

The Filter is Working Correctly

Alright, I reproduced your code in a Matlab script, since I do not have my C++ gear at hand. The filter shows expected behavior and is therefore working correctly:

default_val = 28.5;
buffer_len = 51;
scaling_factor = 3000;
add_vals = [28.49 28.49 28.5 28.51 28.51 28.51 28.51 28.5 28.52 28.5 28.51 28.52 ...
    28.51 28.52 28.52 28.52 28.52 28.51 28.52 28.51 28.52 28.53];

buffer = ones([buffer_len 1])*default_val;

for val = 1:numel(add_vals)
    fprintf(['Raw Temp value : ', num2str(add_vals(val)), '\n']);
    cur_median = median(buffer);
    median_abs_devs = abs(buffer-cur_median);
    mad = median(median_abs_devs);

    cur_pt_deviation_to_median = abs(add_vals(val)-cur_median);
    if cur_pt_deviation_to_median > mad*scaling_factor*1.48
        fprintf([num2str(add_vals(val)), ' : IS a outlier\n']);
    else
        fprintf([num2str(add_vals(val)), ' : is NOT a outlier\n']);
    end

    buffer(2:end) = buffer(1:end-1);
    buffer(1) = add_vals(val);
end

produces:

Raw Temp value : 28.49
28.49 : IS a outlier
Raw Temp value : 28.49
28.49 : IS a outlier
Raw Temp value : 28.5
28.5 : is NOT a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.5
28.5 : is NOT a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.5
28.5 : is NOT a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.51
28.51 : IS a outlier
Raw Temp value : 28.52
28.52 : IS a outlier
Raw Temp value : 28.53
28.53 : IS a outlier

Problem: Differences are Very Small

What is happening in your scenario is that the differences between your measurement values are very small. With the resolution being relatively coarse (only two digits), the MAD goes to zero. Then, when you multiply that MAD with your scaling factor, you still get a threshold of zero difference to the median to identify outliers. This is not good.

Solution: Set a Lower Bound

Since you seem to know the physical nature of your problem (measuring a temperature within 1/100th of a degree), why don't you put some of that knowledge into your filter? I would set a hard-coded lower limit to identify points as outliers. For example, based on your experience it might be reasonable to assume that when you try to measure 28.51 degrees, the measurements that you record are within +/- 0.05 deg of that temperature. However, there is a flipside to this assumption: Any temperature measurement within +/- 0.05 deg of the temperature you are trying to measure would never get identified as outlier. You said outliers would be around 23 and 33, so the limit could even be higher than 0.05 deg if you want to play it safe.

Here is how I implemented it in Matlab:

default_val = 28.5;
buffer_len = 51;
scaling_factor = 3000;
add_vals = [28.49 28.49 28.5 28.51 28.51 28.51 28.51 28.5 28.52 28.5 ...
    28.51 28.52 28.51 28.52 28.52 28.52 28.52 28.51 28.52 28.51 28.52 ...
    28.53];

buffer = ones([buffer_len 1])*default_val;

for val = 1:numel(add_vals)
    fprintf(['Raw Temp value : ', num2str(add_vals(val)), '\n']);
    cur_median = median(buffer);
    median_abs_devs = abs(buffer-cur_median);
    mad = median(median_abs_devs);

    cur_pt_deviation_to_median = abs(add_vals(val)-cur_median);
    if cur_pt_deviation_to_median > mad*scaling_factor*1.48 ...
            && cur_pt_deviation_to_median > 0.05 % <-- here!
        fprintf([num2str(add_vals(val)), ' : IS a outlier\n']);
    else
        fprintf([num2str(add_vals(val)), ' : is NOT a outlier\n']);
    end

    buffer(2:end) = buffer(1:end-1);
    buffer(1) = add_vals(val);
end

In your code, I think it should be something like this (untested):

    Serial.print("Raw Temp value : ");
    Serial.println(temp_c); 

    //Add the value to the buffer
    tempBuffer.write(temp_c);

    //If we have a median then check for outliers before we add it to the buffer
    if (tempBuffer.readMedian() > 0){

        if(tempBuffer.readMAD() > 0.05 && tempBuffer.checkIfOutlier(temp_c) == true){
                Serial.print(temp_c);
                Serial.println(" : IS a outlier");   
                //Find a new value from the sensor and add it                 
        }else{
                Serial.print(temp_c);
                Serial.println(" : is NOT a outlier");
                //Add the value to the buffer
                //tempBuffer.write(temp_c); (value already added above)
        }                     
    }

If this solution does not work for you and you want to identify smaller differences as outliers, you would have to find a way to increase the resolution of the library. This can be done for example by getting rid of decades in favor of another decimal in your measurement (85.24 degrees instead of 28.52).

flrs commented 6 years ago

Also, check if a point is an outlier before you add it to the buffer, like that:

    Serial.print("Raw Temp value : ");
    Serial.println(temp_c); 

    if (tempBuffer.readMedian() > 0){
        if(tempBuffer.readMAD() > 0.05 && tempBuffer.checkIfOutlier(temp_c) == true){
                Serial.print(temp_c);
                Serial.println(" : IS a outlier");     
        }else{
                Serial.print(temp_c);
                Serial.println(" : is NOT a outlier");
        }                     
    }

    //Add the value to the buffer
    tempBuffer.write(temp_c);
JacoFourie commented 6 years ago

Thanks for all the info. I will play with it and see.

I got what I needed by using the RunningMedian lib for now using this code.

//If the buffer has more than 5 values in
        if(s_temp.getCount() >= 5){

           //Check if the temp values is in range

           //If we have a spike
           if(temp_c > (s_temp.getHighest()  + 10 )){

               Serial.println("We have a spike");
               sensor_loop = 0;
              //Read new values
              //Sit in a loop until we get a valid value or after 10 times
              while( temp_c > (s_temp.getHighest()  + 10 ) && sensor_loop < 10) {
                  humidity = sht1x.readHumidity();
                  temp_c = sht1x.retrieveTemperatureC();
                  sensor_loop++;
                  Serial.print("Loop count : ");
                  Serial.println(sensor_loop);
              }
              //If we did not get a valid value then so nothing
               Serial.print("Raw Temp value : ");
               Serial.println(temp_c);           
           }else {

               //The value is in range
                s_rh.add(humidity);
                s_temp.add(temp_c);

           }

        }else {

          s_rh.add(humidity);
          s_temp.add(temp_c); 

        }

But I still need to find a way to figure out if a sensor has gone faulty. My idea was to use this lib and compare 2 sensors. There values should be around the same values. And if one starts to get a lot of outliers then something is wrong with it. Or something along those lines.

JacoFourie commented 6 years ago

OK cool. That seemed to do the trick. I will let it run and see but I tested it by holding the probe so the temp can climb and all was seen as valid values. Then I caused the spike by disrupting the data line and my code handled it. When it got out of the loop your lib also saw it was an outlier as I expected it to be. So that 0.05 can be the factor that I use to create the band of upper and lower limit to catch the bad data. Do I still need 50 data points and the factor of 300 when I construct the lib object ?

3

JacoFourie commented 6 years ago

I will add the code and let it run during a new drying process. The probe looses comms from time to time. And then it will send back a value of 600+ . Becuase I was using the runningmidian it smoothed the data but it still makes a spike. as you can see here. I work out the wet bulb temp using a formula that requires the air temperature and RH value. Because the temp is spiking the wet bulb is now also doing the same.

4

New the spikes should be gone and the true values will be shown on the dashboard.

flrs commented 6 years ago

But I still need to find a way to figure out if a sensor has gone faulty. My idea was to use this lib and compare 2 sensors. There values should be around the same values. And if one starts to get a lot of outliers then something is wrong with it. Or something along those lines.

To find out if a sensor is faulty, comparing two sensors seems like a reasonable approach to me.

How are the measurements looking now? I hope you got rid of the spikes with this library. Can we close the issue?

JacoFourie commented 6 years ago

Hi. Sorry for the late reply.

I tested it more but still can not get the range correct using the method you stated.

Here is the code I used to test.

HampelFilter tempBuffer = HampelFilter(0 , 11 , 300);

tempBuffer.write(28.77);
  tempBuffer.write(28.75);
  tempBuffer.write(28.76);
  tempBuffer.write(28.78);
  tempBuffer.write(28.77);
  tempBuffer.write(28.75);
  tempBuffer.write(28.80);
  tempBuffer.write(28.71);
  tempBuffer.write(28.72);
  tempBuffer.write(28.73);
  tempBuffer.write(28.74);
  tempBuffer.write(28.75);
  tempBuffer.write(28.76);
  tempBuffer.write(28.77);
  tempBuffer.write(28.78);
  tempBuffer.write(28.79);
  tempBuffer.write(28.77);
  tempBuffer.write(28.78);
  tempBuffer.write(28.77);
  tempBuffer.write(28.76);
  tempBuffer.write(28.75);
  tempBuffer.write(28.74);
  tempBuffer.write(28.73);
  tempBuffer.write(28.72);
  tempBuffer.write(28.73);
  tempBuffer.write(28.74);
  tempBuffer.write(28.79);
  tempBuffer.write(28.80);
  tempBuffer.write(28.60);
  tempBuffer.write(28.74);

void check_if_outlier(String cmd) {
  int cmdLen = cmd.length();
  String value = cmd.substring(2, cmdLen - 2);
  check_value = value.toFloat();  

  Serial.print("Median is " );
  Serial.println(tempBuffer.readMedian());

  if(tempBuffer.readMAD() > 0.05 && tempBuffer.checkIfOutlier(check_value) == true){
     Serial.print("MAD is " );
     Serial.println(tempBuffer.readMAD());
     Serial.print(check_value);
     Serial.println(" : IS a outlier");   
                    //Find a new value from the sensor and add it                 
     }else{
     Serial.print("MAD is " );
     Serial.println(tempBuffer.readMAD()); 
     Serial.print(check_value);
     Serial.println(" : is NOT a outlier");              
     } 

}

and then if I input these values I dont get what I expect to see. The values I enter is more that 5 from the median.

Median is 28.74 MAD is 0.01 28.00 : is NOT a outlier Median is 28.74 MAD is 0.01 35.00 : is NOT a outlier Median is 28.74 MAD is 0.01 25.00 : is NOT a outlier

flrs commented 6 years ago

Alright, I made an error translating my Matlab code into C++ in my previous answer. Try this method (untested):

    Serial.print("Raw Temp value : ");
    Serial.println(temp_c); 

    if (tempBuffer.readMedian() > 0){
        if(abs(tempBuffer.readMedian()-temp_c) > 0.05 && tempBuffer.checkIfOutlier(temp_c) == true){
                Serial.print(temp_c);
                Serial.println(" : IS a outlier");     
        }else{
                Serial.print(temp_c);
                Serial.println(" : is NOT a outlier");
        }                     
    }

    //Add the value to the buffer
    tempBuffer.write(temp_c);
JacoFourie commented 6 years ago

OK so this does what I need it to do. Working with just the median. To see if the value is outside of the band of 5 degrees up or down. I will run 2 running medians and test each sensor with the other to see if it is out of its own band or the other one.

if(abs(tempBuffer.readMedian() - check_value) > 5){     
     Serial.print(check_value);
     Serial.println(" : IS a outlier");   
                    //Find a new value from the sensor and add it                 
     }else{     
     Serial.print(check_value);
     Serial.println(" : is NOT a outlier");              
     }

You can close the issue.

Thanks for the help.

Regards.

flrs commented 6 years ago

Awesome! I am glad we figured that out and you can use (at least part of) the library. All the best with your project!