JosephTLyons / outliers

A Rust crate used to identify outliers in a data set
GNU General Public License v3.0

Using this lib #14

Closed virtualritz closed 3 years ago

virtualritz commented 3 years ago

I have this sequence:

[
      0.00843769684433937, 
      0.008437675423920155,
      0.03507973998785019,
      0.008437675423920155,
      0.00843769684433937,
      0.035079777240753174,
      0.008437695913016796,
      0.008437679149210453,
      0.03507973626255989,
      0.00843769870698452,
      0.008437675423920155
]

My lower and upper outliers are empty for this sequence. What am I missing here?

JosephTLyons commented 3 years ago

If there are no values in the lower or upper outlier lists, it means that no values in the input list are considered outliers at the current k_value setting (so long as there isn't a bug in the crate). The default k_value is 1.5, which is the standard multiplier when calculating outliers, but changing it with with_k_value() will cause the algorithm to identify more or fewer numbers as outliers.


As a sanity check to make sure the crate isn't bugged, I double-checked your values using this online outlier calculator, and it also identifies no outliers in that set:

[Screenshot: result from the online outlier calculator]

Using the default k_value, I got the same results as you did:

fn main() {
    let data = [
        0.00843769684433937,
        0.008437675423920155,
        0.03507973998785019,
        0.008437675423920155,
        0.00843769684433937,
        0.035079777240753174,
        0.008437695913016796,
        0.008437679149210453,
        0.03507973626255989,
        0.00843769870698452,
        0.008437675423920155,
    ]
    .to_vec();

    let outlier_identifier = outliers::OutlierIdentifier::new(data, false);
    let results_tuple = outlier_identifier.get_outliers().unwrap();

    println!("Lower outliers");
    println!("==============");

    for number in results_tuple.0 {
        println!("{}", number);
    }

    println!();

    println!("Non-outliers");
    println!("============");

    for number in results_tuple.1 {
        println!("{}", number);
    }

    println!();

    println!("Upper outliers");
    println!("==============");

    for number in results_tuple.2 {
        println!("{}", number);
    }
}
Lower outliers
==============

Non-outliers
============
0.008437675423920155
0.008437675423920155
0.008437675423920155
0.008437679149210453
0.008437695913016796
0.00843769684433937
0.00843769684433937
0.00843769870698452
0.03507973626255989
0.03507973998785019
0.035079777240753174

Upper outliers
==============

Using a lower k_value of 0.1 caused the algorithm to catch some upper outliers in the data set:

// Same code as above, but with this one-line change
let outlier_identifier = outliers::OutlierIdentifier::new(data, false).with_k_value(0.1);
Lower outliers
==============

Non-outliers
============
0.008437675423920155
0.008437675423920155
0.008437675423920155
0.008437679149210453
0.008437695913016796
0.00843769684433937
0.00843769684433937
0.00843769870698452

Upper outliers
==============
0.03507973626255989
0.03507973998785019
0.035079777240753174

Hope this is helpful; let me know if you have any other questions. Also, if you find any bugs while using this crate, please let me know or open a PR :)

virtualritz commented 3 years ago

I do not get it. The outliers in this set are over four times bigger than the rest of the values.

In the sample in the docs the outlier (22) is barely two times bigger than the rest of the values and gets caught. And no special k value was specified.

virtualritz commented 3 years ago

I.e., I thought this crate detects outliers by how different they are, regardless of the overall scale of the values.

virtualritz commented 3 years ago

I looked at the source code. The problem is the sorting. When you sort the values the three upper outliers are clustered at the end and are not outliers any more since they are very close. When the data is unsorted they stick out easily (as you can see with your own eyes when scanning the original set).

JosephTLyons commented 3 years ago

This crate implements one of the most common statistical operations for finding outliers. Using medians of the sorted data set, it finds the first and third quartiles and calculates the interquartile range (the difference between them), which it uses to define "fences" in the data set. When data points lie outside of those "fences," they are considered outliers. The value of 1.5 is the multiplier that the famous mathematician John Tukey used for defining his "fences" in this outlier detection algorithm.
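To make the fence calculation concrete, here is a minimal standalone sketch applied to the sequence from this issue. It does not use the crate. The helper names (`quantile`, `tukey_fences`, `count_upper_outliers`) are hypothetical, and the quartile rule (linear interpolation between neighbouring sorted values) is an assumption that may differ from what the crate does internally, although it gives the same outlier counts as the output shown earlier in this thread. It also suggests why nothing is flagged at the default setting: three of the eleven values are large, which is more than a quarter of the data, so under this rule the third quartile lands between the two clusters and the interquartile range becomes wide.

```rust
// The sequence from this issue.
const DATA: [f64; 11] = [
    0.00843769684433937,
    0.008437675423920155,
    0.03507973998785019,
    0.008437675423920155,
    0.00843769684433937,
    0.035079777240753174,
    0.008437695913016796,
    0.008437679149210453,
    0.03507973626255989,
    0.00843769870698452,
    0.008437675423920155,
];

/// Hypothetical helper: quantile of an already sorted, non-empty slice,
/// using linear interpolation between neighbours (an assumed rule).
fn quantile(sorted: &[f64], q: f64) -> f64 {
    assert!(!sorted.is_empty(), "quantile needs at least one value");
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Hypothetical helper: returns (lower_fence, upper_fence) for multiplier k.
fn tukey_fences(data: &[f64], k: f64) -> (f64, f64) {
    // Quartiles are defined on sorted data, so input order does not matter.
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25);
    let q3 = quantile(&sorted, 0.75);
    let iqr = q3 - q1;
    (q1 - k * iqr, q3 + k * iqr)
}

/// Hypothetical helper: counts the values above the upper fence.
fn count_upper_outliers(data: &[f64], k: f64) -> usize {
    let (_, upper) = tukey_fences(data, k);
    data.iter().filter(|&&x| x > upper).count()
}

fn main() {
    for k in [1.5, 0.1] {
        let (lower, upper) = tukey_fences(&DATA, k);
        println!(
            "k = {}: fences = ({}, {}), upper outliers = {}",
            k,
            lower,
            upper,
            count_upper_outliers(&DATA, k)
        );
    }
}
```

With k = 1.5 the upper fence sits above all three large values, so none are reported; with k = 0.1 the fence drops below them and all three are counted.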

This crate was designed to use this commonly accepted way of detecting outliers, so unfortunately, if you want something less typical, maybe something that identifies outliers based on the average of the data set rather than the median, then you will likely have to look elsewhere for now (I plan to add other types of outlier detection in the future - https://github.com/JosephTLyons/outliers/issues/1).