DK96-OS / MathTools

Mathematical Software Components. This library is actively maintained, and aims to stay updated. New feature proposals are welcome, but may not be included.
Apache License 2.0

Create Statistical Outliers Functions #32

Closed: DK96-OS closed this issue 2 years ago

DK96-OS commented 2 years ago

Create static library functions for dealing with outliers.

Define an Outlier Policy that contains the options for outlier removal.

Make the policy extensible so that different techniques for identifying outliers can be used.
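One way to picture an extensible policy is a small base class that holds the removal options and leaves the detection technique to subclasses. This is only a language-neutral sketch in Python; `OutlierPolicy`, `StdDevPolicy`, `max_removals`, and `k` are hypothetical names chosen for illustration, not the library's API.

```python
from abc import ABC, abstractmethod
from statistics import fmean, pstdev
from typing import Sequence


class OutlierPolicy(ABC):
    """Hypothetical base class: stores removal options, defers detection to subclasses."""

    def __init__(self, max_removals: int = 1) -> None:
        # Option: upper bound on how many values may be removed.
        self.max_removals = max_removals

    @abstractmethod
    def is_outlier(self, value: float, data: Sequence[float]) -> bool:
        """Return True if `value` is an outlier relative to `data`."""


class StdDevPolicy(OutlierPolicy):
    """One possible technique: flag values more than `k` SDs from the mean."""

    def __init__(self, k: float = 3.0, max_removals: int = 1) -> None:
        super().__init__(max_removals)
        self.k = k

    def is_outlier(self, value: float, data: Sequence[float]) -> bool:
        mean = fmean(data)
        sd = pstdev(data)  # population standard deviation
        # With zero spread every value equals the mean, so nothing is flagged.
        return sd > 0 and abs(value - mean) > self.k * sd
```

A different technique, such as one based on a confidence interval, would be another subclass that overrides `is_outlier` without changing the callers.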

DK96-OS commented 2 years ago

There are at least three key Outlier methods:

  1. Find and remove at most n outliers from a Mutable List
  2. Identify (by index) the outliers in an unsorted Array
  3. Determine whether the highest element in a sorted Array/List is an outlier

These methods require a definition and technique for determining what an outlier is. This may be specified by the number of Standard Deviations, or a confidence interval.
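As a rough illustration of the second method, the sketch below returns the indices of values that lie more than a given number of Standard Deviations from the mean of an unsorted sequence. The function name `outlier_indices` and the parameter `max_deviations` are assumptions made for this example, and the use of the population SD is just one possible choice.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def outlier_indices(data: Sequence[float], max_deviations: float = 3.0) -> List[int]:
    """Return indices of values farther than `max_deviations` SDs from the mean."""
    if len(data) < 2:
        return []
    mean = fmean(data)
    sd = pstdev(data)  # population SD; the sample SD would also be a valid choice
    if sd == 0.0:
        return []  # all values are identical, so none can be an outlier
    return [i for i, x in enumerate(data) if abs(x - mean) > max_deviations * sd]
```

The input is left untouched and unsorted, so the returned indices refer directly to positions in the caller's array.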

There may be an option for including/excluding the outlier in the calculations.
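The first method and the include/exclude option can be sketched together. The version below repeatedly picks the value farthest from the mean and removes it if it exceeds the limit, stopping after at most `max_removals` removals. When `exclude_candidate` is set, the candidate is left out of the mean and SD it is judged against. All names here are hypothetical, and this is one possible approach rather than the library's implementation.

```python
from statistics import fmean, pstdev
from typing import List


def remove_outliers(data: List[float], max_removals: int,
                    max_deviations: float = 3.0,
                    exclude_candidate: bool = False) -> List[float]:
    """Remove up to `max_removals` outliers from `data` in place; return them."""
    removed: List[float] = []
    while len(removed) < max_removals and len(data) >= 3:
        mean_all = fmean(data)
        # Candidate: the value farthest from the current mean.
        idx = max(range(len(data)), key=lambda i: abs(data[i] - mean_all))
        # Optionally judge the candidate against statistics that exclude it.
        basis = data[:idx] + data[idx + 1:] if exclude_candidate else data
        mean, sd = fmean(basis), pstdev(basis)
        if abs(data[idx] - mean) <= max_deviations * sd:
            break  # the most extreme value is within the limit, so stop
        removed.append(data.pop(idx))
    return removed
```

The option matters in small samples: a single extreme value inflates the SD it is measured against, so including it in the calculation can hide it. With `[1, 2, 3, 2, 1, 2, 3, 50]` and a limit of 3 SD, the value 50 is removed only when `exclude_candidate=True`.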

Edit: The third method is not important to the DeviationPolicy, since anyone with basic knowledge of statistics can use the DistributionCharacteristics to write a one-line check on the last element.
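For reference, the kind of one-line check meant here might look like the following. The variables `mean` and `std_dev` stand in for whatever values DistributionCharacteristics provides (its actual fields are not shown in this thread), and the limit of 2 SD is an arbitrary choice for the example.

```python
from statistics import fmean, pstdev

sorted_data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 40.0]

# Stand-ins for the values DistributionCharacteristics is assumed to provide.
mean = fmean(sorted_data)
std_dev = pstdev(sorted_data)

# The one-line check: is the highest (last) element more than 2 SDs above the mean?
last_is_outlier = sorted_data[-1] > mean + 2.0 * std_dev
```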

DK96-OS commented 2 years ago

It may be possible to use generics to combine all of the List types into one method signature. This would be ideal from a user perspective; however, the performance impact should be considered at a later date.

Edit: It is not actually practical to use generics. There are many obstacles, and the set of reasonable workarounds has been exhausted.

DK96-OS commented 2 years ago

Failed Tests:

  1. Off-by-one error in the medium-sized list test: the limit for the list data was calculated incorrectly.
  2. An NPE in the max value test: the test expected a limit just below Double.MAX_VALUE, but subtracting a small value from MAX_VALUE had no effect.
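The second failure is consistent with ordinary floating-point rounding: near the top of the double range, adjacent representable values are roughly 2e292 apart, so subtracting a small number rounds straight back to the maximum. The snippet below demonstrates this in Python, whose floats are the same IEEE 754 doubles; it does not reproduce the NPE itself, only the subtraction that had no effect.

```python
import math
import sys

max_value = sys.float_info.max  # the same IEEE 754 value as Double.MAX_VALUE on the JVM

# A small subtraction is rounded away entirely.
unchanged = max_value - 1.0

# The gap between adjacent doubles at this magnitude (2**971).
gap = math.ulp(max_value)

# Stepping to the adjacent representable double does give a smaller limit.
just_below = math.nextafter(max_value, 0.0)  # requires Python 3.9+
```

Here `unchanged == max_value` holds, while `just_below < max_value`. On the JVM, `Math.nextDown` plays the same role as `math.nextafter` does here.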

DK96-OS commented 2 years ago

The NumberListType testing function runOnAllLists needs to be extracted into a public object, located in the test source directories, that provides the function.

DK96-OS commented 2 years ago

A DeviationPolicy would generally be applied to a specific set of distributions, for which there is a set of expected values.

This set may represent a series of measurements, for which there is a physical (or digital) lower bound, a small expected range of values, and rare large values. In that case, the ideal outlier policy looks only for outliers far above the range of expected values and ignores lower values, even if they are over 6 Standard Deviations (SD) below the mean.
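A one-sided policy of this kind can be sketched by comparing values against an upper limit only. The function name `upper_outlier_indices` and the parameter `max_deviations` are hypothetical, and the SD-based limit is just one way to define "much greater than the expected range".

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def upper_outlier_indices(data: Sequence[float], max_deviations: float = 3.0) -> List[int]:
    """Flag only values far above the mean; low values are never flagged."""
    if len(data) < 2:
        return []
    upper_limit = fmean(data) + max_deviations * pstdev(data)
    return [i for i, x in enumerate(data) if x > upper_limit]
```

For twenty measurements of 100.0 plus a single 700.0, the last index is reported. Replacing the 700.0 with -500.0 reports nothing, even though that value is equally far from the rest.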

DK96-OS commented 2 years ago

Should DeviationPolicy maintain an instance of DistributionCharacteristics?

DK96-OS commented 2 years ago

This branch needs to be merged soon. There are important project structure modifications to be made. Anything too time-consuming to resolve will become a new issue.

DK96-OS commented 2 years ago

The last thing to do before merging: