Filtering out outliers in data

behnam-zakeri commented 2 years ago

It would be nice to have a feature or method for excluding outliers from timeseries data. This could be implemented, for example, as a new option under pyam.IamDataFrame().validate(). The outliers could be identified in either of two ways:

Based on the z-score (a multiple of the standard deviation, SD), for example 3 times SD:

    iam = iam.validate({"Price|Carbon": {"outlier": "3SD"}}, exclude_on_fail=True)

The way this is calculated in Python can be as follows (df is a pandas.DataFrame):

    df = df[(df - df.mean()).abs() <= (3 * df.std())]

Based on explicit percentiles:

    iam = iam.validate({"Price|Carbon": {"outlier": "[0.03, 0.98]"}}, exclude_on_fail=True)

There are some suggestions on how to do this here: https://stackoverflow.com/questions/35827863/remove-outliers-in-pandas-dataframe-using-percentiles
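For illustration, the two criteria could be sketched in plain pandas roughly as follows. This is a minimal sketch, not pyam code: the helper names `within_n_sd` and `within_quantiles` and the sample prices are made up for this example, and the `"outlier"` option shown above does not exist in `validate()` yet. Note that applying the one-line z-score filter to a whole DataFrame leaves NaN in place of the outlying values rather than dropping rows, so the sketch works on a single Series and returns a boolean mask instead.

```python
import pandas as pd


def within_n_sd(series: pd.Series, n_sd: float = 3.0) -> pd.Series:
    """Hypothetical helper: True where a value lies within n_sd sample SDs of the mean."""
    return (series - series.mean()).abs() <= n_sd * series.std()


def within_quantiles(series: pd.Series, lower: float = 0.03, upper: float = 0.98) -> pd.Series:
    """Hypothetical helper: True where a value lies between the two quantiles (inclusive)."""
    low, high = series.quantile([lower, upper])
    return series.between(low, high)


# Made-up carbon prices: 19 plausible values and one obvious outlier.
prices = pd.Series([float(v) for v in range(20, 39)] + [1000.0])

z_mask = within_n_sd(prices)       # flags only the value 1000.0
q_mask = within_quantiles(prices)  # flags the lowest value and 1000.0

filtered_by_z = prices[z_mask]
filtered_by_quantile = prices[q_mask]
```

With very few data points the z-score criterion can never trigger, because a single value cannot lie more than (n - 1) / sqrt(n) sample standard deviations from the mean of n values. The percentile criterion, by contrast, always removes a share of the data, whether or not those values are implausible.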

danielhuppmann commented 2 years ago

Sounds like a great idea, thanks @behnam-zakeri!

One minor comment: I would start this as a new method, maybe df.validate_outliers()...
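As a rough idea of what such a standalone method might do, here is a sketch on a long-format pandas.DataFrame. Everything in it is an assumption made for illustration: the function name follows the suggestion above, while the signature, the per-year grouping, the column names and the convention of returning the offending rows (or None if there are none) are guesses, not an existing pyam API.

```python
import pandas as pd


def validate_outliers(data: pd.DataFrame, variable: str, n_sd: float = 3.0):
    """Hypothetical sketch: return rows of `variable` lying more than n_sd
    sample standard deviations from the mean of their year, or None."""
    rows = data[data["variable"] == variable]
    grouped = rows.groupby("year")["value"]
    # Compare each value against the mean and SD across all rows of the same year.
    deviation = (rows["value"] - grouped.transform("mean")).abs()
    outliers = rows[deviation > n_sd * grouped.transform("std")]
    return None if outliers.empty else outliers


# Made-up data: 20 scenarios reporting a carbon price for 2030, one far off.
data = pd.DataFrame(
    {
        "model": "model_a",
        "scenario": [f"scen_{i}" for i in range(20)],
        "region": "World",
        "variable": "Price|Carbon",
        "unit": "USD/tCO2",
        "year": 2030,
        "value": [float(v) for v in range(20, 39)] + [1000.0],
    }
)

flagged = validate_outliers(data, "Price|Carbon")
```

Keeping this separate from validate() would leave the existing criteria untouched; an exclude_on_fail option could then be layered on top by marking the scenarios that appear in the returned rows.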

danielhuppmann commented 1 year ago

For future reference: #715 added a new require_data() method and #686 added a compute.quantile() method. These two methods could be useful starting points for implementing this feature.

IAMconsortium / pyam

Filtering out outliers in data #629