biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

Add GESD to outliers #6646

Closed belg4mit closed 8 months ago

belg4mit commented 9 months ago

Please add support for the generalized ESD (extreme Studentized deviate) test, a robust and powerful unsupervised algorithm for identifying outliers. The algorithm is easy to understand but seemingly underused because few tools make it readily available.

markotoplak commented 8 months ago

If I understand correctly, GESD is a univariate technique. In contrast, Orange's Outliers widget is, like most widgets in Orange, multivariate. So, if we put GESD into the Outliers widget, the user would also need to select the specific feature to test. Otherwise, all the nice statistical properties of GESD would go out the window.

This makes the technique quite limited. I understand it can be useful, and it is easy to understand, but do you have an example where that specific technique would yield much more useful results than any of the others in the Outliers widget?

We can't put everything in Orange, both because (1) we don't have the resources and (2) having too much would clutter the user interface. Therefore, I'd suggest that anyone who wishes to implement it do so in the Prototypes add-on; if the technique proves generally useful, we can then merge it into the main repository.

belg4mit commented 8 months ago

Yes, an example would be processing a data set such as this Air-Source Heat Pump Residential Projects Database. I would like to use GESD on the dependent variable to throw out projects with abnormally high or low costs, rather than having to throw away a fixed percentage. The other parameters (capacity, efficiency) are easy to check against reasonable ranges using domain expertise. Multivariate outlier detection might pick up on "unreasonable" costs for projects of a certain capacity or efficiency, but since this data set is restricted to a particular range of capacities and efficiencies, that is not the most important path for cleaning.

I'm sure there are other instances where this could be useful beyond this example, which may seem overly narrow; I'm simply trying to fully outline the use case I envisaged: trimming records based on the target column before passing the data on. Later steps such as learners could still use their default cleaning, or one could explicitly apply other Preprocessing as desired.
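
For illustration, the use case above (flag extreme values in a single target column, with no fixed percentage to discard) could be sketched roughly as follows. This is a minimal sketch of the generalized ESD test written with NumPy and SciPy, not code from Orange or the Prototypes add-on. The function name `gesd_outliers`, its parameters, and the cost values are hypothetical and made up for the example. The test also assumes the non-outlying values are approximately normally distributed.

```python
import numpy as np
from scipy import stats


def gesd_outliers(values, max_outliers, alpha=0.05):
    """Return sorted indices flagged by the generalized ESD test.

    Hypothetical helper for illustration; not part of Orange.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    if not 1 <= max_outliers <= n - 2:
        raise ValueError("max_outliers must be between 1 and n - 2")

    remaining = np.arange(n)   # indices still in the sample
    removed = []               # indices in order of removal
    n_outliers = 0

    for i in range(1, max_outliers + 1):
        current = x[remaining]
        sd = current.std(ddof=1)
        if sd == 0:            # all remaining values identical
            break
        deviations = np.abs(current - current.mean())
        j = int(np.argmax(deviations))
        r_i = deviations[j] / sd          # test statistic R_i

        removed.append(int(remaining[j]))
        remaining = np.delete(remaining, j)

        # Critical value lambda_i from the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))

        # The number of outliers is the largest i with R_i > lambda_i.
        if r_i > lam:
            n_outliers = i

    return sorted(removed[:n_outliers])


# Made-up project costs with one implausibly high value at index 10.
costs = np.array([10.0, 10.1, 9.9, 10.2, 9.8, 10.0,
                  10.1, 9.9, 10.05, 9.95, 50.0])
flagged = gesd_outliers(costs, max_outliers=3)
kept = np.delete(costs, flagged)   # records passed on to later steps
```

Only `max_outliers` (an upper bound) and `alpha` need to be chosen; the number of records actually dropped is decided by the test, which is the difference from trimming a fixed percentage.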