elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Source code of KMeansMinusMinusOutlierDetection #94

Closed BraulioSanchez closed 2 years ago

BraulioSanchez commented 3 years ago

Source code of KMeansMinusMinusOutlierDetection, which identifies noise points as outliers. The noise flag was removed from KMeansMinusMinus to stay with the proposal in the original publication.

codecov[bot] commented 3 years ago

Codecov Report

Merging #94 (255127c) into master (2af96da) will decrease coverage by 1.00%. The diff coverage is 96.87%.

:exclamation: Current head 255127c differs from pull request most recent head b83dde7. Consider uploading reports for the commit b83dde7 to get more accurate results

@@             Coverage Diff              @@
##             master      #94      +/-   ##
============================================
- Coverage     51.88%   50.88%   -1.01%     
+ Complexity    12563    12110     -453     
============================================
  Files          1808     1727      -81     
  Lines         90538    86216    -4322     
  Branches      16726    15869     -857     
============================================
- Hits          46977    43870    -3107     
+ Misses        39172    38225     -947     
+ Partials       4389     4121     -268     
| Impacted Files | Coverage Δ |
|---|---|
| .../java/elki/clustering/kmeans/KMeansMinusMinus.java | 91.66% <83.33%> (+3.00%) :arrow_up: |
| ...r/clustering/KMeansMinusMinusOutlierDetection.java | 100.00% <100.00%> (ø) |
| .../utilities/datastructures/iterator/FilteredIt.java | 0.00% <0.00%> (-65.00%) :arrow_down: |
| ...rc/main/java/elki/data/model/CoreObjectsModel.java | 0.00% <0.00%> (-40.00%) :arrow_down: |
| ...ionhandling/parameterization/Parameterization.java | 58.33% <0.00%> (-33.34%) :arrow_down: |
| ...lities/datastructures/unionfind/UnionFindUtil.java | 0.00% <0.00%> (-33.34%) :arrow_down: |
| ...asource/filter/AbstractStreamConversionFilter.java | 72.72% <0.00%> (-22.73%) :arrow_down: |
| ...java/elki/database/ids/integer/IntegerDBIDVar.java | 29.16% <0.00%> (-14.59%) :arrow_down: |
| ...ical/extraction/SimplifiedHierarchyExtraction.java | 77.30% <0.00%> (-10.63%) :arrow_down: |
| ...i/utilities/optionhandling/ParameterException.java | 63.15% <0.00%> (-10.53%) :arrow_down: |

... and 219 more


kno10 commented 3 years ago

I'd rather keep the noise flag option, but make the default behavior follow the original publication. For many clustering evaluation cases, it will be necessary to assign the noise points to the nearest cluster.

kno10 commented 3 years ago

As for the current code, I wonder whether we should solve this differently. Right now, the code produces a binary outlier label, which is 1 exactly when an object is in a noise cluster. We could write a "NoiseAsOutliers" class that performs this transformation and works with k-means-- as well as DBSCAN. But it would likely be more in line with k-means--, which ranks objects by their distance to the nearest cluster center, to produce a score based on that distance, i.e., use KMeansOutlierDetection with k-means-- and assign "noise" points to the nearest cluster (i.e., without kmeansmm.noisecluster). This would also allow comparing regular k-means and k-means-- consistently.
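
To make the two options concrete, here is a minimal standalone sketch of the difference between a binary noise label and a distance-based score. It does not use any ELKI classes: `NoiseScoringSketch` and all of its methods are hypothetical names chosen for illustration, Euclidean distance is an assumption, and the cluster centers and noise flags are taken as given rather than computed by k-means--.

```java
import java.util.Arrays;

/** Illustrative sketch only; hypothetical names, not part of the ELKI API. */
final class NoiseScoringSketch {
  /** Euclidean distance between two vectors of equal length. */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Index of the center closest to p (ties go to the lower index). */
  static int nearestCenter(double[] p, double[][] centers) {
    int best = 0;
    double bestDist = distance(p, centers[0]);
    for (int c = 1; c < centers.length; c++) {
      double d = distance(p, centers[c]);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  /** Option 1: binary label, 1.0 for points in the noise cluster, else 0.0. */
  static double[] binaryScores(boolean[] isNoise) {
    double[] scores = new double[isNoise.length];
    for (int i = 0; i < isNoise.length; i++) {
      scores[i] = isNoise[i] ? 1.0 : 0.0;
    }
    return scores;
  }

  /** Option 2: score every point by its distance to the nearest center. */
  static double[] distanceScores(double[][] points, double[][] centers) {
    double[] scores = new double[points.length];
    for (int i = 0; i < points.length; i++) {
      scores[i] = distance(points[i], centers[nearestCenter(points[i], centers)]);
    }
    return scores;
  }

  public static void main(String[] args) {
    double[][] centers = {{0.0, 0.0}, {10.0, 0.0}};
    double[][] points = {{1.0, 0.0}, {9.0, 0.0}, {4.0, 5.0}};
    boolean[] isNoise = {false, false, true}; // assumed output of a noise-aware clustering
    System.out.println(Arrays.toString(binaryScores(isNoise)));
    System.out.println(Arrays.toString(distanceScores(points, centers)));
  }
}
```

The binary variant only separates noise from non-noise, while the distance variant yields a ranking over all points and needs nothing but the cluster centers, so the same scoring applies whether the centers come from regular k-means or from k-means--.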