How to fix: No 'by label' reference outlier found, which is needed for weighting!

elki-project / elki

ELKI Data Mining Toolkit

https://elki-project.github.io/

GNU Affero General Public License v3.0


How to fix: No 'by label' reference outlier found, which is needed for weighting! #70

Closed CheyenneForbes closed 4 years ago

CheyenneForbes commented 4 years ago

I'm trying to visualize an R-tree, but I am getting an error:

Task failed
de.lmu.ifi.dbs.elki.utilities.exceptions.AbortException: No 'by label' reference outlier found, which is needed for weighting!
    at de.lmu.ifi.dbs.elki.application.greedyensemble.VisualizePairwiseGainMatrix.run(VisualizePairwiseGainMatrix.java:140)
    at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$2.doInBackground(MiniGUI.java:600)
    at de.lmu.ifi.dbs.elki.gui.minigui.MiniGUI$2.doInBackground(MiniGUI.java:591)
    at javax.swing.SwingWorker$1.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at javax.swing.SwingWorker.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

I tried adding a field called bylabel, but that did not help.

kno10 commented 4 years ago

"by label" outlier does not refer to an attribute name, but to a result type - this class requires a label-based reference to compute the gain as defined in the corresponding paper. The VisualizePairWiseGainMatrix class does not visualize an rtree; instead you want to use the default KDDCLIApplication and the NullAlgorithm if you just want to visualize your data (and index). The index visualization can then be enabled in the menus.

Neither integer nor real is a proper ARFF type; see the ARFF format documentation of Weka. The id must not be numeric, or you need to set up -arff.externalid to match the id column; otherwise, it will be used as part of your data! With the parameter -arff.classlabel you can select your outlier column as the class label for evaluation.

An R-tree with this page size does not make any sense! All your data will be in a single page, and you get zero benefit, only overhead, from the index.

CheyenneForbes commented 4 years ago

Thank you. How can I make the maximum number of entries in directory and leaf nodes the same? For my test I want the maximum of both to be 4.

kno10 commented 4 years ago

Not very easily. The page size is chosen in bytes, and internal nodes require about twice as much memory per entry, because they need to store bounding boxes. Hence, if you store point data, you must expect leaf nodes to have almost double the capacity of internal nodes. In a realistic R-tree setting, you control the page size in bytes (set to a value such as 8192, corresponding to the size of a block on the hard disk), not the number of entries.
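To make the arithmetic concrete, here is a rough sketch of how a page size in bytes translates into node capacities. The entry layout is an assumption for illustration only (a 4-byte id plus 8-byte doubles, no page header), not ELKI's actual serialization; the class and method names are hypothetical. A leaf entry stores one point (d coordinates), while a directory entry stores a bounding box (2d coordinates).

```java
public class PageCapacitySketch {
    // Assumed sizes, for illustration only; the real on-disk layout may differ.
    static final int DOUBLE_BYTES = 8; // one coordinate
    static final int ID_BYTES = 4;     // object id or child page id

    // A leaf entry holds an id plus one point with 'dim' coordinates.
    public static int leafEntryBytes(int dim) {
        return ID_BYTES + dim * DOUBLE_BYTES;
    }

    // A directory entry holds a child id plus a bounding box (min and max per dimension).
    public static int directoryEntryBytes(int dim) {
        return ID_BYTES + 2 * dim * DOUBLE_BYTES;
    }

    // Number of whole leaf entries that fit into one page (page header ignored).
    public static int leafCapacity(int pageSize, int dim) {
        return pageSize / leafEntryBytes(dim);
    }

    // Number of whole directory entries that fit into one page (page header ignored).
    public static int directoryCapacity(int pageSize, int dim) {
        return pageSize / directoryEntryBytes(dim);
    }

    public static void main(String[] args) {
        int dim = 2; // 2-dimensional point data
        for (int pageSize : new int[] { 80, 144, 8192 }) {
            System.out.printf("page size %d bytes: leaf capacity %d, directory capacity %d%n",
                pageSize, leafCapacity(pageSize, dim), directoryCapacity(pageSize, dim));
        }
    }
}
```

Under these assumptions, a 2-dimensional leaf entry takes 20 bytes and a directory entry 36 bytes, so an 8192-byte page holds 409 leaf entries but only 227 directory entries. A page small enough to cap leaves at 4 entries (80 bytes) leaves room for just 2 directory entries, and one sized for 4 directory entries (144 bytes) fits 7 leaf entries, which is why a single page size does not give both node types the same maximum.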