elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Cannot find a usable implementation of interface elki.database.ids.DBIDFactory #105

Closed zuoxiang95 closed 1 year ago

zuoxiang95 commented 1 year ago

Hello, I am using the LOF implementation to detect outliers. My programming language is Scala. When I call computeScores(), I get the following error:

Caused by: elki.utilities.exceptions.AbortException: Cannot find a usable implementation of interface elki.database.ids.DBIDFactory
    at elki.utilities.ClassGenericsUtil.instantiateLowlevel(ClassGenericsUtil.java:223)
    at elki.utilities.ClassGenericsUtil.loadDefault(ClassGenericsUtil.java:240)
    at elki.database.ids.DBIDFactory.<clinit>(DBIDFactory.java:48)
    ... 5 more

Here is my code:

package com.jd.analysis

import elki.data.`type`.TypeUtil
import elki.outlier.lof.LOF
import elki.datasource.ArrayAdapterDatabaseConnection
import elki.database.StaticArrayDatabase
import elki.database.ids.DBIDRange
import elki.distance.minkowski.EuclideanDistance

import scala.collection.mutable.ListBuffer

case class ElkiLOF(k: Int){
  def computeScores(instances: Array[Array[Double]]): Array[(Int, Double)] = {
    val distance = new EuclideanDistance
    val lof = new LOF(k, distance)

    // Adapter to load data from an existing array.
    val dbc = new ArrayAdapterDatabaseConnection(instances)

    // Create a database (which may contain multiple relations!)
    val db = new StaticArrayDatabase(dbc)
    db.initialize()

    val rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD)
    val ids = rel.getDBIDs().asInstanceOf[DBIDRange]
    val result = lof.autorun(db).getScores

    // Collect the scores in DBID iteration order.
    val scoreList = new ListBuffer[Double]()
    val iter = result.iterDBIDs()
    while (iter.valid) {
      scoreList += result.doubleValue(iter)
      iter.advance()
    }

    // Replace NaN and infinite scores with 1.0 (or whatever value you'd prefer).
    val corrected = scoreList.map(d => if (d.isNaN || d.isInfinity) 1.0 else d)
    // Pair each score with its position: (index, score).
    corrected.toArray.zipWithIndex.map { case (score, idx) => (idx, score) }
  }
}

I would be very grateful if you could help me out.

kno10 commented 1 year ago

Your classpath is incomplete, hence it cannot load the class. Make sure to have all the ELKI modules on the classpath, in particular elki-core-dbids-int. (The error is not in the code above; this is a class-load-time error.)
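
A quick way to confirm this kind of problem is to check at runtime whether a class is visible on the classpath. The following is only a minimal sketch: `isOnClasspath` is a hypothetical helper, not part of ELKI, and the class name `elki.database.ids.integer.TrivialDBIDFactory` is an assumption about what elki-core-dbids-int provides.

```scala
import scala.util.Try

// Hypothetical diagnostic helper (not part of ELKI): reports whether a class
// can be located by the context class loader. Passing `false` avoids running
// the class's static initializers during the lookup.
def isOnClasspath(className: String): Boolean =
  Try(Class.forName(className, false, Thread.currentThread().getContextClassLoader)).isSuccess

// A JDK class is always found.
println(isOnClasspath("java.lang.String"))

// Assumed name of a DBIDFactory implementation; the result depends on
// whether elki-core-dbids-int is on your classpath.
println(isOnClasspath("elki.database.ids.integer.TrivialDBIDFactory"))
```

If the second check reports `false` while the first reports `true`, the module is probably missing from the build's dependencies, assuming the class name above is correct.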

zuoxiang95 commented 1 year ago

@kno10, thanks for your reply. After adding the module elki-core-dbids-int, it runs successfully. Another question: how do I use an index to accelerate the detection?

    val indexfactory = new ELKIBuilder[RStarTreeFactory[_ <: NumberVector]](classOf[RStarTreeFactory[_ <: NumberVector]])
      .`with`(AbstractPageFileFactory.Par.PAGE_SIZE_ID, 512)
      .`with`(RStarTreeFactory.Par.BULK_SPLIT_ID, classOf[SortTileRecursiveBulkSplit]).build

But the compiler cannot find BULK_SPLIT_ID in RStarTreeFactory.Par.

kno10 commented 1 year ago

ELKI 0.8.0 should automatically add a suitable index; it will prefer the VP tree which usually is faster; the R*-tree was designed for on-disk operation when main memory was much more scarce than today.

RStarTreeFactory.Par.BULK_SPLIT_ID is supposedly fine, but it is inherited from the abstract class. Maybe Scala does not resolve these constants then?

https://github.com/elki-project/elki/search?q=BULK_SPLIT_ID

You can fall back to passing the string, too.
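
Scala does not expose Java static members through a subclass, so the constant has to be referenced on the class that declares it, or replaced by its string key. The sketch below shows both options under stated assumptions: the package names are those of ELKI 0.8.0 as far as I know, `AbstractRStarTreeFactory.Par` is assumed to be the declaring class, and the key `"spatial.bulkstrategy"` is assumed to be the string behind BULK_SPLIT_ID. Please verify these against the source search linked above.

```scala
import elki.data.NumberVector
import elki.index.tree.spatial.rstarvariants.AbstractRStarTreeFactory
import elki.index.tree.spatial.rstarvariants.rstar.RStarTreeFactory
import elki.index.tree.spatial.rstarvariants.strategies.bulk.SortTileRecursiveBulkSplit
import elki.persistent.AbstractPageFileFactory
import elki.utilities.ELKIBuilder

// Option 1: reference the constant on the (assumed) declaring abstract class.
val viaDeclaringClass =
  new ELKIBuilder[RStarTreeFactory[_ <: NumberVector]](classOf[RStarTreeFactory[_ <: NumberVector]])
    .`with`(AbstractPageFileFactory.Par.PAGE_SIZE_ID, 512)
    .`with`(AbstractRStarTreeFactory.Par.BULK_SPLIT_ID, classOf[SortTileRecursiveBulkSplit])
    .build()

// Option 2: fall back to the string key (assumed to be "spatial.bulkstrategy").
val viaStringKey =
  new ELKIBuilder[RStarTreeFactory[_ <: NumberVector]](classOf[RStarTreeFactory[_ <: NumberVector]])
    .`with`(AbstractPageFileFactory.Par.PAGE_SIZE_ID, 512)
    .`with`("spatial.bulkstrategy", classOf[SortTileRecursiveBulkSplit])
    .build()
```

Both variants need the ELKI index modules on the classpath.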

zuoxiang95 commented 1 year ago

@kno10 Thanks for your kind reply. I am using the algorithm elki.outlier.lof.LOF with the parameter k=5. When my dataset has 5 identical points, the cluster density becomes infinite, and the scores of the other points become infinite too. Do you have any suggestions?

kno10 commented 1 year ago

There are only four other points (in this implementation, the parameter is defined to not include the point itself) and hence all the reachability distances are infinite (or undefined, but infinite is more useful if you have truncated neighbors).

At the same time, it makes little sense to use a local outlier detection method on such tiny data!
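
Because the query point itself is not counted, a dataset with n points offers at most n - 1 neighbors per point. A caller could guard against truncated neighborhoods before running LOF. This is a minimal sketch; `hasEnoughNeighbors` is a hypothetical helper and not part of ELKI.

```scala
// Hypothetical helper (not part of ELKI): checks that each point can have k
// neighbors when the point itself is excluded from its own neighborhood.
def hasEnoughNeighbors(numPoints: Int, k: Int): Boolean =
  k >= 1 && numPoints - 1 >= k

// Five identical points, as in the question above.
val fivePoints = Array.fill(5)(Array(1.0, 2.0))

println(hasEnoughNeighbors(fivePoints.length, 5)) // false: only four other points exist
println(hasEnoughNeighbors(fivePoints.length, 4)) // true
```

Passing this check only means the neighborhoods are not truncated; it does not by itself guarantee finite scores when many points are duplicates.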