elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
780 stars 321 forks source link

Re-run of DeLiClu causes Exception #45

Closed neilireson closed 6 years ago

neilireson commented 6 years ago

There seems to be a bug in the implementation which means multiple runs of DeLiClu causes an Exception "DeLiClu heap was empty when it shouldn't have been.".

This does not happen if the index is rebuild at each iteration, and there is no issue using other OPTICS Algorithms.

for (int minPnts : new int[]{5, 10, 15, 20}) {
    if (rebuildIndex) {
            db = new StaticArrayDatabase(dbc, Collections.singletonList(indexFactory));
            db.initialize();
            relations = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
    }

    clustering = new ELKIBuilder<>(OPTICSXi.class) //
       .with(DeLiClu.Parameterizer.MINPTS_ID, minPoints) //
       .with(OPTICSXi.Parameterizer.XI_ID, xi) //
       .with(OPTICSXi.Parameterizer.XIALG_ID, DeLiClu.class) //
       .build().run(db);
}

It seems that running DeLiClu may be altering the data index. I'm unsure if it's related but using DeLiClu I can also get an ObjectNotFoundException with the following...

  for (Cluster<? extends Model> cluster : clustering.getAllClusters()) {
      for (DBIDIter it = cluster.getIDs().iter(); it.valid(); it.advance()) {
            try {
                double[] latlng = relations.get(it).toArray();
            }
            catch(ObjectNotFoundException e) {
                logger.error(e.getLocalizedMessage());
            }
        }
  }
kno10 commented 6 years ago

Yes. DeLiClu indexes are not reusable. It's the way the algorithm is designed; it would likely be best to push the index construction into the DeLiClu algorithm invocation (so it is easier to configure, too), but that was not possible back when DeLiClu was first added: indexed databases were literally special databases back then.

For the second issue, if you reinitialize the database, then you will get new ids each time. The first run will likely have ids 0...n-1, the second one n..2n-1 etc. If you then use the relation of a different database (and thus, relation), the ids will not be found. Possible ways out: A) don't rebuild the database, but only remove and re-build the index. B) keep the correct relation for each result, C) use fixed DBIDs, D) map results to your own persistent identifiers.

neilireson commented 6 years ago

Thanks for the amazingly speedy reply.

For the first issue: Just to confirm... is this only a special case with DeLiClu, am I OK to re-run the DBSCAN and OPTICS algorithms on the same index, multiple times, with different parameters, without any fear of "cross-contamination" between the runs?

For the second issue. The Exception happens on a fresh build of the index. So it does seem to be possible that running DeLiClu is removing the objects/references or adding new ones.

kno10 commented 6 years ago

Yes, DeLiClu is special.

I can't debug that for you. But that is my guess, using DBIDs on a different relation. You overwrite variables in your code.

kno10 commented 6 years ago

With commit 4b428f812c9bb65e750ba71d8b5685e3ca4c8106 the DeLiClu index will be created by DeLiClu itself (so you no longer need to set it up before using DeLiClu), and it will not be attached to the database as a regular index. Then running DeLiClu multiple times should be a bit easier.