JanusGraph SchemaAction.REINDEX does not remove stale data from index backend

wojciechwojcik commented 7 years ago

According to: http://docs.janusgraph.org/0.1.0/indexes.html

When you do:

JanusGraphManagement mgmt = graph.openManagement();
JanusGraphIndex index = mgmt.getGraphIndex("indexName");
mgmt.updateIndex(index, SchemaAction.REINDEX).get();
mgmt.commit();

You'd expect the index data to fully reflect what is currently stored in the graph database. However, the database scan works only one way: all existing vertices from the storage backend are re-added to the index backend.

Sometimes, when a vertex deletion does not get propagated to the index (e.g. see #329), the index can contain vertices that are no longer in the graph.

This causes the following issues:
1) Queries that use the index still return such deleted vertices, with their ids, without performing any checks or logging any errors.
2) The reindexing action does not fix this.

The only workaround is to drop/clear the index manually before re-indexing. This is often time-consuming and leaves the index non-operational until re-indexing has completely finished.

The reindexing action could be improved to:

incrementally query the index backend and check the existence of vertex ids in the graph
remove stale data from the index
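
To make the two steps above concrete, here is a minimal sketch of such a sweep. It is not JanusGraph code: the index backend and the graph are modelled as plain in-memory sets of vertex ids, and `StaleIndexSweep`, `sweep` and `batchSize` are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model, NOT the JanusGraph API:
// both the index backend and the graph are plain sets of vertex ids.
class StaleIndexSweep {

    /**
     * Walks the index in batches and removes every id that is missing from the graph.
     * Returns the removed (stale) ids in the order they were visited.
     */
    static List<Long> sweep(Set<Long> indexIds, Set<Long> graphIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Long> removed = new ArrayList<>();
        List<Long> batch = new ArrayList<>(batchSize);
        // Iterate over a snapshot so the index can be modified while we walk it.
        for (Long id : new ArrayList<>(indexIds)) {
            batch.add(id);
            if (batch.size() == batchSize) {
                removed.addAll(purgeBatch(batch, indexIds, graphIds));
                batch.clear();
            }
        }
        // Handle the last, possibly partial, batch.
        removed.addAll(purgeBatch(batch, indexIds, graphIds));
        return removed;
    }

    // Checks one batch against the graph and deletes the stale ids from the index.
    private static List<Long> purgeBatch(List<Long> batch, Set<Long> indexIds, Set<Long> graphIds) {
        List<Long> stale = new ArrayList<>();
        for (Long id : batch) {
            if (!graphIds.contains(id)) {
                stale.add(id);
            }
        }
        indexIds.removeAll(stale);
        return stale;
    }

    public static void main(String[] args) {
        Set<Long> index = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> graph = new HashSet<>(Arrays.asList(1L, 3L, 5L, 6L));
        System.out.println("removed: " + sweep(index, graph, 2));
        System.out.println("index after sweep: " + index);
    }
}
```

Note that the sweep only deletes; vertices present in the graph but missing from the index would still be added by the existing one-way REINDEX scan.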

Quick and dirty way to reproduce: create a mixed index, fill in some data, drop the Cassandra storage backend (or some part of it) to simulate a failure, restart JanusGraph and run some queries using the mixed index.

robertdale commented 7 years ago

I think there are a few options to explore for reindexing:

iterate through each item in the index, add/update/delete as you suggest
drop, create, reindex
create tmp index, reindex, drop active index, rename tmp index
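
As a rough illustration of the third option, the sketch below builds a temporary index on the side and swaps it in with a single atomic step, so readers never see an empty or half-built index. It is an in-memory model only: `SwappableIndex` and `rebuildFrom` are hypothetical names, and nothing here reflects how JanusGraph or its index backends actually handle renaming.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory model, NOT the JanusGraph API:
// the "index" is an immutable set of vertex ids behind an atomic reference.
class SwappableIndex {
    private final AtomicReference<Set<Long>> active;

    SwappableIndex(Set<Long> initial) {
        active = new AtomicReference<>(Collections.unmodifiableSet(new TreeSet<>(initial)));
    }

    // Queries always read whichever index is currently active.
    boolean contains(long id) {
        return active.get().contains(id);
    }

    /**
     * Builds a temporary index from the graph's ids, then swaps it in atomically.
     * Returns the contents of the index that was replaced.
     */
    Set<Long> rebuildFrom(Iterable<Long> graphIds) {
        Set<Long> tmp = new TreeSet<>();
        for (Long id : graphIds) {
            tmp.add(id); // the old index keeps serving queries while this runs
        }
        return active.getAndSet(Collections.unmodifiableSet(tmp));
    }

    public static void main(String[] args) {
        SwappableIndex index = new SwappableIndex(new TreeSet<>(Arrays.asList(1L, 2L, 3L)));
        Set<Long> old = index.rebuildFrom(Arrays.asList(1L, 3L, 4L));
        System.out.println("replaced index: " + old);
        System.out.println("stale id 2 still indexed? " + index.contains(2L));
    }
}
```

Compared with drop, create, reindex, the trade-off in this model is extra space for the temporary copy in exchange for an index that stays queryable throughout.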

wojciechwojcik commented 7 years ago

Or we could even provide a couple of the above options by adding new enum values to SchemaAction (while keeping the current one for backwards compatibility), e.g. SchemaAction.REINDEX, SchemaAction.REINDEX_DROP, SchemaAction.REINDEX_SYNC, etc.
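
A sketch of what that could look like is below. To avoid implying anything about the real SchemaAction, it uses a stand-alone, hypothetical enum `ReindexMode`; only REINDEX exists today, and the two flags are assumptions that summarise the behaviour discussed in this thread.

```java
// Hypothetical stand-alone enum, NOT the real org.janusgraph SchemaAction.
// REINDEX_DROP and REINDEX_SYNC are proposed names from this thread only.
enum ReindexMode {
    // Current behaviour: re-adds existing vertices, leaves stale entries in place.
    REINDEX(false, true),
    // Drop/clear the index first, then reindex: clean result, index offline meanwhile.
    REINDEX_DROP(true, false),
    // Walk the index and delete entries whose vertices no longer exist.
    REINDEX_SYNC(true, true);

    private final boolean removesStaleEntries;
    private final boolean indexStaysQueryable;

    ReindexMode(boolean removesStaleEntries, boolean indexStaysQueryable) {
        this.removesStaleEntries = removesStaleEntries;
        this.indexStaysQueryable = indexStaysQueryable;
    }

    boolean removesStaleEntries() {
        return removesStaleEntries;
    }

    boolean indexStaysQueryable() {
        return indexStaysQueryable;
    }

    public static void main(String[] args) {
        for (ReindexMode mode : values()) {
            System.out.println(mode + ": removesStale=" + mode.removesStaleEntries()
                    + ", staysQueryable=" + mode.indexStaysQueryable());
        }
    }
}
```

Keeping REINDEX as the first value with its current semantics is what would preserve backwards compatibility for existing callers.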

JanusGraph SchemaAction.REINDEX does not remove stale data from index backend #354