JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

JanusGraph SchemaAction.REINDEX does not remove stale data from index backend #354

Open wojciechwojcik opened 7 years ago

wojciechwojcik commented 7 years ago

According to: http://docs.janusgraph.org/0.1.0/indexes.html

When you do:

JanusGraphManagement mgmt = graph.openManagement();
JanusGraphIndex index = mgmt.getGraphIndex("indexName");
mgmt.updateIndex(index, SchemaAction.REINDEX).get();
mgmt.commit();

You'd expect the index data to fully represent what is currently stored in the graph database. However, the database scan works in only one direction: all existing vertices from the storage backend are re-added to the index backend, and nothing is removed from the index.

Sometimes, when a vertex deletion does not get propagated to the index (e.g., see #329), the index can contain vertices that are no longer in the graph.

This causes the following issues:
1) Queries that use the index still return such deleted vertices, with their ids, without performing any checks or logging any errors.
2) The reindexing action does not fix this.
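To make the one-way behaviour concrete, here is a minimal toy model. It is only a sketch: plain sets of vertex ids stand in for the storage backend and the index backend, and the names `OneWayReindexDemo`, `reindexOneWay` and `staleEntries` are made up for illustration. None of it is JanusGraph API.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: sets of vertex ids stand in for the storage backend
// and the index backend. This is not JanusGraph code.
class OneWayReindexDemo {

    // Mimics the behaviour described in this issue: every vertex found
    // in storage is (re-)added to the index; nothing is ever removed.
    static void reindexOneWay(Set<Long> storage, Set<Long> index) {
        index.addAll(storage);
    }

    // Ids that are present in the index but missing from storage.
    static Set<Long> staleEntries(Set<Long> storage, Set<Long> index) {
        Set<Long> stale = new TreeSet<>(index);
        stale.removeAll(storage);
        return stale;
    }

    public static void main(String[] args) {
        Set<Long> storage = new TreeSet<>(Arrays.asList(1L, 2L));
        // Vertex 3 was deleted from storage, but the deletion never
        // reached the index.
        Set<Long> index = new TreeSet<>(Arrays.asList(1L, 2L, 3L));

        reindexOneWay(storage, index);
        System.out.println("stale after reindex: " + staleEntries(storage, index));
    }
}
```

In this model the stale id is still in the index after the reindex, which is the situation a query against the index would then run into.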

The only workaround is to drop or clear the index manually before re-indexing. This is often time-consuming and leaves the index non-operational until re-indexing has completely finished.

Reindexing action could be improved to:

Quick and dirty way to reproduce: create a mixed index, fill in some data, drop the Cassandra storage backend (or some part of it) to simulate a failure, restart JanusGraph, and run some queries using the mixed index.

robertdale commented 7 years ago

I think there are a few options to explore for reindexing:

wojciechwojcik commented 7 years ago

Or we could even provide a couple of the above options by adding new enum values to SchemaAction (leaving the current one for backwards compatibility), e.g. SchemaAction.REINDEX, SchemaAction.REINDEX_DROP, SchemaAction.REINDEX_SYNC, etc.
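A rough sketch of how such enum values could differ, using the same kind of toy model (sets of vertex ids in place of the real backends). Everything here is hypothetical: `ReindexMode` is a stand-in name, REINDEX_DROP and REINDEX_SYNC are only proposed in this thread, and the behaviour given to them below is my guess from the names, not something JanusGraph implements.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not JanusGraph code. Sets of vertex ids stand in
// for the storage backend and the index backend.
enum ReindexMode {
    // Current behaviour as described in this issue: add only.
    REINDEX {
        void apply(Set<Long> storage, Set<Long> index) {
            index.addAll(storage);
        }
    },
    // Assumed meaning: empty the index first, then rebuild it.
    REINDEX_DROP {
        void apply(Set<Long> storage, Set<Long> index) {
            index.clear();
            index.addAll(storage);
        }
    },
    // Assumed meaning: add missing entries and remove stale ones in place.
    REINDEX_SYNC {
        void apply(Set<Long> storage, Set<Long> index) {
            index.addAll(storage);
            index.retainAll(storage);
        }
    };

    abstract void apply(Set<Long> storage, Set<Long> index);
}

class ReindexModeDemo {
    public static void main(String[] args) {
        Set<Long> storage = new TreeSet<>(Arrays.asList(1L, 2L));
        for (ReindexMode mode : ReindexMode.values()) {
            // Id 3 is stale, id 1 is missing from the index.
            Set<Long> index = new TreeSet<>(Arrays.asList(2L, 3L));
            mode.apply(storage, index);
            System.out.println(mode + " -> " + index);
        }
    }
}
```

In this toy model the drop and sync variants end in the same state. The practical difference would be that the drop variant leaves the index empty while it rebuilds, whereas the sync variant keeps it queryable throughout, which is the downside of the manual workaround mentioned earlier in the thread.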