Make ValueStore more resilient against data corruption

ebner commented 1 week ago

Problem description

I am currently facing a situation where a repository somehow got corrupted (still investigating) and where I am unable to process the uncorrupted data due to an unhandled error situation in ValueStore. When iterating over all triples in the repo, some statement cannot be loaded properly and that causes the whole RepositoryConnection to be terminated prematurely. The stack trace is as follows:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at org.eclipse.rdf4j.sail.nativerdf.ValueStore.data2value(ValueStore.java:529)
    at org.eclipse.rdf4j.sail.nativerdf.ValueStore.getValue(ValueStore.java:192)
    at org.eclipse.rdf4j.sail.nativerdf.NativeStatementIterator.getNextElement(NativeStatementIterator.java:70)
    at org.eclipse.rdf4j.sail.nativerdf.NativeStatementIterator.getNextElement(NativeStatementIterator.java:28)
    at org.eclipse.rdf4j.common.iteration.LookAheadIteration.lookAhead(LookAheadIteration.java:78)
    at org.eclipse.rdf4j.common.iteration.LookAheadIteration.hasNext(LookAheadIteration.java:53)
    at org.eclipse.rdf4j.common.iteration.DualUnionIteration.lookaheadWithoutOrder(DualUnionIteration.java:176)
    at org.eclipse.rdf4j.common.iteration.DualUnionIteration.lookAhead(DualUnionIteration.java:119)
    at org.eclipse.rdf4j.common.iteration.DualUnionIteration.hasNext(DualUnionIteration.java:93)
    at org.eclipse.rdf4j.common.iteration.IterationWrapper.hasNext(IterationWrapper.java:67)
    at org.eclipse.rdf4j.sail.base.SailClosingIteration.hasNext(SailClosingIteration.java:83)
    at org.eclipse.rdf4j.common.iteration.IterationWrapper.hasNext(IterationWrapper.java:67)
    at org.eclipse.rdf4j.sail.helpers.SailBaseIteration.hasNext(SailBaseIteration.java:43)
    at org.eclipse.rdf4j.sail.helpers.CleanerIteration.hasNext(CleanerIteration.java:42)
    at org.eclipse.rdf4j.repository.sail.SailCloseableIteration.hasNext(SailCloseableIteration.java:40)
    at org.eclipse.rdf4j.repository.RepositoryResult.hasNext(RepositoryResult.java:58)

The problem here is that the first line in ValueStore.data2value expects a non-empty array: switch (data[0]) {

Since the data for some reason could not be loaded the data array is not properly initialized. The following ArrayIndexOutOfBoundsException is not caught and propagates all the way up to the RepositoryConnection which breaks.

Preferred solution

Add parameter checks to ValueStore.data2value and log an error if a statement cannot be read. This way uncorrupted data of a repo can still be read. With the current situation nothing can be done with the repo, the uncorrupted statements cannot be exported/serialized and the data needs to be restored from a backup. The latter is of course the preferred solution for corrupted repos, but in case of a silent corruption that has been going on for some time the uncorrupted repo may be out of the backup cycle.

Are you interested in contributing a solution yourself?

Perhaps?

Alternatives you've considered

No response

Anything else?

No response

kenwenzel commented 1 week ago

We are also facing the same problem. The best solution is trying to prevent the corruption by using setForceSync(true). If we have a corrupted value store then the question is what kind of resource or literal should be returned if the value is either corrupted or not available at all as in the end we need a valid Value object. Maybe we could introduce special values for UNKNOWN and CORRUPTED.

ebner commented 1 week ago

Usage of setForceSync(true) is a great tip, have you experienced any performance penalty through that?

I solved my problem with a hacky modification of ValueStore.getValue(int id):

if (data != null) {
    if (data.length == 0 ) {
        resultValue = new NativeLiteral(revision, "CORRUPT");
    } else {
        resultValue = data2value(id, data);
    }

    // Store value in cache
    valueCache.put(cacheID, resultValue);
}

This allowed me to export the whole repo and search for any corrupted values.

I'm willing to contribute a fix that solves the problem in a more elegant way and that can also be merged upstream - I'm thinking of using a RepositoryException when a value cannot be read, this way such an error can be caught e.g. during iteration. It would be possible to continue to the next statement, without having the whole RepositoryConnection failing.

This would of course mean that the whole statement is "lost" and not only the value that couldn't be read. Your idea of introducing a special CORRUPTED value sounds interesting, but it would require that all statements are always checked for such a special value. This would add complexity to statement handling with probably little benefit (a lost value is almost as bad as a whole lost statement, unless all statements with corrupted values are post-processed manually to attempt a data restore), that's why I'm currently in favor of a "quick fail" by a RepositoryException.

hmottestad commented 1 week ago

Could also be a config option, something like a system property org.eclipse.rdf4j.nativestore.softFailOnCorruptData.

I do like the idea of having the exception be catchable within an interation so that you can continue on to the next statement.

Documentation would also be useful.

I've heard btw that it's possible to corrupt the ValueStore by inserting a particular Literal. We don't know which literal is causing the corruption, or why. All I know is that someone had their NativeStore become corrupt twice, without any power outage.

eclipse-rdf4j / rdf4j