eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Compare values without materialization #3797

Open kenwenzel opened 2 years ago

kenwenzel commented 2 years ago

Problem description

The NativeStore and LmdbStore currently use a separate value database that contains mappings from (internal) value IDs to the corresponding data.

Each time a value instance is required, the value database is queried and an object of the corresponding type (IRI, BNode, Literal) is created.

The LmdbStore already has an optimization in that it creates lazy values: The initialization of the data is postponed until it is required, e.g., for equals() or hashCode(). This is possible because the value type is encoded within the internal ID.

With this optimization the equals() method only requires materialization when an internal LmdbValue and an external value are checked for equality. The hashCode() method, however, always requires materialization, which is a problem when lazy values are put into a hash map.
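For illustration, here is a minimal sketch of the lazy-value pattern described above (hypothetical names, not the actual LmdbValue code): equals() between two internal values can compare the internal IDs, while hashCode() has to load the data so that it agrees with the hash codes of external values.

// Hypothetical sketch, not the actual LmdbValue implementation.
abstract class LazyStoreValue {

    private final long internalId; // the value type is encoded in the ID
    private String data;           // lexical data, loaded from the value database on demand

    LazyStoreValue(long internalId) {
        this.internalId = internalId;
    }

    /** Queries the value database (expensive). */
    protected abstract String lookupData();

    protected String data() {
        if (data == null) {
            data = lookupData();
        }
        return data;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof LazyStoreValue) {
            // both values are internal: comparing IDs is enough, no materialization
            return internalId == ((LazyStoreValue) o).internalId;
        }
        // external value: the data has to be materialized to compare it
        return o != null && data().equals(o.toString());
    }

    @Override
    public int hashCode() {
        // must agree with the hash codes of external values, so materialization is unavoidable
        return data().hashCode();
    }
}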

We mainly have three options to improve this situation:

  1. use a global ID service in RDF4J and compare values solely based on those IDs (and also use those IDs for hashCode)
    • It is hard to implement global IDs because different SAILs may use different IDs. Before two values can be compared, they have to go through the global ID service.
  2. encode the hash codes within the IDs (see the sketch after this list):
    • possible scheme: [1 byte value type + other flags, up to 4 bytes for hash code, up to 3 bytes for sequence number]
    • This approach requires that the hash codes are stable between current and future versions of RDF4J.
    • With 3 bytes for the sequence number, we can store up to 2^24 = 16,777,216 values per hash code.
  3. use a persistent mapping table ID -> hash code:
    • This might also be a feasible approach that prevents full materialization, but it would be slower than option 2 because it requires touching the mapping table (possibly a part of the value store).
    • This approach requires that the hash codes are stable between current and future versions of RDF4J.
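
A minimal sketch of the ID layout from option 2 (hypothetical helper, not RDF4J API): one byte for the value type and flags, four bytes for the hash code, three bytes for a per-hash sequence number, packed into a single 64-bit ID so that hashCode() can be derived from the ID alone.

// Hypothetical helper illustrating option 2; not part of RDF4J.
public final class HashEncodedId {

    private static final int TYPE_SHIFT = 56;             // top byte: value type + flags
    private static final int HASH_SHIFT = 24;             // next 4 bytes: hash code
    private static final long SEQ_MASK = (1L << 24) - 1;  // low 3 bytes: sequence number

    private HashEncodedId() {
    }

    /** Packs type, hash code and sequence number (< 2^24) into one long ID. */
    public static long encode(byte type, int hashCode, int sequence) {
        if (sequence < 0 || sequence > SEQ_MASK) {
            throw new IllegalArgumentException("sequence number does not fit into 3 bytes: " + sequence);
        }
        return ((type & 0xFFL) << TYPE_SHIFT)
                | ((hashCode & 0xFFFFFFFFL) << HASH_SHIFT)
                | (sequence & SEQ_MASK);
    }

    /** Recovers the hash code without touching the value database. */
    public static int hashCodeOf(long id) {
        return (int) (id >>> HASH_SHIFT);
    }

    /** Recovers the value type byte. */
    public static byte typeOf(long id) {
        return (byte) (id >>> TYPE_SHIFT);
    }
}

With this layout, lazy values can return hashCodeOf(internalId) from hashCode() without materialization; the caveat above still applies, i.e. the hash function must stay stable across RDF4J versions.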

Preferred solution

I think that option 2 would be the easiest solution, while option 1 would probably be the fastest.

Are you interested in contributing a solution yourself?

Perhaps?

Alternatives you've considered

No response

Anything else?

No response

JervenBolleman commented 2 years ago

I have been thinking about this as well, and I had a variation of option 1 in mind: instead of using the generic Java collections, use specialized collection implementations handed out by a SAIL.

I started on this with the makeSet method in the EvaluationStrategy. This could be expanded to a method <V extends Value> Set<V> makeValueSet(); such a set can make sure that anything put into it is transformed into a store-backed value if possible. Some rough code below illustrates the idea.

public class LmdbValueSet extends AbstractSet<Value> {
    private final ValueStore backingValueStore;
    private final org.eclipse.collections.api.set.primitive.MutableLongSet storeKnownValues =
            new org.eclipse.collections.impl.set.mutable.primitive.LongHashSet();
    private final HashSet<Value> notKnownToStoreValues = new HashSet<>();

    public LmdbValueSet(ValueStore backingValueStore) {
        this.backingValueStore = backingValueStore;
    }

    @Override
    public boolean add(Value v) {
        if (v instanceof LmdbValue lv && lv.getValueStoreRevision().getValueStore() == backingValueStore) {
            return storeKnownValues.add(lv.getInternalID());
        }
        long id = backingValueStore.getId(v, false);
        if (id == LmdbValue.UNKNOWN_ID) {
            return notKnownToStoreValues.add(v);
        }
        return storeKnownValues.add(id);
    }

    // similar delegating logic for all other methods (size(), iterator(), contains(), ...)
}

Being specific to the LmdbStore, this set can be implemented with much more memory- and CPU-efficient code. These collections could also be transparently backed by a disk-based store if required.

Beyond that, a better hashCode algorithm in general would be nice, as the one provided by String is not that good in any case.

kenwenzel commented 2 years ago

@JervenBolleman This looks interesting. Have you thought about how this would integrate with custom service functions and/or federation?

JervenBolleman commented 2 years ago

@kenwenzel I have only thought about this within the context of query evaluation. But if these makeValueSet/makeSet/makeMap/makeList methods are on the store/SAIL level, then I think they should be usable by anything that depends on one, including RDFS reasoning and custom functions.

My original thought was really about the GroupIterator and its use of MapDB, which is a bit silly for your LMDB-based implementation.
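
A rough sketch of what such store-provided collection factories could look like (hypothetical interface and method names, not an existing RDF4J API):

import java.util.List;
import java.util.Map;
import java.util.Set;

import org.eclipse.rdf4j.model.Value;

// Hypothetical factory a store/SAIL could hand out; names are illustrative only.
public interface StoreCollectionFactory {

    /** A Set of Values that may store internal IDs instead of materialized values (cf. LmdbValueSet above). */
    Set<Value> createValueSet();

    /** A general-purpose Set, possibly spilling to disk for large intermediate results. */
    <T> Set<T> createSet();

    /** A Map keyed by Values, e.g. for the grouping keys of the GroupIterator. */
    <V> Map<Value, V> createValueKeyedMap();

    /** A List, possibly disk-backed, for ordered intermediate results. */
    <T> List<T> createList();
}

Query evaluation, reasoning and custom functions would then request their collections from the SAIL instead of instantiating HashSet or MapDB-backed collections directly.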

hmottestad commented 2 years ago

Some more ideas:

Async IO could be used to produce Values that will be materialized in the future. Asynchronous NIO has a nifty way of handling the closing of a resource: you can call one method and have it complete all the outstanding futures. We could then return an iterator from the store and have it materialize everything when it gets closed. Or the inverse: force materialization during iteration if we need to close a file handle.

Java 9+ also has a Cleaner class that can be used to close and release things, similarly to finalize().
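
A short sketch of the Cleaner approach (ValuePageReader is a made-up name; Cleaner itself is the real java.lang.ref.Cleaner API from Java 9+):

import java.io.IOException;
import java.lang.ref.Cleaner;
import java.nio.channels.FileChannel;

// Illustrative only: releases a file handle when the owner becomes unreachable.
public class ValuePageReader implements AutoCloseable {

    private static final Cleaner CLEANER = Cleaner.create();

    // The cleaning action must not reference the outer object, otherwise it never becomes unreachable.
    private static final class State implements Runnable {
        private final FileChannel channel;

        State(FileChannel channel) {
            this.channel = channel;
        }

        @Override
        public void run() {
            try {
                channel.close();
            } catch (IOException e) {
                // best effort; nothing sensible to do during cleaning
            }
        }
    }

    private final Cleaner.Cleanable cleanable;

    public ValuePageReader(FileChannel channel) {
        this.cleanable = CLEANER.register(this, new State(channel));
    }

    @Override
    public void close() {
        // explicit close; the cleaner only acts as a safety net
        cleanable.clean();
    }
}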

When calling .equals(...) on two not-yet-materialized Values, they can use their own logic to compare their internal values instead of materializing, but this doesn't work for hashCode().

What we could do for hashCode() is to add a localHashCode() method through an interface, and have a wrapper class that maps hashCode() to localHashCode(). The goal is to be able to do new HashSet<LazyValue<NativeValue>>(). That way we are certain that we only put NativeValue objects in the set, so we can always use the localHashCode() method and never materialize the data.
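
A minimal sketch of that wrapper idea (LocalHashCode, localHashCode() and LazyValue are hypothetical names used for illustration):

import java.util.Objects;

// Hypothetical interface for values that can hash themselves from store-internal data.
interface LocalHashCode {
    /** Hash code derived from the internal ID, without materializing the value. */
    int localHashCode();
}

// Wrapper that redirects hashCode() to localHashCode(), so that a
// HashSet<LazyValue<NativeValue>> never forces materialization.
final class LazyValue<V extends LocalHashCode> {

    private final V value;

    LazyValue(V value) {
        this.value = Objects.requireNonNull(value);
    }

    V unwrap() {
        return value;
    }

    @Override
    public int hashCode() {
        return value.localHashCode();
    }

    @Override
    public boolean equals(Object o) {
        // only values from the same store are wrapped, so delegating to their equals() is safe
        return o instanceof LazyValue && value.equals(((LazyValue<?>) o).value);
    }
}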