Open kenwenzel opened 2 years ago
I have been thinking about this as well, and I had a variation of option 1 in mind. Which is instead of using the generic java collections use specialized collection implementations handed out by a SAIL.
I started on this with the makeSet method in the EvaluationStrategy. This can be expanded to a method Set<v extends Value) makeValueSet()
, a set can make sure that anything put into it is transformed into a store backed value if possible. Some rough code below to illustrate the idea.
public class LmdbValueSet extends AbstractSet<Value> {
private final ValueStore backingValueStore;
private final org.eclipse.collections.api.set.primitive.LongSet storeKnownValues = new ...
private final HashSet<Values> notKnownToStoreValues = new ...
public boolean add(Value v) {
if (v instanceof LmdbValue lv && lv.getValueStoreRevision().getValueStore() == backingValueStore) {
return storeKnownValues.add(lv.getInternalID());
} else {
long id = backingValueStore.getId(v, false);
if (id == LmdbValue.UNKNOWN_ID) {
return notKnownToStoreValues.add(v);
} else {
return storeKnownValues.add(id);
}
}
return false;
}
//similar delegating logic for all other methods.
}
This set being specific to the LmdbStore can implement much more memory and cpu efficient code. Also these collections may be transparently backed by a disk based store if required.
For the rest a better hashCode algorithm in general would be nice. As the one used provide by String
is not that good in any case.
@JervenBolleman This looks interesting. Have you thought about how this would integrate with custom service functions and/or federation?
@kenwenzel I have only thought about this within the context of query evaluation. But if these makeValueSet/makeSet/makeMap/makeList methods are on the store/sail level then I think that they should be usable for anything that depends on one including the RDFSreasoning/custom functions level.
My original thought was really about the GroupIterator and it use of MapDB, which is a bit silly for your LMDB based implementation.
Some more ideas:
Async IO to produce Values that will be materialized in the future. Async Nio has a nifty way of handling closing of a resource where you can call one method and have it finish complete all the futures. We could the return an iterator from the store and have it materialize everything when it gets closed. Or the inverse of forcing materialization during iteration if we need to close a file handle.
Java 9+ also has a Cleaner class that can be used to close and release things similarly to finalize.
When calling .equals(...) with two not-yet-materialized Values these can use their own logic to realize that they can use their internal values instead of materializing, but this doesn't work for hashCode().
What we could do for hashCode() is to add a localHashCode() method through an interface, and have a wrapper class that maps hashCode() to localHashCode(). The goal is to do new HashSet<LazyValue<NativeValue>>()
. That way we are certain that we only put NativeValue objects in the set, so can always use the localHashCode() method and never materialize the data.
Problem description
The NativeStore and LmdbStore currently use a separate value database that contains mappings from (internal) value IDs to the corresponding data.
Each time when a value instance is required the value database is queried and an object of the corresponding type (IRI, BNode, Literal) is created.
The LmdbStore already has an optimization in that it creates lazy values: The initialization of the data is postponed until it is required, e.g., for
equals()
orhashCode()
. This is possible because the value type is encoded within the internal ID.With this optimization the
equals()
method only requires materialization if an internal LmdbValue and an external value are checked for equality. But thehashCode()
method always requires materialization which is a problem if lazy values are put into a hash map.We mainly have three options to improve this situation:
[1 byte value type + other flags, up to 4 bytes for hash code, up to 3 bytes for sequence number]
Preferred solution
I think that number 2 would be the easiest solution while number 1 would probably be the fastest.
Are you interested in contributing a solution yourself?
Perhaps?
Alternatives you've considered
No response
Anything else?
No response