eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Optimised collections that know about how Values are implemented by stores #3843

Open JervenBolleman opened 2 years ago

JervenBolleman commented 2 years ago

Problem description

As discussed in #3797, we often need to materialize values before we can store them in a list. However, we can often do even better if we are able to optimize the collections based on how the values are implemented. For example, a store often uses a primitive long to identify a value. A value set could therefore hold just these primitive ids and regenerate each one as a Value on demand. This would both avoid materialization and improve memory density.

public class LmdbValueSet extends AbstractSet<Value> {
    private final ValueStore backingValueStore;
    // Internal ids of values the backing store knows (the read-only LongSet has no add()).
    private final org.eclipse.collections.api.set.primitive.MutableLongSet storeKnownValues = new ...
    // Values without an id in the store are kept as ordinary objects.
    private final HashSet<Value> notKnownToStoreValues = new ...

    @Override
    public boolean add(Value v) {
        if (v instanceof LmdbValue lv && lv.getValueStoreRevision().getValueStore() == backingValueStore) {
            // Already an LMDB value of this store: keep only its id.
            return storeKnownValues.add(lv.getInternalID());
        } else {
            // Look the value up without creating it (second argument is false).
            long id = backingValueStore.getId(v, false);
            if (id == LmdbValue.UNKNOWN_ID) {
                return notKnownToStoreValues.add(v);
            } else {
                return storeKnownValues.add(id);
            }
        }
    }

    // similar delegating logic for all other methods.
}
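To make the "similar delegating logic" concrete, here is a minimal, self-contained sketch of the same two-bucket idea without any RDF4J or Eclipse Collections dependency. Everything in it is an assumption for illustration: `IdBackedSet`, the `idOf` and `valueOf` functions, and the `UNKNOWN_ID` sentinel are hypothetical stand-ins for the value store lookups, and a `HashSet<Long>` takes the place of a primitive long set, so it does not show the memory savings.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.LongFunction;
import java.util.function.ToLongFunction;
import java.util.stream.Stream;

/** Hypothetical stand-in for a store-aware value set; V plays the role of Value. */
class IdBackedSet<V> extends AbstractSet<V> {
    static final long UNKNOWN_ID = -1L; // assumed sentinel for "not in the store"

    private final ToLongFunction<Object> idOf;  // value -> store id, or UNKNOWN_ID
    private final LongFunction<V> valueOf;      // store id -> value recreated on demand
    private final Set<Long> knownIds = new HashSet<>();  // a primitive long set in practice
    private final Set<V> unknownValues = new HashSet<>();

    IdBackedSet(ToLongFunction<Object> idOf, LongFunction<V> valueOf) {
        this.idOf = idOf;
        this.valueOf = valueOf;
    }

    @Override
    public boolean add(V v) {
        long id = idOf.applyAsLong(v);
        return id == UNKNOWN_ID ? unknownValues.add(v) : knownIds.add(id);
    }

    @Override
    public boolean contains(Object o) {
        long id = idOf.applyAsLong(o);
        return id == UNKNOWN_ID ? unknownValues.contains(o) : knownIds.contains(id);
    }

    @Override
    public int size() {
        return knownIds.size() + unknownValues.size();
    }

    @Override
    public Iterator<V> iterator() {
        // Values are only recreated from their ids while iterating.
        // This iterator is read-only, so remove() is not supported in this sketch.
        Stream<V> fromIds = knownIds.stream().map(id -> valueOf.apply(id));
        return Stream.concat(fromIds, unknownValues.stream()).iterator();
    }
}
```

Overriding `contains` matters here: the inherited `AbstractSet.contains` would iterate and therefore recreate every value, which is exactly the materialization the issue wants to avoid.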

This would also allow an improved serialization setup. At the moment we almost always fall back to Java serialization of the string representation of IRIs etc.

@FunctionalInterface
public interface ValueToByteArray {
  byte[] toBytes(Value v);
}

The current Java serialization approach can serve as one implementation:

public class JavaSerializerBasedValueToByteArray implements ValueToByteArray {
    @Override
    public byte[] toBytes(Value v) {
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            // Close the ObjectOutputStream first so everything is flushed into baos.
            try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
                out.writeObject(v);
            }
            return baos.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}

There would be a corresponding ByteArrayToValue interface. A store could then also supply its own pair, with the decoding side looking like:

LmdbValueFactory lvf;
(ba) -> {
    byte t = ba[0];
    if (t == 1) {
        // the bytes after the tag hold the store's internal id
        return lvf.getLazyValue(getLong(1, ba));
    } else {
        // fall back to serialization
    }
}
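The encoding side of such a pair is not shown above. A minimal sketch, assuming a one-byte tag followed by the id as an 8-byte big-endian long, could look like the following. `IdCodec`, `TAG_ID`, and `encodeId` are hypothetical names; `getLong(offset, ba)` mirrors the helper used in the snippet above, whose real implementation is not given in the issue.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec: tag byte 1 followed by the 8-byte store id. */
final class IdCodec {
    static final byte TAG_ID = 1; // assumed tag, matching the `t == 1` check above

    private IdCodec() {
    }

    /** Encodes a store-known value as 9 bytes: the tag plus the id. */
    static byte[] encodeId(long id) {
        return ByteBuffer.allocate(1 + Long.BYTES).put(TAG_ID).putLong(id).array();
    }

    /** Reads a big-endian long starting at the given offset. */
    static long getLong(int offset, byte[] ba) {
        return ByteBuffer.wrap(ba).getLong(offset);
    }
}
```

With this layout a store-known value always takes 9 bytes, regardless of how long its IRI or literal is.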

This should make the sort and group by code, among others, faster.

Preferred solution

Have a getCollectionFactory() method with a default implementation on the sail that can provide such a collection on demand.
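One possible shape for this, as a sketch only: the interface and class names below (`CollectionFactory`, `DefaultCollectionFactory`, `SailSketch`) are illustrative assumptions and not the RDF4J API. The point is that a store which does nothing gets plain Java collections, while a store such as LMDB could override the method to return id-aware ones.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical factory a store can specialize. */
interface CollectionFactory {
    <T> Set<T> createSet();
}

/** Default behaviour: ordinary in-memory Java collections. */
class DefaultCollectionFactory implements CollectionFactory {
    @Override
    public <T> Set<T> createSet() {
        return new HashSet<>();
    }
}

/** Hypothetical stand-in for the sail interface. */
interface SailSketch {
    // Stores that know how their values are implemented override this method.
    default CollectionFactory getCollectionFactory() {
        return new DefaultCollectionFactory();
    }
}
```

Query evaluation code would then ask the sail for its factory instead of calling `new HashSet<>()` directly.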

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

Improving the hash codes, which is still worth doing but is a different problem.

Anything else?

No response

JervenBolleman commented 2 years ago

While actually trying to implement this, I realized it is not sufficient. Collections that are backed by disk, e.g. the MapDB ones, need to release their resources as soon as possible. This means that these collections need to be created and maintained in a context.
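A minimal sketch of what such a context could look like, assuming the factory itself is the context and is `AutoCloseable`. `ScopedCollectionFactory` and `openCount` are hypothetical names, and clearing a `HashSet` stands in for releasing disk-backed resources.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical context that owns every collection it hands out. */
class ScopedCollectionFactory implements AutoCloseable {
    private final List<Set<?>> open = new ArrayList<>();
    private boolean closed;

    <T> Set<T> createSet() {
        if (closed) {
            throw new IllegalStateException("factory already closed");
        }
        Set<T> set = new HashSet<>(); // a disk-backed set in the MapDB case
        open.add(set);
        return set;
    }

    int openCount() {
        return open.size();
    }

    @Override
    public void close() {
        // Stand-in for releasing the disk resources behind each collection.
        for (Set<?> set : open) {
            set.clear();
        }
        open.clear();
        closed = true;
    }
}
```

Used with try-with-resources, for example `try (ScopedCollectionFactory cf = new ScopedCollectionFactory()) { ... }`, everything created during one query evaluation is released when the block exits, even if evaluation fails.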

JervenBolleman commented 2 years ago

Pushed a nice idea regarding the slow group by for the LMDB store. Pushed it to GitHub quickly before my laptop battery dies.