eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

LMDB: Implement extensible ID scheme #4950

Open kenwenzel opened 5 months ago

kenwenzel commented 5 months ago

Problem description

The LmdbStore uses 64-bit IDs for values. The scheme is fixed and uses the lower two bits to encode the type of the referenced value:

To support RDF-star (#3723) and embedded values (#4774), a new scheme that is also extensible for future requirements should be developed.

Preferred solution

The following basic scheme could be used:

Inspired by Jena, the following detailed encoding can be used:

see also https://github.com/apache/jena/blob/02ecb71c7033dc09cd929474c9884045dfaa9dc1/jena-tdb2/src/main/java/org/apache/jena/tdb2/store/NodeIdType.java#L87

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

No response

Anything else?

No response

nguyenm100 commented 3 weeks ago

Hi @kenwenzel, curious whether this is still on your radar and what's left to do here. We're hoping to move to 5.x in a few weeks. Thanks.

kenwenzel commented 1 week ago

@nguyenm100 I have it on my radar. While the ID scheme is finished, the type-specific conversion logic (integers, doubles, etc.) is still missing. This needs some time and careful testing. We also need to find a good way to handle cases like "0312"^^xsd:int vs. "312"^^xsd:int. Both literals have the same integer value, but they are not regarded as equal because their labels differ. If we embed such a value, we have to make sure that decoding it always yields the original label. This means that "0312"^^xsd:int can't be embedded into an ID, while "312"^^xsd:int can.
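
A minimal sketch of the round-trip check described above, in plain Java. The class and method names (IntEmbedding, tryEmbed) are hypothetical and not part of RDF4J or the LmdbStore; the sketch only decides whether a label is safe to embed and does not show how the value would be packed into an ID.

```java
import java.util.OptionalLong;

// Hypothetical helper, not part of RDF4J: decides whether an xsd:int label
// can be embedded as its numeric value without losing the original label.
final class IntEmbedding {

    // Returns the int value if the label round-trips exactly, otherwise empty.
    static OptionalLong tryEmbed(String label) {
        final int value;
        try {
            value = Integer.parseInt(label);
        } catch (NumberFormatException e) {
            return OptionalLong.empty(); // not parseable as an int at all
        }
        // Embed only if decoding would reproduce exactly the same label.
        // "0312" parses to 312, which prints as "312", so it is rejected.
        if (!Integer.toString(value).equals(label)) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(value);
    }

    public static void main(String[] args) {
        System.out.println(tryEmbed("312").isPresent());  // true
        System.out.println(tryEmbed("0312").isPresent()); // false
    }
}
```

Labels that fail the check (leading zeros, an explicit "+" sign, "-0") would presumably fall back to the regular, non-embedded value storage.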

hmottestad commented 1 week ago

We do have literal normalisation during RDF parsing (BasicParserSettings.NORMALIZE_DATATYPE_VALUES). We don't have this at the sail level for the MemoryStore or the NativeStore, but I know that other triplestores have this feature.

You could make your embedding feature contingent on normalised data. Maybe configurable at the sail level but defaults to normalisation.
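
As a rough illustration of that suggestion, the sketch below shows a store-level switch that rewrites integer labels to a canonical form before they are stored, so that every stored integer label would round-trip through an embedded ID. IntLabelPolicy and labelToStore are hypothetical names, not existing RDF4J or sail configuration, and only integer-valued labels are handled.

```java
import java.math.BigInteger;

// Hypothetical sketch, not an RDF4J API: a store-level switch that rewrites
// integer labels to a canonical form before they are stored.
final class IntLabelPolicy {
    private final boolean normalize;

    IntLabelPolicy(boolean normalize) {
        this.normalize = normalize;
    }

    // Returns the label that would actually be stored.
    String labelToStore(String label) {
        if (!normalize) {
            return label; // keep the label exactly as supplied
        }
        try {
            // Canonical decimal form: "0312" and "+312" both become "312".
            return new BigInteger(label).toString();
        } catch (NumberFormatException e) {
            return label; // leave invalid lexical forms untouched
        }
    }
}
```

With normalisation on, new IntLabelPolicy(true).labelToStore("0312") returns "312"; with it off, the original label is preserved and would have to be stored without embedding. The trade-off is that normalisation changes the label the user supplied.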