eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

LMDB: Embed literal values into IDs #4774

Open kenwenzel opened 1 year ago

kenwenzel commented 1 year ago

Problem description

The LMDB ValueStore currently assigns every value a 64-bit ID and stores the corresponding ID -> value and value -> ID mappings. This is unnecessary overhead for values that could be encoded in fewer than 64 bits. Possible candidates are:

Preferred solution

Use something along the lines of Jena's TDB2 encodings for literals: https://github.com/apache/jena/tree/main/jena-tdb2/src/main/java/org/apache/jena/tdb2/store/value
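
To make the idea concrete, here is a minimal sketch of how a small integer could be embedded directly in a 64-bit ID instead of going through the ID -> value mapping. The bit layout (lowest bit as an "inline" flag, remaining 63 bits as payload) and all names such as `InlineIdSketch` are assumptions made for illustration. This is neither Jena's TDB2 layout nor existing RDF4J code, and a real scheme would presumably reserve further bits to distinguish datatypes.

```java
// Hypothetical sketch: not Jena's TDB2 encoding and not part of RDF4J.
final class InlineIdSketch {
    // Assumed layout: bit 0 set   -> the upper 63 bits hold the value itself.
    //                 bit 0 clear -> the upper 63 bits hold a regular store ID.
    private static final long INLINE_FLAG = 1L;

    // Range of a signed 63-bit payload.
    static final long MAX_INLINE = (1L << 62) - 1;
    static final long MIN_INLINE = -(1L << 62);

    private InlineIdSketch() {
    }

    /** Returns true if the value fits into the 63-bit inline payload. */
    static boolean fitsInline(long value) {
        return value >= MIN_INLINE && value <= MAX_INLINE;
    }

    /** Packs a small integer directly into an ID. */
    static long encodeInline(long value) {
        if (!fitsInline(value)) {
            throw new IllegalArgumentException("Value does not fit in 63 bits: " + value);
        }
        return (value << 1) | INLINE_FLAG;
    }

    /** Wraps a conventional store ID (one that still needs a lookup). */
    static long encodeReference(long storeId) {
        if (storeId < 0 || storeId > MAX_INLINE) {
            throw new IllegalArgumentException("Store ID out of range: " + storeId);
        }
        return storeId << 1;
    }

    /** Returns true if the ID carries its value inline. */
    static boolean isInline(long id) {
        return (id & INLINE_FLAG) != 0;
    }

    /** Recovers an inlined integer; the arithmetic shift restores the sign. */
    static long decodeInline(long id) {
        if (!isInline(id)) {
            throw new IllegalArgumentException("Not an inline ID: " + id);
        }
        return id >> 1;
    }

    /** Recovers the store ID from a reference ID. */
    static long decodeReference(long id) {
        if (isInline(id)) {
            throw new IllegalArgumentException("Not a reference ID: " + id);
        }
        return id >>> 1;
    }
}
```

With a layout like this, a value store would call `fitsInline` first and only fall back to allocating a regular ID (wrapped with `encodeReference`) when the value is too large, so inlined values never touch the two mapping tables.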

Are you interested in contributing a solution yourself?

Perhaps?

Alternatives you've considered

No response

Anything else?

No response

kenwenzel commented 1 year ago

@abrokenjester Is it possible to copy code from Jena which is licensed under Apache License and add this code to RDF4J?

abrokenjester commented 1 year ago

That is a little difficult. Technically the Apache license allows that, provided we make it clear what parts are original and modified, and keep the original copyright headers intact.

However, we don't do this anywhere else in the project, as far as I'm aware, so making sure we set it up correctly will be non-trivial. I'm also not sure what the Eclipse Foundation's view on this is.

I have generally preferred not doing this, just to keep the legal/attribution side of things simple.

If you were to do something like this, I think we need to establish what exactly it should look like in code. The Eclipse RDF4J copyright header would still be at the top of the file, but I think directly beneath that we need another header comment to establish the attribution and the original copyrights. I am not entirely sure what that should look like.

@hmottestad are you aware of any other places in the RDF4J code base where we are doing something like this?

hmottestad commented 1 year ago

I'm not aware that we've copied any code verbatim. We do depend on some Jena stuff for some of the SHACL tests, in order to use the reference implementation.

What code exactly is it you're considering @kenwenzel? Maybe we can take inspiration from it and write something of our own, instead of copying it?

kenwenzel commented 1 year ago

@hmottestad Basically the files contained in the following directory along with some helper classes (BitsInt, BitsLong): https://github.com/apache/jena/tree/main/jena-tdb2/src/main/java/org/apache/jena/tdb2/store/value

I am not sure if it is easier to copy the code than to reimplement it?!

hmottestad commented 1 year ago

I meant to reply to this. I was wondering if we could create a Maven module with classes that extend all the ones you want to include, then use Java 9 modules to only export those classes.

On the Maven side we should exclude basically all dependencies.

Do you think that would work?

reckart commented 11 months ago

Maybe you could shade the Jena artifact instead of copying the code? That would still require mentioning the source of the classes in the NOTICE file and the license in the LICENSE file of the respective module/JAR, though.