apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

NodeValue needlessly materializes lexical forms of non-XSD datatypes #1801

Closed Aklakan closed 1 year ago

Aklakan commented 1 year ago

Change

NodeValue's _setByValue method only handles XSD datatypes; however, it eagerly materializes the lexical form even for datatypes outside the XSD namespace. This introduces a noticeable performance overhead when working with datatype extensions such as geometries or JSON objects that are only used as intermediate values. With my current workload of many small JSON objects, the overhead is around 5-10%.

NodeValue itself bears the following comment:

  1. Conversely, delaying turning a value into a graph node is valuable because intermediates, like the result of 2+3, will not be needed as nodes unless assignment (and there is no assignment in SPARQL even if there is for ARQ). Node level operations like str() don't need a full node.

The simple solution is to defer materialization of the lexical form until after it has been confirmed that the given Node has a datatype in the XSD namespace.
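To illustrate the idea, here is a minimal, self-contained sketch. It does not use Jena's classes: `LazyLiteral`, `valueEager`, `valueDeferred` and `parseXsd` are hypothetical stand-ins invented for this example, and the `materializations` counter exists only to make the cost visible. The only point is the ordering of the datatype check and the lexical form access.

```java
import java.math.BigInteger;
import java.util.function.Supplier;

/** Sketch only: hypothetical stand-in types, not Jena's NodeValue or LiteralLabel. */
final class DeferredLexicalFormSketch {

    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema#";

    /** A literal whose lexical form is serialized on demand from a Java value. */
    static final class LazyLiteral {
        private final String datatypeUri;
        private final Supplier<String> serializer;
        private String lexicalForm;   // cached after the first materialization
        int materializations = 0;     // demonstration counter (assumption, not a Jena field)

        LazyLiteral(String datatypeUri, Supplier<String> serializer) {
            this.datatypeUri = datatypeUri;
            this.serializer = serializer;
        }

        String getDatatypeUri() {
            return datatypeUri;
        }

        String getLexicalForm() {
            if (lexicalForm == null) {
                lexicalForm = serializer.get();   // potentially expensive, e.g. JSON writing
                materializations++;
            }
            return lexicalForm;
        }
    }

    /** Eager variant: pays for the lexical form before knowing whether it is needed. */
    static Object valueEager(LazyLiteral lit) {
        String lex = lit.getLexicalForm();
        if (!lit.getDatatypeUri().startsWith(XSD_NS)) {
            return null;   // not an XSD datatype: the lexical form was built for nothing
        }
        return parseXsd(lit.getDatatypeUri(), lex);
    }

    /** Deferred variant: checks the namespace first, serializes only for XSD datatypes. */
    static Object valueDeferred(LazyLiteral lit) {
        if (!lit.getDatatypeUri().startsWith(XSD_NS)) {
            return null;   // extension datatype: lexical form is never touched
        }
        return parseXsd(lit.getDatatypeUri(), lit.getLexicalForm());
    }

    /** Toy parser covering a single XSD type; everything else is returned as a string. */
    static Object parseXsd(String datatypeUri, String lex) {
        if (datatypeUri.equals(XSD_NS + "integer")) {
            return new BigInteger(lex);
        }
        return lex;
    }
}
```

With a literal of a non-XSD datatype, `valueDeferred` returns without ever invoking the serializer, whereas `valueEager` invokes it once and discards the result.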

As a question, I wonder if it is really necessary for _setByValue to always go via the lexical form for all XSD types, or whether as a future improvement it would be possible to reuse the LiteralLabel's Java object.

Profile without enhancement: image

Profile with enhancement. Note that JsonWriter.string() no longer appears: image

Are you interested in contributing a pull request for this task?

Yes

afs commented 1 year ago

The PR looks OK.

Another approach for extension datatypes is to implement NodeValue.

As to generally using the Node object value: the Node value and the NodeValue serve different purposes.

The Node value exists for the API, which makes it very rigid in order to ensure Model API compatibility.

IMO the best change would be to not keep a value with the Node at all and do the mapping to/from Java in the API code.

Aklakan commented 1 year ago

I also implemented a NodeValueJson class in addition to the RDFDatatypeJson, precisely to delay going to Node for as long as possible during SPARQL evaluation. However, during evaluation Node and NodeValue are converted back and forth when evaluating expressions and placing the results back into bindings. I am not sure how that could be handled efficiently if Node no longer held a value, unless it held the value indirectly via NodeOverNodeValue (which extends Node).

afs commented 1 year ago

A different literal label implementation?

There are other cases for carrying information around with Nodes generally -- TDB NodeIds for example.