apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

Sub-par concurrent read performance with jena-iri #1470

Open Aklakan opened 2 years ago

Aklakan commented 2 years ago

Version

4.6.0-SNAPSHOT

What happened?

I started again looking into the issues I had with Jena in Spark settings; related to https://issues.apache.org/jira/browse/JENA-2309

Right now I am investigating some long-standing performance issues where concurrent processing time does not scale directly with the number of cores. Concretely, I am comparing our spark+jena4-based tarql re-implementation with the original tarql (jena2).

One culprit is the jena-iri package, which uses synchronized singleton lexers; these introduce locking overhead between the worker threads. A quick fix is to make those lexers thread-local, which reduces the overhead. On my notebook, in power-save and performance mode, I get these improvements:

jena-4.6.0-SNAPSHOT: power save: 68 sec, performance: 21 sec

thread-local-fix: power save: 54 sec, performance: 19 sec

Profiler output (relevant column is the number of waits): image
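To illustrate the kind of change meant by the thread-local fix, here is a minimal sketch. `ToyLexer` and `Lexers` are hypothetical stand-ins, not the actual jena-iri classes; the only assumption is that the lexer keeps mutable scratch state and is therefore not safe to share without a lock.

```java
// Hypothetical stand-in for a stateful lexer; NOT the jena-iri implementation.
final class ToyLexer {
    private final StringBuilder buffer = new StringBuilder(); // mutable scratch state

    // Returns the scheme part of an IRI string, or "" if there is no ':'.
    String scheme(String iri) {
        buffer.setLength(0);
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c == ':') {
                return buffer.toString();
            }
            buffer.append(c);
        }
        return "";
    }
}

final class Lexers {
    // Before: one shared instance, so every caller serialises on the same monitor.
    private static final ToyLexer SHARED = new ToyLexer();

    static String schemeSynchronized(String iri) {
        synchronized (SHARED) {
            return SHARED.scheme(iri);
        }
    }

    // After: one instance per thread, so no lock is needed on the hot path.
    private static final ThreadLocal<ToyLexer> PER_THREAD =
            ThreadLocal.withInitial(ToyLexer::new);

    static String schemeThreadLocal(String iri) {
        return PER_THREAD.get().scheme(iri);
    }
}
```

Both entry points return the same results; the thread-local variant trades one lexer instance per worker thread for the removal of the shared monitor.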

A related issue I am currently investigating is that a lot of time is spent in the IRI parsing machinery, e.g. via E_IRI. For testing, I changed it to return the argument as given, which reduced the total processing time (in performance mode) from 19 to 13 seconds. That is around 30% of the time, predominantly spent in the jena-iri lexers. I am not yet sure, however, whether anything can be optimized without compromising functionality.

Are you interested in making a pull request?

Yes

afs commented 2 years ago

Are you calling jena-iri directly?

1/ (repeated from JENA-2309) IRIx is an abstraction layer for replaceable IRI implementations.

One such IRI3986 implementation is https://github.com/afs/x4ld/tree/main/iri4ld . It does minimal object creation (one object to record the results per parser call), and RFC3986.create is thread-safe.

Other implementations can be plugged in.

2/ The parser pipeline uses a cache to avoid duplicate work; with it, IRI processing goes from being the significant cost to no longer being the primary cost when parsing on a single thread.

https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/system/FactoryRDFCaching.java#L62 which incidentally has the benefit of reducing the memory footprint (IIRC by about 1/3). Maybe that works in E_IRI.

FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0

Aklakan commented 2 years ago

Adding a cache to E_IRI/IRIx should be simple and I can check how much this improves.
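As a rough idea of what such a cache could look like, here is a minimal sketch of a bounded LRU memoiser in plain Java. `LruMemo` is a hypothetical helper, not part of Jena, and it is deliberately not thread-safe: the assumption is one instance per parser run or per worker thread, so the cache itself adds no new contention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: a small LRU memoiser for an expensive String -> V function.
// Not thread-safe by design; assumed to be used by one thread (or one parser run) only.
final class LruMemo<V> {
    private final Function<String, V> expensive;
    private final Map<String, V> cache;
    private int misses = 0;

    LruMemo(int maxSize, Function<String, V> expensive) {
        this.expensive = expensive;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxSize; // evict once the bound is exceeded
            }
        };
    }

    // Returns the cached value, computing and storing it on a miss.
    V get(String key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = expensive.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    int misses() {
        return misses;
    }
}
```

For example, `new LruMemo<java.net.URI>(1000, java.net.URI::create)` would parse each distinct string once while it stays in the cache; in the Jena case the expensive function would be whatever IRI resolution E_IRI performs.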

How does the iri4ld implementation differ from jena's current default one functionality-wise? In any case, having less (needless) synchronization between threads is always better.

> FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0

Good to know that it's possible to compare the performance of Spark-based tarql to the original tarql within jena4! :) Especially because then the same IRI machinery is used.

In addition, I noticed that E_BNode also causes waits due to synchronization in a SecureRandom instance. This is probably better handled as a separate issue, but for now I just wanted to document it here. My Spark job's runtime (using a test mapping without iri()) jumps from ~4.5 to ~10 seconds just by adding a dummy bnode() call:

CONSTRUCT { <urn:example:s> <urn:example:p> ?a, ?b, ?c } # ... 16 columns in total
FROM <file:data.csv>
WHERE { BIND(bnode(?a) AS ?foobar) }

The same job with tarql/jena2 takes somewhere between 50 and 60 sec; with bnode it seems to tend more towards 60 sec, so in single-threaded processing the effect is less visible. It seems that threads competing for the bnode call is also a bottleneck.

afs commented 2 years ago

> How does the iri4ld implementation differ from jena's current default one functionality-wise?

Javadoc has the operations described: https://github.com/afs/x4ld/blob/main/iri4ld/src/main/java/org/seaborne/rfc3986/RFC3986.java

A Jena IRIProvider: https://gist.github.com/afs/a0bf740d1bd1fde283eabeab8b4ddb67

It is a Java-coded parser for RFC 3986. The parser is a single file (IRI3986), written with efficiency in mind. No sub-parsers or tokenizers.

jena-iri is a general system for IRIs. It is complicated to build.

iri4ld is simple to build and provides the operations needed for linked data. Like jena-iri, it is independent of the Jena RDF codebase. iri4ld has less in the way of extras not used by Jena.

The parser is IRI3986.java. It handles all URIs, except that it works on Java Unicode strings, so it effectively covers RFC 3987.

It has some additional scheme-specific rule support for the common schemes: it covers "http:", "https:", "did:", "file:", "urn:uuid:", "urn:", "uuid:" (which is not official) and "example:" (RFC 7595).

afs commented 2 years ago

The parsers generate blank nodes by allocating a UUID once at the start of a parser run, then xor'ing the label into the random number. Unlabelled blank nodes get a not-writable label (it has a 0 byte in it) allocated from a counter.
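The scheme described above can be sketched roughly as follows. This is an illustration, not Jena's actual code: `BNodeAllocator` is a hypothetical class, and the use of an MD5-based name UUID to spread the label over 128 bits is an assumption made for the example. The point is that the random source is consulted once per run rather than once per blank node.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch of per-run blank node id allocation; NOT Jena's implementation.
final class BNodeAllocator {
    private final long seedHi;
    private final long seedLo;
    private long counter = 0;

    BNodeAllocator() {
        // The only call into the random generator: once per parser run.
        UUID seed = UUID.randomUUID();
        this.seedHi = seed.getMostSignificantBits();
        this.seedLo = seed.getLeastSignificantBits();
    }

    // Labelled blank node: the same label within one run always yields the same id.
    String forLabel(String label) {
        // Assumption: hash the label to 128 bits via a name-based UUID, then xor with the seed.
        UUID h = UUID.nameUUIDFromBytes(label.getBytes(StandardCharsets.UTF_8));
        return new UUID(seedHi ^ h.getMostSignificantBits(),
                        seedLo ^ h.getLeastSignificantBits()).toString();
    }

    // Unlabelled blank node: the label comes from a counter and starts with a 0 char,
    // which cannot be written as a label in the input, so it never clashes with a parsed one.
    String fresh() {
        return forLabel("\0" + (counter++));
    }
}
```

An allocator like this is used by a single parser run, so the counter needs no synchronization; two runs get different seeds and therefore different ids for the same label.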

afs commented 2 years ago

IRIx is not the place to put a cache. IRIx is general IRI machinery for any purpose.

The session is provided by a FactoryRDF (FactoryRDFCaching extends FactoryRDFStd implements FactoryRDF). The cache is then of NodeURIs.