Open Aklakan opened 2 years ago
Are you calling jena-iri directly?
1/ (repeated from JENA-2309) IRIx is an abstraction layer for replaceable IRI implementations.
One such IRI3986 implementation is https://github.com/afs/x4ld/tree/main/iri4ld .
Minimal object creation: one object records the results per parser call, and RFC3986.create
is thread-safe.
Other implementations can be plugged in.
2/ The parser pipeline uses a cache to avoid duplicate work: with it, IRI processing drops from being the dominant cost to a secondary one when parsing on a single thread.
https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/system/FactoryRDFCaching.java#L62
which incidentally has the benefit of reducing memory footprint (IIRC by about a third). Maybe that works in E_IRI.
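To illustrate the caching idea in isolation: the sketch below is not Jena's actual FactoryRDFCaching code; the class name, the `parse` placeholder, and the cache size are all made up for illustration. The point is that repeated IRI strings hit the cache instead of the parser, and cached entries share one object, which is also where the memory saving comes from.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache over an expensive parse step. Repeated inputs are
// answered from the cache; identical IRI strings share one result object.
final class IriCache {
    private static final int MAX = 5000; // assumed size, not Jena's default

    // Access-ordered LinkedHashMap gives simple LRU eviction.
    private final Map<String, String> cache =
        new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX;
            }
        };

    // Stand-in for the real, expensive IRI parse and checks.
    private String parse(String iri) {
        return iri.trim();
    }

    public synchronized String resolve(String iri) {
        return cache.computeIfAbsent(iri, this::parse);
    }
}
```

Note the cache itself is synchronized here for simplicity; a per-thread or concurrent cache would avoid reintroducing the very contention under discussion.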
FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0
Adding a cache to E_IRI/IRIx should be simple and I can check how much this improves.
How does the iri4ld implementation differ from Jena's current default one functionality-wise? In any case, less (needless) synchronization between threads is always better.
FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0
Good to know that it's possible to compare the performance of Spark-based tarql to the original tarql within Jena 4! :) Especially because then the same IRI machinery is used.
In addition, I noticed that E_BNode also causes waits due to synchronization in a SecureRandom instance. This is probably better handled as a separate issue, but for now I just wanted to document it here. My Spark job's runtime (using a test mapping without iri()) jumps from ~4.5 to ~10 seconds merely by adding a dummy bnode() call:
CONSTRUCT { <urn:example:s> <urn:example:p> ?a, ?b, ?c } # ... 16 columns in total
FROM <file:data.csv>
WHERE { BIND(bnode(?a) AS ?foobar) }
The same job with tarql/jena2 executes somewhere between 50-60 sec, tending more towards 60 sec with bnode - so in single-threaded processing the effect is less visible. It seems that threads competing for the bnode call are also a bottleneck.
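One common way to sidestep SecureRandom lock contention is to call SecureRandom only once per thread, for seeding, and then generate labels from fast thread-local state. The sketch below is a hedged illustration of that pattern, not Jena's E_BNode code; `BNodeLabels` and its label format are invented here.

```java
import java.security.SecureRandom;
import java.util.concurrent.ThreadLocalRandom;

// Avoids per-call SecureRandom contention: SecureRandom.nextBytes is
// synchronized in common JDK providers, so many threads calling it per
// blank node serialize on one lock. Here each thread seeds itself once.
final class BNodeLabels {
    // One SecureRandom interaction per thread, at initialization only.
    private static final ThreadLocal<long[]> SEED = ThreadLocal.withInitial(() -> {
        SecureRandom sr = new SecureRandom();
        return new long[] { sr.nextLong(), sr.nextLong() };
    });

    // Mix the per-thread secure seed with a fast, lock-free generator.
    static String fresh() {
        long[] seed = SEED.get();
        long hi = seed[0] ^ ThreadLocalRandom.current().nextLong();
        long lo = seed[1] ^ ThreadLocalRandom.current().nextLong();
        return String.format("b%016x%016x", hi, lo);
    }
}
```

The trade-off is that per-label randomness now comes from a non-cryptographic generator; whether that is acceptable depends on what guarantees blank-node labels are supposed to have.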
How does the iri4ld implementation differ from jena's current default one functionality-wise?
Javadoc has the operations described: https://github.com/afs/x4ld/blob/main/iri4ld/src/main/java/org/seaborne/rfc3986/RFC3986.java
A Jena IRIProvider: https://gist.github.com/afs/a0bf740d1bd1fde283eabeab8b4ddb67
It is a Java-coded parser for RFC 3986. The parser is a single file (IRI3986), written with efficiency in mind. No sub-parsers or tokenizers.
jena-iri is a general system for IRIs. It is complicated to build.
iri4ld is simple to build and provides the operations needed for linked data. Like jena-iri, it is independent of the Jena RDF codebase. iri4ld has less in the way of extras not used by Jena.
The parser is IRI3986.java - it handles all URIs (and, because it works on Java Unicode strings, RFC 3987 IRIs as well).
It has some additional scheme-specific rule support for the common schemes: it covers "http:", "https:", "did:", "file:", "urn:uuid:", "urn:", "uuid:" (which is not official) and "example:" (RFC 7595).
The parsers generate blank nodes by allocating a UUID once at the start of a parser run, then xor'ing the label into the random number. Unlabelled blank nodes get a not-writable label (it has a 0 byte in it) allocated from a counter.
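A rough sketch of that allocation scheme as I read the description; every name and detail below is an assumption for illustration, not the actual parser code:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Assumed scheme: a UUID is allocated once per parser run; labelled blank
// nodes xor their label into that random value (stable within a run, fresh
// across runs); unlabelled blank nodes draw from a counter and carry a
// 0 byte so they can never collide with a writable label.
final class BlankNodeAllocator {
    private final long seedHi, seedLo;           // per-run random seed
    private final AtomicLong counter = new AtomicLong();

    BlankNodeAllocator() {
        UUID uuid = UUID.randomUUID();
        this.seedHi = uuid.getMostSignificantBits();
        this.seedLo = uuid.getLeastSignificantBits();
    }

    // Labelled blank node: fold the label bytes into the run seed.
    String labelled(String label) {
        long h = 0;
        for (byte b : label.getBytes(StandardCharsets.UTF_8))
            h = h * 31 + (b & 0xff);
        return String.format("%016x%016x", seedHi ^ h, seedLo);
    }

    // Unlabelled blank node: counter-based, marked with a '\0' byte.
    String fresh() {
        return "\0" + counter.getAndIncrement();
    }
}
```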
IRIx is not the place to put a cache. IRIx is general IRI machinery for any purpose.
The session is provided by a FactoryRDF (FactoryRDFCaching extends FactoryRDFStd implements FactoryRDF). The cache is then of NodeURIs.
Version
4.6.0-SNAPSHOT
What happened?
I have started looking again into the issues I had with Jena in Spark settings; related to https://issues.apache.org/jira/browse/JENA-2309
Right now I am investigating some long-standing performance issues where concurrent processing time does not scale directly with the number of cores. Concretely, I am comparing our Spark+Jena-4-based tarql re-implementation with the original tarql (Jena 2).
One culprit is the jena-iri package, which uses synchronized singleton lexers that introduce locking overhead between the worker threads. A quick fix is to make those lexers thread-local, which reduces the overhead. On my notebook in power save and performance mode I get these improvements:
jena-4.6.0-SNAPSHOT: power save: 68 sec, performance: 21 sec
thread-local-fix: power save: 54 sec, performance: 19 sec
Profiler output (relevant column is the number of waits):
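The thread-local fix described above follows a standard pattern; the sketch below uses a stand-in `Lexer` class (not the actual jena-iri lexer) just to show the shape of the change:

```java
// Replacing a shared, synchronized lexer singleton with one instance per
// worker thread removes the per-parse lock on the hot path.
final class ThreadLocalLexer {
    // Stand-in for a non-thread-safe, expensive-to-construct lexer.
    static final class Lexer {
        int tokens(String input) { return input.split("\\s+").length; }
    }

    // Before: a single static Lexer guarded by synchronized blocks,
    // contended across threads.
    // After: each thread lazily gets its own instance, no locking.
    private static final ThreadLocal<Lexer> LEXER =
        ThreadLocal.withInitial(Lexer::new);

    static int lex(String input) {
        return LEXER.get().tokens(input);
    }
}
```

The cost is one lexer instance per live thread; in a pooled-executor setting like Spark that is a small, bounded overhead.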
A related issue I am currently investigating is that a lot of time is spent in the IRI parsing machinery, e.g. via E_IRI. For testing, I changed it to return the argument as given, which reduced the total processing time (in performance mode) from 19 to 13 seconds - around 30% - time that is predominantly spent in the jena-iri lexers. I am not yet sure, however, whether anything there can be optimized without compromising functionality.
Are you interested in making a pull request?
Yes