Open arne-bdt opened 4 days ago
ARP1 isn't relevant.
- removed httpClient from org.apache.jena.riot.RDFParserBuilder and org.apache.jena.riot.RDFParser, which took quite some time during initialization. --> now org.apache.jena.http.HttpEnv#getDftHttpClient is called from org.apache.jena.riot.RDFParser#openTypedInputStream only if needed. HttpEnv also holds a static reference, so that should be fine.
That is removing a significant capability.
Which part of initialization? General system startup, or specific to RDF/XML parsing after JenaSystem.init has happened?
ARP1 isn't relevant.
I did not change anything for the deprecated ARP-variants.
It was just for reference. (In your email, you measured riot --syntax arp1 --time --count --sink citations.rdf.)
- removed httpClient from org.apache.jena.riot.RDFParserBuilder and org.apache.jena.riot.RDFParser, which took quite some time during initialization. --> now org.apache.jena.http.HttpEnv#getDftHttpClient is called from org.apache.jena.riot.RDFParser#openTypedInputStream only if needed. HttpEnv also holds a static reference, so that should be fine.
That is removing a significant capability.
Which part of initialization? General system startup, or specific to RDF/XML parsing after JenaSystem.init has happened?
see https://github.com/apache/jena/pull/2744#discussion_r1780045344
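To make the lazy-initialization idea from this exchange concrete, here is a minimal sketch in plain Java. It is not the Jena code: LazyHttpClientSketch, its Holder class and the open method are hypothetical names, and java.net.http.HttpClient.newHttpClient() merely stands in for HttpEnv.getDftHttpClient(). It assumes the default client is only needed when a remote source is actually opened.

```java
import java.net.http.HttpClient;

/** Hypothetical stand-in for a parser that needs an HTTP client only for remote sources. */
class LazyHttpClientSketch {
    /** Counts how often the shared default client was created (illustration only). */
    static int defaultClientCreations = 0;

    /** Holder idiom: DEFAULT is created when Holder is first accessed, not earlier. */
    private static final class Holder {
        static final HttpClient DEFAULT = createDefault();
    }

    private static HttpClient createDefault() {
        defaultClientCreations++;
        return HttpClient.newHttpClient();
    }

    /** A client supplied by the caller; null means "fall back to the shared default". */
    private final HttpClient customClient;

    LazyHttpClientSketch(HttpClient customClient) {
        this.customClient = customClient;
    }

    /** Reached only on the code path that opens an http(s) source. */
    HttpClient clientForRemoteRead() {
        return customClient != null ? customClient : Holder.DEFAULT;
    }

    /** Local sources return early and never touch the HTTP client. */
    String open(String source) {
        if (source.startsWith("http://") || source.startsWith("https://")) {
            HttpClient client = clientForRemoteRead();
            return client == customClient ? "remote, custom client" : "remote, default client";
        }
        return "local file";
    }

    public static void main(String[] args) {
        LazyHttpClientSketch parser = new LazyHttpClientSketch(null);
        System.out.println(parser.open("citations.rdf"));
        System.out.println("default clients created: " + defaultClientCreations);
        System.out.println(parser.open("https://example.org/data.rdf"));
        System.out.println("default clients created: " + defaultClientCreations);
    }
}
```

In this sketch a caller can still pass a custom client, so that capability stays. Only the creation of the default one is deferred, and it is shared through a static reference once created.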
Version
5.2.0-SNAPSHOT
Feature
Profiling shows that resolving IRIs takes a lot of time when parsing RDF/XML. (Parsers: RRX.RDFXML_SAX, RRX.RDFXML_StAX_ev, RRX.RDFXML_StAX_sr)
There were two main things I could do about it:
- added a cache (org.apache.jena.atlas.lib.cache.CacheSimple) in the parsers where the already cached org.apache.jena.riot.system.ParserProfileStd#resolver is not applicable --> in some cases this gave me another 10%
- made org.apache.jena.riot.RDFParserBuilder#httpHeader available again by uncommenting it
- moved the initialization of httpClient with HttpEnv.getDftHttpClient() from org.apache.jena.riot.RDFParserBuilder#build to org.apache.jena.riot.RDFParser#openTypedInputStream in the case when the RDFParserBuilder has not been initialized with a custom httpClient. That way, HttpEnv is not necessarily initialized when reading files but only when needed.
Some benchmarks with my improvements vs. Jena 5.1.0:
Here are benchmarks with citations.rdf and bsbm-5m (converted into RDF/XML):
Are you interested in contributing a solution yourself?
Yes
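For illustration, the IRI cache mentioned under Feature could look roughly like the sketch below. It is plain Java and does not use Jena: CachedIriResolverSketch is a hypothetical class, java.net.URI#resolve stands in for Jena's IRI resolver, and the fixed-size, overwrite-on-clash slot table is an assumption chosen for simplicity rather than a copy of CacheSimple.

```java
import java.net.URI;

/** Hypothetical sketch: memoize relative-IRI resolution against a single base IRI. */
class CachedIriResolverSketch {
    private final URI base;
    private final String[] keys;    // slot -> relative IRI string last stored there
    private final String[] values;  // slot -> resolved form of that string
    int misses = 0;                 // resolutions actually computed (illustration only)

    CachedIriResolverSketch(String baseIri, int slots) {
        this.base = URI.create(baseIri);
        this.keys = new String[slots];
        this.values = new String[slots];
    }

    String resolve(String relative) {
        // The hash code addresses a slot; floorMod keeps the index non-negative.
        int slot = Math.floorMod(relative.hashCode(), keys.length);
        if (relative.equals(keys[slot])) {
            return values[slot];    // hit: no parsing and no resolution work
        }
        String resolved = base.resolve(relative).toString(); // the costly step
        misses++;
        keys[slot] = relative;      // on a slot clash the previous entry is overwritten
        values[slot] = resolved;
        return resolved;
    }

    public static void main(String[] args) {
        CachedIriResolverSketch resolver =
                new CachedIriResolverSketch("http://example.org/data/", 1024);
        System.out.println(resolver.resolve("doc1"));
        System.out.println(resolver.resolve("doc1"));
        System.out.println("resolutions computed: " + resolver.misses);
    }
}
```

A cache like this only pays off when the same IRI strings recur often, which is typical for property and class IRIs in RDF/XML. The gain would have to be measured per parser, as with the figures reported above.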