apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

Faster parsing of RDF/XML by avoiding duplicated resolving of IRIs and adding cache for IRIx in parsers #2740

Open arne-bdt opened 4 days ago

arne-bdt commented 4 days ago

Version

5.2.0-SNAPSHOT

Feature

Profiling shows that resolving IRIs takes a lot of time when parsing RDF/XML. (Parsers: RRX.RDFXML_SAX, RRX.RDFXML_StAX_ev, RRX.RDFXML_StAX_sr)

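There were two main things I could do about it, both named in the title: avoiding duplicated resolving of IRIs and adding a cache for IRIx in the parsers.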
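
To make the caching idea from the title concrete, here is a minimal sketch. It is not the code from the proposed change: the class name CachingIriResolver is hypothetical, java.net.URI stands in for Jena's IRIx, and the LRU size is an arbitrary assumption. It only shows the general technique of remembering the resolved form of an IRI string so that repeated occurrences skip the resolution step.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not Jena code): memoizes the result of resolving
 * an IRI string against a fixed base. java.net.URI is a stand-in for IRIx.
 * Not thread-safe; a parser instance would own its own cache.
 */
final class CachingIriResolver {
    private final URI base;
    private final Map<String, URI> cache;
    private int misses = 0;

    CachingIriResolver(URI base, int maxEntries) {
        this.base = base;
        // Access-ordered LinkedHashMap: evicts the least recently used entry
        // once the cache grows beyond maxEntries.
        this.cache = new LinkedHashMap<String, URI>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, URI> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Resolves iriStr against the base, reusing an earlier result if present. */
    URI resolve(String iriStr) {
        URI cached = cache.get(iriStr);
        if (cached != null) {
            return cached;
        }
        misses++;                            // the expensive path runs only on a miss
        URI resolved = base.resolve(iriStr);
        cache.put(iriStr, resolved);
        return resolved;
    }

    /** Number of times the underlying resolution actually ran. */
    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        CachingIriResolver r =
            new CachingIriResolver(URI.create("http://example.org/base/"), 1000);
        r.resolve("#_abc");
        r.resolve("#_abc");                  // served from the cache
        System.out.println("misses = " + r.misses());
    }
}
```

Whether such a cache pays off depends on how often the same IRI strings recur in the input, so the benefit would need to be measured per dataset.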

Some benchmarks with my improvements vs. Jena 5.1.0:

Benchmark                                                       (param0_GraphUri)  (param1_ParserLang)  Mode  Cnt  Score   Error  Units
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml       RRX.RDFXML_SAX  avgt   15  0,947 ± 0,042   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev  avgt   15  1,326 ± 0,020   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr  avgt   15  0,945 ± 0,012   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml      RRX.RDFXML_ARP1  avgt   15  2,359 ± 0,029   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml       RRX.RDFXML_SAX  avgt   15  0,148 ± 0,012   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_ev  avgt   15  0,182 ± 0,006   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_sr  avgt   15  0,144 ± 0,005   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml      RRX.RDFXML_ARP1  avgt   15  0,296 ± 0,006   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml       RRX.RDFXML_SAX  avgt   15  1,335 ± 0,015   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev  avgt   15  1,708 ± 0,014   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr  avgt   15  1,342 ± 0,012   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml      RRX.RDFXML_ARP1  avgt   15  2,320 ± 0,047   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml       RRX.RDFXML_SAX  avgt   15  0,187 ± 0,004   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_ev  avgt   15  0,227 ± 0,004   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_sr  avgt   15  0,191 ± 0,006   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml      RRX.RDFXML_ARP1  avgt   15  0,298 ± 0,006   s/op
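
As a reading aid for the table above, the following sketch computes the ratio of the Jena 5.1.0 score to the improved score for the EQ file. The class name Speedup is hypothetical; the numbers are copied from the table with decimal commas written as points, and the error margins are ignored. The ratios come out at roughly 1.4 for RRX.RDFXML_SAX and RRX.RDFXML_StAX_sr and roughly 1.3 for RRX.RDFXML_StAX_ev.

```java
import java.util.Locale;

/** Hypothetical helper for reading the benchmark table; not part of Jena. */
final class Speedup {
    /** Baseline time divided by improved time; above 1 means the improved run is faster. */
    static double speedup(double baselineSecPerOp, double improvedSecPerOp) {
        return baselineSecPerOp / improvedSecPerOp;
    }

    public static void main(String[] args) {
        // Scores in s/op for CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml,
        // taken from the parseXMLJena510 (baseline) and parseXML (improved) rows.
        System.out.printf(Locale.ROOT, "RRX.RDFXML_SAX:     %.2f%n", speedup(1.335, 0.947));
        System.out.printf(Locale.ROOT, "RRX.RDFXML_StAX_ev: %.2f%n", speedup(1.708, 1.326));
        System.out.printf(Locale.ROOT, "RRX.RDFXML_StAX_sr: %.2f%n", speedup(1.342, 0.945));
    }
}
```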

Here are benchmarks with citations.rdf and bsbm-5m (converted to RDF/XML):

Benchmark                      (param0_GraphUri)  (param1_ParserLang)  Mode  Cnt   Score    Error  Units
TestXMLParser.parseXML               bsbm-5m.xml       RRX.RDFXML_SAX  avgt    3   8,943 ±  1,683   s/op
TestXMLParser.parseXML               bsbm-5m.xml   RRX.RDFXML_StAX_sr  avgt    3   8,751 ±  1,101   s/op
TestXMLParser.parseXML               bsbm-5m.xml      RRX.RDFXML_ARP1  avgt    3  19,233 ±  4,357   s/op
TestXMLParser.parseXML             citations.rdf       RRX.RDFXML_SAX  avgt    3  47,618 ± 15,676   s/op
TestXMLParser.parseXML             citations.rdf   RRX.RDFXML_StAX_sr  avgt    3  46,390 ± 12,690   s/op
TestXMLParser.parseXML             citations.rdf      RRX.RDFXML_ARP1  avgt    3  82,805 ± 15,137   s/op
TestXMLParser.parseXMLJena510        bsbm-5m.xml       RRX.RDFXML_SAX  avgt    3  13,158 ±  1,557   s/op
TestXMLParser.parseXMLJena510        bsbm-5m.xml   RRX.RDFXML_StAX_sr  avgt    3  13,004 ±  2,103   s/op
TestXMLParser.parseXMLJena510        bsbm-5m.xml      RRX.RDFXML_ARP1  avgt    3  20,122 ± 25,073   s/op
TestXMLParser.parseXMLJena510      citations.rdf       RRX.RDFXML_SAX  avgt    3  59,701 ±  5,946   s/op
TestXMLParser.parseXMLJena510      citations.rdf   RRX.RDFXML_StAX_sr  avgt    3  58,997 ±  3,897   s/op
TestXMLParser.parseXMLJena510      citations.rdf      RRX.RDFXML_ARP1  avgt    3  85,229 ±  7,513   s/op

Are you interested in contributing a solution yourself?

Yes

afs commented 2 days ago

ARP1 isn't relevant.

afs commented 2 days ago
  • removed httpClient from org.apache.jena.riot.RDFParserBuilder and org.apache.jena.riot.RDFParser, which took quite some time during initialization. --> now org.apache.jena.http.HttpEnv#getDftHttpClient is called from org.apache.jena.riot.RDFParser#openTypedInputStream only if needed. HttpEnv also holds a static reference, so that should be fine.

That is removing a significant capability.

Which part of initialization? General system startup, or specific to RDF/XML parsing after JenaSystem.init has happened?

arne-bdt commented 2 days ago

ARP1 isn't relevant.

I did not change anything for the deprecated ARP variants. They are included only for reference. (In your email, you measured riot --syntax arp1 --time --count --sink citations.rdf.)

arne-bdt commented 2 days ago
  • removed httpClient from org.apache.jena.riot.RDFParserBuilder and org.apache.jena.riot.RDFParser, which took quite some time during initialization. --> now org.apache.jena.http.HttpEnv#getDftHttpClient is called from org.apache.jena.riot.RDFParser#openTypedInputStream only if needed. HttpEnv also holds a static reference, so that should be fine.

That is removing a significant capability.

Which part of initialization? General system startup, or specific to RDF/XML parsing after JenaSystem.init has happened?

see https://github.com/apache/jena/pull/2744#discussion_r1780045344
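
For readers following the httpClient point, the general pattern of creating an expensive object "only if needed" can be sketched as below. This is an illustration under assumptions, not the code from the pull request: LazyRef is a hypothetical name, and java.net.http.HttpClient.newHttpClient() merely stands in for whatever default client the parser would obtain. The sketch only shows deferred creation; it does not address how a caller-supplied client would be kept.

```java
import java.net.http.HttpClient;
import java.util.function.Supplier;

/** Hypothetical lazy holder (not Jena code): builds its value on first use only. */
final class LazyRef<T> {
    private final Supplier<T> factory;
    private volatile T value;              // null until the first call to get()

    LazyRef(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the value, creating it once on first access (double-checked locking). */
    T get() {
        T v = value;
        if (v == null) {
            synchronized (this) {
                v = value;
                if (v == null) {
                    v = factory.get();
                    value = v;
                }
            }
        }
        return v;
    }

    boolean isInitialized() {
        return value != null;
    }

    public static void main(String[] args) {
        // Assumption for illustration: the default client is a plain JDK HttpClient.
        LazyRef<HttpClient> dftClient = new LazyRef<>(HttpClient::newHttpClient);
        System.out.println("created before use: " + dftClient.isInitialized());
        HttpClient client = dftClient.get();   // e.g. only when a remote URL is opened
        System.out.println("created after use:  " + dftClient.isInitialized());
    }
}
```

With this shape, parsing from a local file or an in-memory stream never pays the client construction cost, while a remote read triggers it exactly once.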