Swirrl / drafter

A clojure service and a client to it for exposing data management operations to PMD

Add benchmarks for drafter operations #636

Closed: lkitching closed this 2 years ago

lkitching commented 2 years ago

Benchmark results

This PR contains the implementation of a benchmarking harness for drafter, along with two small Clojure CLI projects to generate test data and visualise the results. The benchmarks are defined in Java using JMH and measure the performance of the operations affected by draft rewriting: append data, delete data, delete graph, publish and SPARQL update queries.

Data generation

Each benchmark tests one of the above operations on input data parameterised along three dimensions: the total number of statements, the number of graphs and the percentage of 'graph-referencing' statements. A graph-referencing statement is one which references a draftset graph in the subject, predicate or object position. The proposed changes in #614 modify the draftset rewriting queries to account for such statements, and the initial purpose of these benchmarks is to ensure those changes do not degrade performance too severely.
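To make this concrete, here is a minimal sketch of a graph-referencing check, assuming quads are represented as plain maps with :s, :p, :o and :g keys and the draftset's graphs as a set of URI strings (an illustration only, not drafter's actual representation):

```clojure
;; Illustrative only: not drafter's actual quad representation.
(defn graph-referencing?
  "True when the quad mentions any of the draftset's graph URIs in its
  subject, predicate or object position (the graph position is ignored)."
  [draft-graphs quad]
  (boolean (some draft-graphs [(:s quad) (:p quad) (:o quad)])))

;; Example: the object refers to a draftset graph, so this is a
;; graph-referencing statement.
(graph-referencing? #{"http://example.org/graph-1"}
                    {:s "http://example.org/s"
                     :p "http://example.org/p"
                     :o "http://example.org/graph-1"
                     :g "http://example.org/graph-2"})
;; => true
```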

The data-gen project in this PR is used to generate random test data subject to specified values for these parameters. There are 4 possible values for each of these dimensions, giving 64 possible test inputs for each operation. These values are:

The generate-all task of the data-gen project can be used to generate all the data files required by the benchmarks into a specified data directory.
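As an illustration of how the parameter space multiplies out (the concrete values are defined in the data-gen project and are not reproduced here), a cartesian product of the three dimensions yields the 4 × 4 × 4 = 64 inputs per operation:

```clojure
;; Illustrative only: enumerate every combination of the three dimensions.
(defn all-test-inputs
  "Returns one parameter map per combination of statement count, graph
  count and percentage of graph-referencing statements."
  [statement-counts graph-counts ref-percentages]
  (for [statements statement-counts
        graphs     graph-counts
        ref-pc     ref-percentages]
    {:statements statements :graphs graphs :ref-percentage ref-pc}))

;; With 4 values per dimension:
;; (count (all-test-inputs [s1 s2 s3 s4] [g1 g2 g3 g4] [p1 p2 p3 p4])) => 64
```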

Running the benchmarks

The benchmark project is configured with a dependency on the drafter version to benchmark and packaged as a single uberjar. The benchmarks themselves were run on an e2-highmem-8 (8 vCPU, 64GB memory) instance on the Google Cloud Platform. This was configured with an instance of Stardog 6.3.2 with the following environment:

STARDOG_SERVER_JAVA_ARGS="-Djava.io.tmpdir=/var/lib/stardog-home/tmp -Dcom.sun.management.jmxremote.port=5833 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlog4j2.formatMsgNoLookups=true -Xms16g -Xmx24g -XX:MaxDirectMemorySize=24g"

This sets the minimum and maximum heap sizes to 16GB and 24GB respectively, and the maximum direct memory size to 24GB.

The benchmarks were run twice, once against each of two versions of drafter: one with the old draftset rewriting queries, and one with the new approach developed in #614. These are referred to as the 'old' and 'new' versions in the results below.
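For reference, each run can be pointed at a CSV output file using JMH's standard result options; an invocation might look roughly like the following (the uberjar name here is illustrative):

java -jar drafter-benchmarks.jar -rf csv -rff jmh-result-old.csv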

Results

Each run of the benchmarks wrote its results to a jmh-result.csv file. Below are the files obtained for the two versions:

jmh-result-old.csv jmh-result-new.csv

The perf-charts project was used to generate comparison charts for the two versions across all available dimensions. Below is a brief discussion of the findings for each operation, and of the impact of the data size, the number of graphs and the percentage of graph-referencing statements.

You can generate all the charts locally by downloading these two files and running

clj -M -m perf-charts.main --directory charts jmh-result-old.csv jmh-result-new.csv

in the perf-charts project directory.

Appending data

Below is a chart showing the performance of the two versions when appending different numbers of statements into a single graph with no graph-referencing statements.

[chart: append-1g-0pc]

However, the chart for the two versions inserting 100k statements into a varying number of graphs shows a significant performance drop in the new version as the number of graphs increases beyond 100.

[chart: append-100k-0pc]

This can also be seen in the chart showing performance for inserting an increasing number of statements into 200 graphs, where the new version is consistently slower.

[chart: append-200g-0pc]

Most real-world draftsets will probably not contain such a large number of graphs, but this behaviour could warrant further investigation.

The performance of appending 100k statements into 10 graphs does not appear to be affected by the percentage of graph-referencing statements in the data, for either the old or the new version.

[chart: append-100k-10g]

Deleting data

The delete benchmarks first insert data from the corresponding data file, then delete half of it using an associated 'deletion' file produced by the data generator. Note that the 'large' deletion benchmarks, which delete 1M statements, reliably caused the Stardog instance to crash while calculating statistics, so they were excluded. It's not clear whether this is an issue with Stardog or an artifact of the benchmarking process, so further investigation may be useful.

Similar to the append benchmarks, there's little difference between the new and old versions when deleting from a single graph.

[chart: delete-1g-0pc]

Unlike the append operation, the delete operation does not seem to degrade as the number of graphs increases. Note that since only half the source data is deleted, this only shows the performance of deleting 50k statements, not 100k as in the append benchmark.

[chart: delete-100k-0pc]

Deleting graphs

The 'delete graph' benchmarks append all the data into a new draftset and then choose a random graph to delete. Below is the chart showing the performance of deleting 1 of 10 graphs as the total number of statements increases.

[chart: deleteGraph-10g-0pc]

This shows the new approach appears to be slightly quicker for larger numbers of quads, although this is a cheap operation in both versions.

Strangely, performance seems to improve as the number of graphs increases.

[chart: deleteGraph-100k-0pc]

Since Stardog is not restarted between benchmark runs, it's possible this is due to JIT optimisations in the Stardog JVM as the benchmarks run.

As with the append operation, delete performance is not affected by the percentage of graph-referencing statements when deleting from 100k statements across 10 graphs.

[chart: delete-100k-10g]

As with the append and delete operations, delete graph performance does not appear to change significantly as the percentage of graph-referencing statements increases.

[chart: deleteGraph-100k-10g]

Publishing

The publish benchmarks first append the associated data file into a new draftset and then publish the draftset to live. Below are the results for publishing a single graph containing no graph-referencing statements.

[chart: publish-1g-0pc]

Performance of the two versions is almost identical. The new version becomes slightly slower as the number of graphs increases for 100k statements.

[chart: publish-100k-0pc]

Performance of the new version is noticeably worse, however, when publishing draftsets containing 1M quads as the number of graphs increases.

[chart: publish-1000k-0pc]

As with the other operations, performance does not appear to change as the percentage of graph-referencing statements in the draftset increases.

[chart: publish-100k-10g]

SPARQL Update

The SPARQL update benchmarks append the contents of the data file into a new draftset, choose a random quad to delete, and construct a SPARQL update query to delete it. The performance of the old and new versions is quite similar for a draftset containing a single graph of non-graph-referencing statements.

[chart: updateQuery-1g-0pc]

The two versions remain similarly close as the number of graphs increases.

[chart: updateQuery-100k-0pc]

Once again, performance does not seem to change as the percentage of graph-referencing statements increases within a draftset of 100k statements across 10 graphs.

[chart: updateQuery-100k-10g]
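For illustration, the update being issued here is a single-quad DELETE DATA. A sketch of constructing one follows, again using the simple map representation from earlier (with all terms assumed to be URI strings) rather than drafter's own types:

```clojure
;; Illustrative only: build a SPARQL DELETE DATA update for a single quad
;; whose subject, predicate, object and graph are all URI strings.
(defn delete-quad-update
  [{:keys [s p o g]}]
  (format "DELETE DATA { GRAPH <%s> { <%s> <%s> <%s> } }" g s p o))

(delete-quad-update {:s "http://example.org/s"
                     :p "http://example.org/p"
                     :o "http://example.org/o"
                     :g "http://example.org/graph-1"})
;; => "DELETE DATA { GRAPH <http://example.org/graph-1> { <http://example.org/s> <http://example.org/p> <http://example.org/o> } }"
```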

Conclusions

For the delete, delete graph and SPARQL update operations, performance appears to be very similar between the old and new rewriting approaches for all data sizes. Append performance is similar for a small-to-moderate number of graphs, but becomes noticeably slower as the number of graphs increases beyond 100. This may be acceptable for most real-world use, although further investigation into the performance of the generated rewriting queries may be useful. Publishing performance is also quite similar in the new version, although it suffers in comparison when both the number of statements and the number of graphs are high. Again, this may not be a problem for most use cases, although some large draftsets are sometimes created in production.

Further work

The current benchmark harness is written in Java, since this interoperates most easily with JMH. There is a jmh-clojure project which generates benchmarks dynamically from an EDN configuration file. This might reduce the maintenance burden slightly, since it would remove the need to install the drafter jar into the local Maven repository in order to build the benchmark harness. On the other hand, it requires some runtime configuration, since it dynamically invokes JMH to generate the benchmark classes and compiles them at runtime.

JMH is mainly intended for microbenchmarking methods within a single JVM. The benchmarks defined here spend most of their time executing queries against a remote Stardog instance in a different JVM, and so do not benefit from most of the optimisation mitigations that JMH provides. It may make more sense to write a small Clojure harness instead: one which controls a remote Stardog instance (starting and stopping it between benchmark operations) and performs some generic 'warmup' operations to simulate a production workload. Using Clojure would also allow benchmarks to be generated using the language's metaprogramming facilities rather than manually creating custom JMH state classes for each benchmark. This would make adding more test cases much easier and afford a more precise view of how the different data parameters affect performance.
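A rough sketch of what such a harness could look like follows. This is purely illustrative: the stardog-admin start/stop commands, warmup strategy and timing approach are all assumptions, not an existing implementation.

```clojure
(ns benchmark.harness
  "Illustrative sketch of a possible Clojure benchmark harness; not part of
  this PR. The stardog-admin commands and warmup/timing strategy are
  assumptions."
  (:require [clojure.java.shell :as shell]))

(defn restart-stardog!
  "Stop and start the Stardog server so each benchmark begins from a
  comparable state."
  []
  (shell/sh "stardog-admin" "server" "stop")
  (shell/sh "stardog-admin" "server" "start"))

(defn time-operation
  "Run warmup-fn a few times to simulate a warm production workload, then
  time a single invocation of op-fn, returning elapsed milliseconds."
  [warmup-fn op-fn]
  (dotimes [_ 5] (warmup-fn))
  (let [start (System/nanoTime)]
    (op-fn)
    (/ (- (System/nanoTime) start) 1e6)))

(defn run-benchmark
  "Restart Stardog then measure op-fn once; a real harness would repeat
  this and aggregate the timings."
  [warmup-fn op-fn]
  (restart-stardog!)
  (time-operation warmup-fn op-fn))
```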

RickMoynihan commented 2 years ago

This is great, thanks @lkitching! 🙇

Judging by the analysis it looks to me like real world performance is unlikely to be significantly impacted by the fixes for #607. In particular it seems there's only a marginal difference in append and delete performance where you have no graph referencing statements:

[chart: graph referencing statements]

Similarly we would expect changes within a draft to occur only on a small number of graphs (typically just a few; in practice unlikely to be more than 10).

Publishing performance also seems to be only mildly affected, and only mildly so when you're referencing graphs in the data.

On the basis of this information I think we can safely merge the changes in PR #614.

I think your points about HotSpot and Stardog are valid, but it's important to note these aren't really microbenchmarks, but macro ones. It's my belief that the results are likely still representative, because Stardog was running across many of the tests, which repeatedly exercise similar operations. If we wanted to protect against this more in the future, we'd perhaps be better off just running the tests from the largest down to the smallest: the HotSpot differences and warm-up would likely be swallowed into background noise on the larger operations, while the smaller, more representative tests would be hot by the time they're exercised, and it is the smaller tests where the difference between being warm and cold would make the most difference. Regardless, I see no need to repeat the tests at this stage.