enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org

Speed benchmarks up by shortening the 10s iteration time #6271

Open JaroslavTulach opened 1 year ago

JaroslavTulach commented 1 year ago

While using the new IfVsCaseBenchmark I realized we can specify the time for each @Warmup and @Measurement iteration. The default time is 10s, but it seems 1s per warmup iteration and 3s per measurement is good enough for IfVsCaseBenchmark. Given that it has five warmup iterations and three measurements, we go from 80s down to 11s. Of course, shortening the warmup time too much may reduce the stability of our benchmarks, but the overall speedup seems too attractive not to try.
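For reference, this is roughly what such a configuration looks like in JMH. A minimal sketch with a hypothetical benchmark class; the time values on @Warmup and @Measurement are seconds by default, and JMH's default per-iteration time is 10s:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Hypothetical benchmark illustrating per-phase iteration times.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 1)      // five 1s warmup iterations instead of 10s each
@Measurement(iterations = 3, time = 3) // three 3s measurement iterations instead of 10s each
@State(Scope.Benchmark)
public class ShortIterationBenchmark {
  @Benchmark
  public long sumUpTo() {
    long acc = 0;
    for (int i = 0; i < 1_000; i++) {
      acc += i;
    }
    return acc;
  }
}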

CCing @jdunkerley, @Akirathan

Akirathan commented 1 year ago

As long as there is no, or almost no, background compilation in the measurement phase, then yes, we can decrease the warmup time. I would be careful about it, though. For example, with the current JMH framework we use, there is no simple way to tell whether there are still compilations running in the measurement phase. Note that if we want to measure only the performance of Enso, and not the performance of Graal compilations, all the Graal compilations should be done during the warmup phase.

In FastR, I reimplemented the whole benchmarking infrastructure from scratch. I started every benchmark with --engine.TraceCompilationDetails, captured the whole output with timestamps, and once the benchmark ended, I checked whether there were any engine compilation logs during the measurement phase.
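In GraalVM terms, the same tracing can be enabled directly on a polyglot Context. A minimal sketch, assuming any installed Truffle language (the "js" id and the class name are only examples):

import org.graalvm.polyglot.Context;

// Sketch: turn on Truffle compilation tracing and send the engine log
// to stderr, where timestamps can be correlated with benchmark phases.
public class TraceCompilationExample {
  public static void main(String[] args) {
    try (Context ctx = Context.newBuilder()
        .allowExperimentalOptions(true)
        .option("engine.TraceCompilationDetails", "true")
        .logHandler(System.err)
        .build()) {
      ctx.eval("js", "6 * 7");
    }
  }
}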

Capturing the trace like this is just an example of an almost reliable way to detect that the warmup was sufficient. Before we move on with the investigation, I strongly recommend implementing a similar system to recognize sufficient warmup and modifying the times accordingly.

Not to mention that it would be a very good idea to introduce benchmark versioning if we are about to fiddle with any parameters. More details in #5714.

JaroslavTulach commented 1 year ago

> almost reliable way to detect that the warmup was sufficient

I think that tracing Truffle compilation and failing if it happens during a measurement is a good idea. Here is an implementation for Enso benchmarks:

enso$ git diff engine/runtime/src/bench/java/org/enso/interpreter/bench/benchmarks/semantic/ListBenchmarks.java
diff --git engine/runtime/src/bench/java/org/enso/interpreter/bench/benchmarks/semantic/ListBenchmarks.java engine/runtime/src/bench/java/org/enso/interpreter/bench/benchmarks/semantic/ListBenchmarks.java
index ef8b748ab8..481b607551 100644
--- engine/runtime/src/bench/java/org/enso/interpreter/bench/benchmarks/semantic/ListBenchmarks.java
+++ engine/runtime/src/bench/java/org/enso/interpreter/bench/benchmarks/semantic/ListBenchmarks.java
@@ -1,9 +1,12 @@
 package org.enso.interpreter.bench.benchmarks.semantic;

-import java.io.ByteArrayOutputStream;
 import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.concurrent.TimeUnit;
 import java.util.function.Function;
+import java.util.logging.Handler;
+import java.util.logging.LogRecord;
 import org.graalvm.polyglot.Context;
 import org.graalvm.polyglot.Value;
 import org.openjdk.jmh.annotations.Benchmark;
@@ -18,12 +21,14 @@ import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.annotations.Warmup;
 import org.openjdk.jmh.infra.BenchmarkParams;
 import org.openjdk.jmh.infra.Blackhole;
+import org.openjdk.jmh.infra.IterationParams;
+import org.openjdk.jmh.runner.IterationType;

 @BenchmarkMode(Mode.AverageTime)
 @Fork(1)
-@Warmup(iterations = 3)
-@Measurement(iterations = 5)
+@Warmup(iterations = 1, time = 1)
+@Measurement(iterations = 5, time = 1)
 @OutputTimeUnit(TimeUnit.MILLISECONDS)
 @State(Scope.Benchmark)
 public class ListBenchmarks {
@@ -33,6 +38,7 @@ public class ListBenchmarks {
   private Value self;
   private Value sum;
   private Value oldSum;
+  private List<LogRecord> messages = new ArrayList<>();

   @Setup
   public void initializeBenchmark(BenchmarkParams params) throws Exception {
@@ -40,7 +46,21 @@ public class ListBenchmarks {
       .allowExperimentalOptions(true)
       .allowIO(true)
       .allowAllAccess(true)
-      .logHandler(new ByteArrayOutputStream())
+      .option("engine.TraceCompilation", "true")
+      .logHandler(new Handler() {
+        @Override
+        public void publish(LogRecord lr) {
+          messages.add(lr);
+        }
+
+        @Override
+        public void flush() {
+        }
+
+        @Override
+        public void close() throws SecurityException {
+        }
+      })
       .option(
         "enso.languageHomeOverride",
         Paths.get("../../distribution/component").toFile().getAbsolutePath()
@@ -113,16 +133,18 @@ public class ListBenchmarks {
   }

   @Benchmark
-  public void mapOverList(Blackhole matter) {
-    performBenchmark(matter);
+  public void mapOverList(IterationParams it, Blackhole matter) {
+    performBenchmark(it, matter);
   }

   @Benchmark
-  public void mapOverLazyList(Blackhole matter) {
-    performBenchmark(matter);
+  public void mapOverLazyList(IterationParams it, Blackhole matter) {
+    performBenchmark(it, matter);
   }

-  private void performBenchmark(Blackhole hole) throws AssertionError {
+  private void performBenchmark(IterationParams it, Blackhole hole) throws AssertionError {
+    messages.clear();
+
     var newList = plusOne.execute(self, list);
     var newSum = sum.execute(self, newList, 0);

@@ -131,6 +153,12 @@
       throw new AssertionError("Unexpected result " + result);
     }
     hole.consume(result);
+    if (it.getType() == IterationType.MEASUREMENT && !messages.isEmpty()) {
+      for (var lr : messages) {
+        System.err.println(lr.getMessage());
+      }
+      throw new AssertionError("Found " + messages.size() + " compilation message(s) during measurement");
+    }
   }
 }

With these changes (especially after reducing the warmup to a single 1s iteration) my run fails:

sbt:runtime> benchOnly ListBenchmarks
Iteration   1: 
opt done   id=3593  org.graalvm.polyglot.Value<Function>.execute       |Tier 2|Time   593( 239+354 )ms|AST   47|Inlined  53Y   0N|IR   3573/ 17618|CodeSize   74181|Addr 0x7f0707edca90|Timestamp 91619721533011|Src n/a
<failure>

java.lang.AssertionError: Found 1 compilation message(s) during measurement
        at org.enso.interpreter.bench.benchmarks.semantic.ListBenchmarks.performBenchmark(ListBenchmarks.java:160)
        at org.enso.interpreter.bench.benchmarks.semantic.ListBenchmarks.mapOverLazyList(ListBenchmarks.java:142)
        at org.enso.interpreter.bench.benchmarks.semantic.jmh_generated.ListBenchmarks_mapOverLazyList_jmhTest:232

This way we can detect results skewed by compilation that has not finished before the measurement phase. Of course, this sketch should rather be turned into a supporting TruffleCompilationHandler class and used by all our benchmarks.
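A minimal sketch of what such a class might look like; only the TruffleCompilationHandler name comes from the comment above, the rest of the API is an assumption:

import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

// Collects Truffle compilation log records so a benchmark can assert
// that no compilation happened during a measurement iteration.
public final class TruffleCompilationHandler extends Handler {
  private final List<LogRecord> records = new ArrayList<>();

  @Override
  public synchronized void publish(LogRecord lr) {
    records.add(lr);
  }

  @Override
  public void flush() {
  }

  @Override
  public void close() {
  }

  /** Forgets records gathered so far, e.g. at the start of an iteration. */
  public synchronized void reset() {
    records.clear();
  }

  /** Fails if any compilation log arrived since the last reset. */
  public synchronized void assertNoCompilation() {
    if (!records.isEmpty()) {
      for (var lr : records) {
        System.err.println(lr.getMessage());
      }
      throw new AssertionError(
          "Found " + records.size() + " compilation message(s) during measurement");
    }
  }
}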

JaroslavTulach commented 2 months ago

Regardless of whether we lower the time, we need to detect whether a compilation is running while benchmarking. Being implemented as