enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org
Apache License 2.0
7.32k stars 318 forks source link

Investigate if we can avoid Polyglot conversions and if it will be beneficial in practice #7378

Open radeusgd opened 11 months ago

radeusgd commented 11 months ago

Benchmarks suggested that performing polyglot conversions incurs a significant performance penalty when invoking Java-to-Enso callbacks.

Initially, the hypothesis was that the conversion is only needed when date values are expected, so in many scenarios it can be avoided.

Further work shows, that we can also encounter dataflow errors as results of an Enso callback and to correctly handle these, a conversion is needed also. This significantly reduces the amount of functions that can benefit from this optimization. Moreover, it makes the system less resistant to errors - we have to be extra careful that optimized operations will never throw a dataflow error, as with the optimization turned on for a particular operation a dataflow error will result in a panic.

Thus the removal has not been implemented yet.

The goal of this ticket is to investigate where it could really be safe to remove the conversions and if it will give us any benefits.

Similarly, if we want to avoid the conversions in Column.from_vector once #6111 is implemented, we'd need to ensure that a Vector cannot hold a dataflow error - i.e. we would need to first get #5529 fixed.

radeusgd commented 11 months ago

A patch implementing a PoC of avoiding the conversions if we say that there are no dataflow errors and no dates are accepted by the given type:

Subject: [PATCH] Revert avoiding conversions as it's problematic. Could be revisited later properly.
---
Index: distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso b/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso
--- a/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso    (revision 2710c15d5a777dad5141c65169e2c8e325e985f9)
+++ b/distribution/lib/Standard/Table/0.0.0-dev/src/Data/Column.enso    (revision d8ccd7dd09b8a63681402ef1479cd2315b0848d9)
@@ -2118,16 +2118,21 @@
      handling logic.
    - expected_result_type: The expected result type of the operation. If set to
      `Nothing`, the result type is inferred from the values.
-run_binary_op column function operand new_name skip_nulls=True expected_result_type=Nothing =
+   - expect_dataflow_errors: Specifies if `function` may raise dataflow errors.
+     If `expect_dataflow_errors` is set to `False` (default), the polyglot code
+     may run some optimizations that may yield invalid behavior if a dataflow
+     error is returned. If the function may raise errors, this should be set to
+     `True`.
+run_binary_op column function operand new_name skip_nulls=True expected_result_type=Nothing expect_dataflow_errors=False =
     s = column.java_column.getStorage
     storage_type = resolve_storage_type expected_result_type
     new_storage = case operand of
         other_column : Column ->
             Polyglot_Helpers.handle_polyglot_dataflow_errors <|
-                s.zip function other_column.java_column.getStorage skip_nulls storage_type
+                s.zip function other_column.java_column.getStorage skip_nulls storage_type expect_dataflow_errors
         _ ->
             Polyglot_Helpers.handle_polyglot_dataflow_errors <|
-                s.biMap function operand skip_nulls storage_type
+                s.biMap function operand skip_nulls storage_type expect_dataflow_errors
     Column.Value (Java_Column.new new_name new_storage)

 ## PRIVATE
@@ -2143,11 +2148,16 @@
      handling logic.
    - expected_result_type: The expected result type of the operation. If set to
      `Nothing`, the result type is inferred from the values.
-run_unary_op column function new_name skip_nulls=True expected_result_type=Nothing =
+   - expect_dataflow_errors: Specifies if `function` may raise dataflow errors.
+     If `expect_dataflow_errors` is set to `False` (default), the polyglot code
+     may run some optimizations that may yield invalid behavior if a dataflow
+     error is returned. If the function may raise errors, this should be set to
+     `True`.
+run_unary_op column function new_name skip_nulls=True expected_result_type=Nothing expect_dataflow_errors=False =
     s = column.java_column.getStorage
     storage_type = resolve_storage_type expected_result_type
     new_storage = Polyglot_Helpers.handle_polyglot_dataflow_errors <|
-        s.unaryMap function skip_nulls storage_type
+        s.unaryMap function skip_nulls storage_type expect_dataflow_errors
     Column.Value (Java_Column.new new_name new_storage)

 ## PRIVATE
Index: std-bits/table/src/main/java/org/enso/table/data/column/storage/Storage.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/std-bits/table/src/main/java/org/enso/table/data/column/storage/Storage.java b/std-bits/table/src/main/java/org/enso/table/data/column/storage/Storage.java
--- a/std-bits/table/src/main/java/org/enso/table/data/column/storage/Storage.java  (revision 2710c15d5a777dad5141c65169e2c8e325e985f9)
+++ b/std-bits/table/src/main/java/org/enso/table/data/column/storage/Storage.java  (revision d8ccd7dd09b8a63681402ef1479cd2315b0848d9)
@@ -16,6 +16,7 @@
 import org.enso.table.data.column.storage.type.StorageType;
 import org.enso.table.data.mask.OrderMask;
 import org.enso.table.data.mask.SliceRange;
+import org.enso.table.util.Polyglot_Helper;
 import org.graalvm.polyglot.Context;
 import org.graalvm.polyglot.Value;

@@ -133,19 +134,33 @@
    * @param skipNa whether rows containing missing values should be passed to the function.
    * @param expectedResultType the expected type for the result storage; it is ignored if the
    *     operation is vectorized
+   * @param expectDataflowErrors specifies if this function should be prepared to handle dataflow
+   *     errors coming from `function` (which may add a bit of overhead)
    * @return the result of running the function on each row
    */
   public final Storage<?> unaryMap(
-      Function<Object, Value> function, boolean skipNa, StorageType expectedResultType) {
+      Function<Object, Object> function,
+      boolean skipNa,
+      StorageType expectedResultType,
+      boolean expectDataflowErrors) {
     Builder storageBuilder = Builder.getForType(expectedResultType, size());
+    final boolean needsConversion =
+        Polyglot_Helper.isPolyglotConversionNeeded(expectedResultType, expectDataflowErrors);
     final Context context = Context.getCurrent();
     for (int i = 0; i < size(); i++) {
       Object it = getItemBoxed(i);
       if (skipNa && it == null) {
         storageBuilder.appendNulls(1);
       } else {
-        Value result = function.apply(it);
-        Object converted = Polyglot_Utils.convertPolyglotValue(result);
+        Object result = function.apply(it);
+        // TODO [RW] - I'm not 100% sure this is actually improving the performance much without
+        // again measuring it, we may need to revise this.
+        // (the original code that fas faster had no conversions, but here we are adding a branch
+        // which may actually
+        // hurt performance more, the branch is on an effective constant so it may not be a big
+        // deal, but only way to
+        // tell is to measure.)
+        Object converted = needsConversion ? Polyglot_Utils.convertPolyglotValue(result) : result;
         storageBuilder.appendNoGrow(converted);
       }

@@ -163,19 +178,25 @@
    *     without passing them through the function, this is useful if the function does not support
    *     the null-values, but it needs to be set to false if the function should handle them.
    * @param expectedResultType the expected type for the result storage
+   * @param expectDataflowErrors specifies if this function should be prepared to handle dataflow
+   *     errors coming from `function` (which may add a bit of overhead)
    * @return a new storage containing results of the function for each row
    */
   public final Storage<?> biMap(
       BiFunction<Object, Object, Object> function,
       Object argument,
       boolean skipNulls,
-      StorageType expectedResultType) {
+      StorageType expectedResultType,
+      boolean expectDataflowErrors) {
+
     Builder storageBuilder = Builder.getForType(expectedResultType, size());
     if (skipNulls && argument == null) {
       storageBuilder.appendNulls(size());
       return storageBuilder.seal();
     }

+    final boolean needsConversion =
+        Polyglot_Helper.isPolyglotConversionNeeded(expectedResultType, expectDataflowErrors);
     final Context context = Context.getCurrent();
     for (int i = 0; i < size(); i++) {
       Object it = getItemBoxed(i);
@@ -183,7 +204,9 @@
         storageBuilder.appendNulls(1);
       } else {
         Object result = function.apply(it, argument);
-        Object converted = Polyglot_Utils.convertPolyglotValue(result);
+        // TODO [RW] - I'm not 100% sure this is actually improving the performance much without
+        // again measuring it, we may need to revise this
+        Object converted = needsConversion ? Polyglot_Utils.convertPolyglotValue(result) : result;
         storageBuilder.appendNoGrow(converted);
       }

@@ -199,14 +222,19 @@
    * @param skipNa whether rows containing missing values should be passed to the function.
    * @param expectedResultType the expected type for the result storage; it is ignored if the
    *     operation is vectorized
+   * @param expectDataflowErrors specifies if this function should be prepared to handle dataflow
+   *     errors coming from `function` (which may add a bit of overhead)
    * @return the result of running the function on all non-missing elements.
    */
   public final Storage<?> zip(
       BiFunction<Object, Object, Object> function,
       Storage<?> arg,
       boolean skipNa,
-      StorageType expectedResultType) {
+      StorageType expectedResultType,
+      boolean expectDataflowErrors) {
     Builder storageBuilder = Builder.getForType(expectedResultType, size());
+    final boolean needsConversion =
+        Polyglot_Helper.isPolyglotConversionNeeded(expectedResultType, expectDataflowErrors);
     final Context context = Context.getCurrent();
     for (int i = 0; i < size(); i++) {
       Object it1 = getItemBoxed(i);
@@ -215,7 +243,9 @@
         storageBuilder.appendNulls(1);
       } else {
         Object result = function.apply(it1, it2);
-        Object converted = Polyglot_Utils.convertPolyglotValue(result);
+        // TODO [RW] - I'm not 100% sure this is actually improving the performance much without
+        // again measuring it, we may need to revise this
+        Object converted = needsConversion ? Polyglot_Utils.convertPolyglotValue(result) : result;
         storageBuilder.appendNoGrow(converted);
       }

@@ -241,14 +271,14 @@
   public final Storage<?> vectorizedOrFallbackUnaryMap(
       String name,
       MapOperationProblemBuilder problemBuilder,
-      Function<Object, Value> fallback,
+      Function<Object, Object> fallback,
       boolean skipNa,
       StorageType expectedResultType) {
     if (isUnaryOpVectorized(name)) {
       return runVectorizedUnaryMap(name, problemBuilder);
     } else {
       checkFallback(fallback, expectedResultType, name);
-      return unaryMap(fallback, skipNa, expectedResultType);
+      return unaryMap(fallback, skipNa, expectedResultType, false);
     }
   }

@@ -278,7 +308,7 @@
       return runVectorizedBiMap(name, argument, problemBuilder);
     } else {
       checkFallback(fallback, expectedResultType, name);
-      return biMap(fallback, argument, skipNulls, expectedResultType);
+      return biMap(fallback, argument, skipNulls, expectedResultType, false);
     }
   }

@@ -308,7 +338,7 @@
       return runVectorizedZip(name, other, problemBuilder);
     } else {
       checkFallback(fallback, expectedResultType, name);
-      return zip(fallback, other, skipNulls, expectedResultType);
+      return zip(fallback, other, skipNulls, expectedResultType, false);
     }
   }

Index: std-bits/table/src/main/java/org/enso/table/util/Polyglot_Helper.java
===================================================================
diff --git a/std-bits/table/src/main/java/org/enso/table/util/Polyglot_Helper.java b/std-bits/table/src/main/java/org/enso/table/util/Polyglot_Helper.java
new file mode 100644
--- /dev/null   (revision d8ccd7dd09b8a63681402ef1479cd2315b0848d9)
+++ b/std-bits/table/src/main/java/org/enso/table/util/Polyglot_Helper.java (revision d8ccd7dd09b8a63681402ef1479cd2315b0848d9)
@@ -0,0 +1,27 @@
+package org.enso.table.util;
+
+import org.enso.table.data.column.storage.type.AnyObjectType;
+import org.enso.table.data.column.storage.type.DateTimeType;
+import org.enso.table.data.column.storage.type.DateType;
+import org.enso.table.data.column.storage.type.StorageType;
+import org.enso.table.data.column.storage.type.TimeOfDayType;
+
+public class Polyglot_Helper {
+  public static boolean isPolyglotConversionNeeded(
+      StorageType expectedType, boolean expectDataflowErrors) {
+    if (expectDataflowErrors) {
+      return true;
+    }
+
+    if (expectedType == null) {
+      return true;
+    }
+
+    boolean mayHoldDates =
+        expectedType instanceof AnyObjectType
+            || expectedType instanceof DateType
+            || expectedType instanceof DateTimeType
+            || expectedType instanceof TimeOfDayType;
+    return mayHoldDates;
+  }
+}

It may be used as a starting point for this, but amendments may still be needed.