[GLUTEN-5771][VL] Add metrics for ColumnarArrowEvalPythonExec

apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

https://gluten.apache.org/

Apache License 2.0

1.03k stars 368 forks source link

[GLUTEN-5771][VL] Add metrics for ColumnarArrowEvalPythonExec #5772

Closed yma11 closed 1 week ago

yma11 commented 2 weeks ago

What changes were proposed in this pull request?

Add metric for ColumnarArrowEvalPythonExec

(Fixes: #5771)

Spark UI

How was this patch tested?

We tested performance of arrow udf and collected some performance:

from pyspark.sql.functions import pandas_udf, PandasUDFType
import pyspark.sql.functions as F
import os
@pandas_udf('long')
def pandas_plus_one(v):
    return (v + 1)

@pandas_udf('string')
def pd_get_first(v):
    return v.str.split(':').str[1]
# test int
df = spark.read.orc("file:///xx/yy").select("user_id").withColumn("processed_user_id", pandas_plus_one("user_id")).select("processed_user_id")
# test string
df = spark.read.orc("file:///xx/yy").select("url").withColumn("processed_url", pd_get_first("url")).select("processed_url")

The perf shows ~20% perf gain compared with vanilla spark.

github-actions[bot] commented 2 weeks ago

https://github.com/apache/incubator-gluten/issues/5771

FelixYBW commented 2 weeks ago

@yma11 can you add a UI chart for the pyarrow UDF? Also add some implementation details?

In theory we can convert Velox to Arrow in Velox pipeline, then pass the arrow pointer to Spark where it's send to python process. There is no C2R and R2C in the whole process and no memcpy between Velox and Spark. Can we achieve this?

yma11 commented 2 weeks ago

@yma11 can you add a UI chart for the pyarrow UDF? Also add some implementation details?

In theory we can convert Velox to Arrow in Velox pipeline, then pass the arrow pointer to Spark where it's send to python process. There is no C2R and R2C in the whole process and no memcpy between Velox and Spark. Can we achieve this?

Yes. There is no C2R and R2C in current implementation. There is a VeloxColumnar to Arrow only. But for memcpy, it depends on the arrow bridge. I found there are still some memory allocation at velox for data types like string. Let me add the implementation under the feature track.

yma11 commented 1 week ago

@yma11 can you add a UI chart for the pyarrow UDF? Also add some implementation details? In theory we can convert Velox to Arrow in Velox pipeline, then pass the arrow pointer to Spark where it's send to python process. There is no C2R and R2C in the whole process and no memcpy between Velox and Spark. Can we achieve this?

Yes. There is no C2R and R2C in current implementation. There is a VeloxColumnar to Arrow only. But for memcpy, it depends on the arrow bridge. I found there are still some memory allocation at velox for data types like string. Let me add the implementation under the feature track.

@FelixYBW The implementation details are now added in 5461. Perf data is also wrapped there. FYI.

yma11 commented 1 week ago

I just noticed that this file (ColumnarArrowEvalPythonExec.scala)'s package is package org.apache.spark.api.python which is wrong. Would you like to fix it? @yma11

Fixed.

yma11 commented 1 week ago

@zhztheplayer Please help take a look again. Thanks.