wgip opened 3 years ago
I've spent quite some time trying to link the external libraries with ncc. I was able to build against Arrow-GLib with ncc (so it should work with nc++ too), but only when the paths to the libraries are specified manually. It seems, however, that Aurora4j and PyVeo depend on nld to dynamically link the libraries.

The problem is that nld doesn't seem to be compatible with the libraries, as it logs the following:
```
nld: skipping incompatible /usr/lib64/libarrow-glib.so when searching for -larrow-glib
nld: skipping incompatible /usr/lib64/libarrow-glib.a when searching for -larrow-glib
```
Not really sure yet why this occurs.
@Wosin libarrow-glib.so would need to be compiled for the Aurora architecture as well; you probably won't be able to link x86 libs into an Aurora project.
Yeah, I was wondering if there is a way to link them statically, though.
It seems that WholeStageCodegen is represented as a parent plan that keeps the original plan as its child. So, basically, if we have something like `HashAggregate` and that single operation gets transformed by whole-stage codegen, then a new Spark plan, `WholeStageCodegenExec`, is created, with the `HashAggregate` as its `child`. What is more, my experiments suggest that whole-stage codegen is actually executed after our plugin, not before, so I don't think we need to disable it at all.
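As a minimal standalone sketch of that structure (illustration only; `unwrapCodegen` is a made-up helper, not plugin code), pattern matching on `WholeStageCodegenExec` gives back the wrapped plan directly:

```scala
import org.apache.spark.sql.execution.{SparkPlan, WholeStageCodegenExec}

// Made-up helper: peel off the whole-stage codegen wrapper to reach the
// original plan, e.g. the HashAggregateExec it keeps as its child.
def unwrapCodegen(plan: SparkPlan): SparkPlan = plan match {
  case WholeStageCodegenExec(child) => child // the original, wrapped plan
  case other                        => other // not wrapped; leave as-is
}
```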
More notes:
```scala
object VeoGenericPlanExtractor {
  def matchPlan(sparkPlan: SparkPlan): Option[GenericSparkPlanDescription] = {
    PartialFunction.condOpt(sparkPlan) {
      // Match the final HashAggregate -> ShuffleExchange -> partial HashAggregate
      // chain; underscore-prefixed bindings come from the inner (partial) aggregate.
      case first @ HashAggregateExec(
            requiredChildDistributionExpressions,
            groupingExpressions,
            exprs,
            aggregateAttributes,
            initialInputBufferOffset,
            resultExpressions,
            org.apache.spark.sql.execution.exchange.ShuffleExchangeExec(
              outputPartitioning,
              f @ org.apache.spark.sql.execution.aggregate.HashAggregateExec(
                _requiredChildDistributionExpressions,
                _groupingExpressions,
                _aggregateExpressions,
                _aggregateAttributes,
                _initialInputBufferOffset,
                _resultExpressions,
                fourth @ sparkPlan
              ),
              shuffleOrigin
            )
          ) => {
        // Map each input column name to its position in the innermost child's output.
        val columnIndices = fourth.output.map(_.name).zipWithIndex.toMap
        println(s"CID => ${columnIndices}")
        println(s"First =$first")
        val functions =
          _aggregateExpressions.map(_.aggregateFunction.asInstanceOf[DeclarativeAggregate])
        val resIds =
          _aggregateExpressions.map(_.resultId)
        val initExpr = functions.map(f => f.initialValues)
        val aggregateBufferAttributes: Seq[AttributeReference] =
          _aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes)
        // Rewrite named attribute references into positional BoundReferences.
        val aggResults =
          bindReferences(functions.map(_.evaluateExpression), aggregateBufferAttributes)
        val resultVars = bindReferences(resultExpressions, aggregateAttributes)
        // Debug dump of everything the extractor sees (output pasted below).
        List(
          "modes" -> _aggregateExpressions.map(_.mode).distinct,
          "rcde" -> _requiredChildDistributionExpressions.toSeq.flatten,
          "ge" -> _groupingExpressions,
          "ae" -> _aggregateExpressions,
          "aa" -> _aggregateAttributes,
          "allAtts" -> f.allAttributes.attrs,
          "re" -> _resultExpressions,
          "fo" -> fourth.output,
          "bufVars" -> initExpr,
          "resIds" -> resIds,
          "preAggResults" -> functions.map(_.evaluateExpression),
          "aggResults" -> aggResults,
          "resultVars" -> resultVars,
          "aggregateBufferAttributes" -> aggregateBufferAttributes,
          "out" -> f.output
        ).foreach { case (k, vs) =>
          println(s"$k: ==>")
          vs.foreach(println)
          println(" ")
        }
        val columnMappings = exprs
          .map(expression => (expression, extractAttributes(expression.references)))
          .zipWithIndex
          .map { case ((operation, attributes), id) =>
            ColumnAggregationExpression(
              attributes.map(attr => Column(columnIndices(attr.value), attr.value)),
              operation,
              id
            )
          }
        // ... (snippet truncated in the original)
```
Output:

```
First =HashAggregate(keys=[], functions=[sum((_1#37 + _2#38)), avg((_2#38 - _1#37)), sum(_3#39)], output=[sum((_1 + _2))#47, avg((_2 - _1))#48, sum(_3)#49])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#33]
   +- HashAggregate(keys=[], functions=[partial_sum((_1#37 + _2#38)), partial_avg((_2#38 - _1#37)), partial_sum(_3#39)], output=[sum#64, sum#65, count#66L, sum#67])
      +- LocalTableScan [_1#37, _2#38, _3#39]
modes: ==>
Partial
rcde: ==>
ge: ==>
ae: ==>
partial_sum((_1#37 + _2#38))
partial_avg((_2#38 - _1#37))
partial_sum(_3#39)
aa: ==>
sum#60
sum#61
count#62L
sum#63
allAtts: ==>
_1#37
_2#38
_3#39
sum#60
sum#61
count#62L
sum#63
sum#60
sum#61
count#62L
sum#63
sum#64
sum#65
count#66L
sum#67
re: ==>
sum#64
sum#65
count#66L
sum#67
fo: ==>
_1#37
_2#38
_3#39
bufVars: ==>
List(null)
List(0.0, 0)
List(null)
resIds: ==>
ExprId(44,d84a5437-c39a-4a61-80ea-245df8effc7c)
ExprId(45,d84a5437-c39a-4a61-80ea-245df8effc7c)
ExprId(46,d84a5437-c39a-4a61-80ea-245df8effc7c)
preAggResults: ==>
sum#60
(sum#61 / cast(count#62L as double))
sum#63
aggResults: ==>
input[0, double, true]
(input[1, double, true] / cast(input[2, bigint, true] as double))
input[3, double, true]
resultVars: ==>
input[0, double, true] AS sum((_1 + _2))#47
input[1, double, true] AS avg((_2 - _1))#48
input[2, double, true] AS sum(_3)#49
aggregateBufferAttributes: ==>
sum#60
sum#61
count#62L
sum#63
out: ==>
sum#64
sum#65
count#66L
sum#67
```
```
CID => Map(_1 -> 0, _2 -> 1, _3 -> 2)
First =HashAggregate(keys=[], functions=[sum((_1#37 + _2#38)), avg((_2#38 - _1#37)), sum(_3#39)], output=[sum((_1 + _2))#47, avg((_2 - _1))#48, sum(_3)#49])
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#38]
   +- HashAggregate(keys=[], functions=[partial_sum((_1#37 + _2#38)), partial_avg((_2#38 - _1#37)), partial_sum(_3#39)], output=[sum#64, sum#65, count#66L, sum#67])
      +- LocalTableScan [_1#37, _2#38, _3#39]
```

(The plan is matched a second time; apart from the Exchange id, the output is identical to the dump above.)
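One thing the dump makes visible is what `bindReferences` does here: `preAggResults` still refer to the aggregation buffer by name (`sum#61`, `count#62L`), while `aggResults` refer to it by position (`input[1, double, true]`). A small standalone demo of that rewrite (the attribute names are made up for illustration):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.BindReferences.bindReferences
import org.apache.spark.sql.types.{DoubleType, LongType}

// Two made-up buffer attributes standing in for sum#61 and count#62L.
val sum = AttributeReference("sum", DoubleType)()
val count = AttributeReference("count", LongType)()

// The evaluate expression of an average: sum / cast(count as double).
val avg = Divide(sum, Cast(count, DoubleType))

// Binding replaces each named reference with its ordinal in the buffer,
// which is what turns the named form into the input[...] form in the dump.
val bound = bindReferences[Expression](Seq(avg), Seq(sum, count))
println(bound.head)
// (input[0, double, true] / cast(input[1, bigint, true] as double))
```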
If the plan ends up looking like `VE + something + VE`, we should consider squashing that `something` (for example, an `Exchange`) where possible, in particular if it's running on the same VH: "the VE sandwich".

Also, should `SUM(a+b), AVG(x - y)` run as "`a+b`, then `x - y`, then `SUM`, then `AVG`, ..."? My expectation is that the compiled approach would be much faster, but it's important to test this out; a toy comparison of the two strategies is sketched below.

Discussion was based around this:
Outcomes: