Closed parthchandra closed 2 weeks ago
Initial performance numbers for this implementation are not looking good. There are two areas where things are getting slower compared to Spark:
1. No WholeStageCodegen - the iteration over rows alone adds extra cost to the implementation.
Running benchmark: ColumnarToRowExec
Running case: Spark Columnar To Row - integer
Stopped after 34 iterations, 2029 ms
Running case: Comet Columnar To Row - integer
Stopped after 24 iterations, 2081 ms
Spark Columnar To Row - integer              40    60    14    262.1    3.8    1.0X
Comet Columnar To Row - integer              53    87    32    198.2    5.0    0.8X
@mbutrovich @andygrove any thoughts?
@andygrove @Dandandan thanks for your comments. I've tried this now with a single datatype and no datatype checks and I see the same results. I'm going to close this PR until I can think of some better way to do this.
@parthchandra I experimented with iterating over columns first and then rows. I also modified the code to create the SparkUnsafeRow once instead of once per row. I now see similar performance between Spark and native.
OpenJDK 64-Bit Server VM 11.0.24+8-post-Ubuntu-1ubuntu322.04 on Linux 6.8.0-47-generic
AMD Ryzen 9 7950X3D 16-Core Processor
ColumnarToRowExec: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Spark Columnar To Row - integer 57 62 4 183.9 5.4 1.0X
Comet Columnar To Row - integer 57 73 17 183.8 5.4 1.0X
Code:
// `j` is the column index from the enclosing iteration over columns;
// `addr` is the start address of the pre-allocated off-heap buffer.
ArrowDataType::Int32 => {
let array = arr
.as_any()
.downcast_ref::<Int32Array>()
.expect("Error downcasting to Int32");
let mut row = SparkUnsafeRow::new(&schema);
let mut row_start_addr: usize = addr;
if array.null_count() == 0 {
for i in 0..num_rows {
let row_size =
SparkUnsafeRow::get_row_bitset_width(schema.len()) + 8 * num_cols;
unsafe {
row.point_to_slice(std::slice::from_raw_parts(
row_start_addr as *const u8,
row_size,
));
}
row.set_int(j, array.value(i));
row_start_addr += row_size;
}
} else {
for i in 0..num_rows {
let row_size =
SparkUnsafeRow::get_row_bitset_width(schema.len()) + 8 * num_cols;
unsafe {
row.point_to_slice(std::slice::from_raw_parts(
row_start_addr as *const u8,
row_size,
));
}
if array.is_null(i) {
row.set_null_at(j);
} else {
row.set_int(j, array.value(i));
}
row_start_addr += row_size;
}
}
}
I considered iterating columns first, then rows. This gets a little complicated with variable-length types, because we then have to track, for each row, the offset at which the variable-length part can be written. This part of the code is no longer correct with variable-length types -
let row_size = SparkUnsafeRow::get_row_bitset_width(schema.len()) + 8 * num_cols;
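To make the limitation concrete, here is a minimal sketch of that row-size computation for a schema of only fixed-width fields. The helper names (`bitset_width`, `fixed_row_size`) are hypothetical stand-ins for `SparkUnsafeRow::get_row_bitset_width` and the expression above; the layout they assume (a null bitset padded to 8-byte words, then one 8-byte slot per field) follows Spark's UnsafeRow format:

```rust
// Null bitset is stored in 8-byte words, one bit per field
// (mirrors Spark's UnsafeRow.calculateBitSetWidthInBytes).
fn bitset_width(num_fields: usize) -> usize {
    ((num_fields + 63) / 64) * 8
}

// Row size when every field is fixed-width: bitset + 8 bytes per field.
// This is constant for the whole batch, so it can be hoisted out of the
// per-row loop - but it undercounts as soon as variable-length values
// must be appended after the fixed-width region.
fn fixed_row_size(num_fields: usize) -> usize {
    bitset_width(num_fields) + 8 * num_fields
}
```

Because `fixed_row_size` ignores the variable-length region, every row after the first would land at the wrong start address for schemas with string or binary columns.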
@andygrove I tried your change and got -
Running benchmark: ColumnarToRowExec
Running case: Spark Columnar To Row - integer
Stopped after 35 iterations, 2037 ms
Running case: Comet Columnar To Row - integer
Stopped after 15 iterations, 2022 ms
OpenJDK 64-Bit Server VM 11.0.19+7-LTS on Mac OS X 14.6
Apple M3 Max
ColumnarToRowExec: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Spark Columnar To Row - integer 42 58 13 249.5 4.0 1.0X
Comet Columnar To Row - integer 52 135 36 200.9 5.0 0.8X
So not really any better. I also tried another change where I precompute the row lengths and row start addresses to accommodate variable-length fields, then look up the row lengths as I iterate over the rows for each column. That slowed down to 0.7x of Spark.
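The precomputation described above amounts to a prefix sum over per-row sizes. A minimal sketch, assuming the per-row sizes (fixed part plus any variable-length payload) have already been computed; `row_offsets` is a hypothetical helper, not the actual Comet code:

```rust
// Given each row's total size in bytes, derive each row's start offset
// within the shared off-heap buffer via a running (exclusive) prefix sum.
fn row_offsets(row_sizes: &[usize]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(row_sizes.len());
    let mut acc = 0usize;
    for &size in row_sizes {
        offsets.push(acc);
        acc += size;
    }
    offsets
}
```

The extra pass over the data plus the per-row offset lookup in the hot loop is a plausible source of the slowdown reported above.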
I don't think this is going to work, so I'm leaving this closed.
Which issue does this PR close?
Closes #708
Rationale for this change
The columnar-to-row operation is currently slow; this PR attempts to speed it up.
What changes are included in this PR?
Implements a native columnar-to-row operation for integer and floating-point types. Variable-length types can be added later. The implementation allocates an off-heap buffer large enough to hold an entire batch of UnsafeRows and passes it to the native layer, which converts the Arrow vectors into UnsafeRows and writes them into the allocated buffer. The JVM code then simply points its UnsafeRows at the underlying buffer. The native conversion copies the data from the Arrow vector into the UnsafeRows; this is currently the performance bottleneck.
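For a fixed-width column, the per-value copy the native layer performs can be sketched as below. This is a simplified stand-in for `SparkUnsafeRow::set_int`, not the actual Comet API; it assumes Spark's UnsafeRow fixed-width layout, where field `ordinal` occupies the 8-byte little-endian slot after the null bitset:

```rust
// Write a non-null i32 into field `ordinal` of a row stored in `row`.
// `bitset_width` is the size of the row's null bitset in bytes; the value
// is sign-extended to 8 bytes, matching UnsafeRow's word-sized slots.
fn set_int(row: &mut [u8], bitset_width: usize, ordinal: usize, value: i32) {
    let off = bitset_width + 8 * ordinal;
    row[off..off + 8].copy_from_slice(&(value as i64).to_le_bytes());
}
```

Doing this once per value per row is where the copy cost discussed in this thread accumulates.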
How are these changes tested?
Additional unit test.