Replace `Box`es with `Arc` in the `Expr` `enum`.

mustafasrepo commented 4 months ago

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

According to following stackoverflow discussion. Boxs can deep copy when called with .clone() method (according to .clone() implementation of the underlying type.).

For Box<Expr> this is the case. I think this usage might be the reason of some deep stack usages seen during the planning.

See related issues: #9375, #8837.

I think, replacing Box<Expr> usages with Arc<Expr> under the enum Expr would improve performance. I am not familiar with the implications of these two approaches in other places. I wonder what community thinks about this change. Would it be better, unnecessary, etc?

Describe alternatives you've considered

No response

Additional context

No response

jayzhan211 commented 4 months ago

~Is it possible to remove Box to lower the cost?~ It seems we need to deal with recursive types

Summary for my note:

whether Box<T> does shallow copy or deep copy depends on T, not always deep copy.
Rc<T> / Arc<T> has overhead so whether they are faster than Box<T> may need benchmarks to ensure that.

mustafasrepo commented 4 months ago

~Is it possible to remove Box to lower the cost?~ It seems we need to deal with recursive types

Summary for my note:

whether Box<T> does shallow copy or deep copy depends on T, not always deep copy.

Rc<T> / Arc<T> has overhead so whether they are faster than Box<T> may need benchmarks to ensure that.

Exactly. I also think that there are pros and cons of these approaches. For recursive types, I think deep clone problem is more evident.

comphead commented 4 months ago

It would be an interesting experiment, @jayzhan211 please elaborate how Box does shallow copy? I checked https://github.com/rust-lang/rust/blob/3cbb93223f33024db464a4df27a13c7cce870173/library/alloc/src/boxed.rs#L1306

It has its own .clone for &str and [] but I didnt see how the same type can be switched between shallow and deep copy.

jayzhan211 commented 4 months ago

It would be an interesting experiment, @jayzhan211 please elaborate how Box does shallow copy? I checked https://github.com/rust-lang/rust/blob/3cbb93223f33024db464a4df27a13c7cce870173/library/alloc/src/boxed.rs#L1306

It has its own .clone for &str and [] but I didnt see how the same type can be switched between shallow and deep copy.

I don't know how to demonstrate if it is shallow copy in rust playground The summary is inferred that Box::clone is doing T::clone, see write_clone_into_raw, it does self.clone() which is T.clone()

#[cfg(not(no_global_oom_handling))]
/// Specialize clones into pre-allocated, uninitialized memory.
/// Used by `Box::clone` and `Rc`/`Arc::make_mut`.
pub(crate) trait WriteCloneIntoRaw: Sized {
    unsafe fn write_clone_into_raw(&self, target: *mut Self);
}

#[cfg(not(no_global_oom_handling))]
impl<T: Clone> WriteCloneIntoRaw for T {
    #[inline]
    default unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // Having allocated *first* may allow the optimizer to create
        // the cloned value in-place, skipping the local and move.
        unsafe { target.write(self.clone()) };
    }
}

#[cfg(not(no_global_oom_handling))]
impl<T: Copy> WriteCloneIntoRaw for T {
    #[inline]
    unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // We can always copy in-place, without ever involving a local value.
        unsafe { target.copy_from_nonoverlapping(self, 1) };
    }
}

Based on https://stackoverflow.com/questions/31012923/what-is-the-difference-between-copy-and-clone, Doc for clone and many random comments. Clone do either shallow copy or deep copy.

Differs from Copy in that Copy is implicit and an inexpensive bit-wise copy, while Clone is always explicit and may or may not be expensive

I think types like &T, Rc, Arc are shallow copy (clone the pointer only not the underlying data)

alamb commented 4 months ago

I think, replacing Box usages with Arc under the enum Expr would improve performance.

I basically agree with this assessment but I have an alternate proposal for how to improve performance

I agree that the current clone() of Exprs (and Box::clone does a deep copy)
I believe that the cloneing of Exprs takes a significant amount of planning time in DataFusion (I was looking at cargo bench --bench sql_planner the other day and substantial amounts of time are spent cloning
I think the "ideal" rust pattern for rewriting Exprs is what we have in our ExprRewriters -- take an owned Expr and then destructure / rewrite it as needed
There are many places in our code where Expr::clone() is called Unecessirly
If we changed Expr to use Arc instead of Box I think we would get a planning speedup, but it would be less performant than the ideal pattern (as rewritten nodes would likely rewrite copying, much like LogicalPlan), and it would be a large API change for downstream consumers

Here is an example of what I think is a good pattern (there are no copies except when needed)

https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L638-L648

Here is an example of where Expr cloning is being used unnecessarily

https://github.com/apache/arrow-datafusion/blob/9d0c05b9c2e5f9e774ab1ea08599ac8d4b23c93f/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L45-L49

Thus my suggestion is to go through the planner and remove the calls to Expr::clone() as much as possible (likely letting cargo bench --bench sql_planner be our guide.

This would avoid any changes required for downstream consumers

mustafasrepo commented 4 months ago

Thanks @alamb for your answer. I also think that removing existing unnecessary .clones, replacing them with owned variants is a better approach with keeping the type as is. If you have some preliminary findings such as "Rule x has a lot of .clone with lots of overhead." We can prioritize those sections.

alamb commented 4 months ago

I did some profiling locally by running the following

cargo bench --bench sql_planner -- physical_plan_tpch_all

My analysis is that almost 40% of the planning time is spent in SimplifyExprs and CommonSubexprEliminate

I suspect there are many ways we could reduce clones in those passes

jayzhan211 commented 4 months ago

I'm interested in optimizing these (avoid clones), btw What is this profiling tool?

alamb commented 4 months ago

I'm interested in optimizing these (avoid clones),

❤️

btw What is this profiling tool?

I used Instruments (CPU profiler) that comes as part of XCode on Mac OSX

I have also used hotspot for Linux https://github.com/KDAB/hotspot which has similar capabilities

Maybe I should make a video about "how to profile / interpret stack traces to optimize DataFusion" 🤔

alamb commented 4 months ago

BTW I think https://github.com/apache/arrow-datafusion/issues/9140 would be a good first start as the inlist simplifier both uses clones as well as does (yet another) tree walk

Maybe we could port it over into the main ExprSimplifer loop in a few PRs 🤔

comphead commented 4 months ago

Maybe I should make a video about "how to profile / interpret stack traces to optimize DataFusion" 🤔

Would be great: video or screens, whatever works and we can attach it to DF docs

alamb commented 4 months ago

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

alamb commented 4 months ago

I thought about this challenge last night and wrote up my thoughts here: https://github.com/apache/arrow-datafusion/issues/9637

alamb commented 4 months ago

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

comphead commented 4 months ago

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

I'd say its great, thanks @alamb, the font not always clear though, but the video gives the understanding what should be happening.

Today/tomorrow I'm planning to add a profiling doc for MacOS only, how to do a profiling and build flamegraphs and also include this Youtube link. Unix and Window related contributors can add their part later

comphead commented 4 months ago

https://github.com/apache/arrow-datafusion/pull/9711

jayzhan211 commented 4 months ago

Would be great: video or screens, whatever works and we can attach it to DF docs

I will try and do this over the next week or so

@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?

@alamb Thanks for your video, I think it is really helpful.

apache / datafusion