Open mustafasrepo opened 4 months ago
~Is it possible to remove Box to lower the cost?~ It seems we need to deal with recursive types.
Summary for my note:
- Whether `Box<T>` does a shallow copy or a deep copy depends on `T`; it is not always a deep copy.
- `Rc<T>` / `Arc<T>` has overhead, so whether they are faster than `Box<T>` may need benchmarks to confirm.
Exactly. I also think that there are pros and cons to these approaches. For recursive types, I think the deep clone problem is more evident.
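To make the recursive-type point concrete, here is a minimal sketch. `ToyExpr`, `left_deep`, and the clone counter are hypothetical names invented for illustration; this is not DataFusion's `Expr`. It shows why the `Box` is needed at all and that one `clone()` of the root visits every node in the tree.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many nodes have been cloned on this thread (illustration only).
    static CLONED_NODES: Cell<usize> = Cell::new(0);
}

// Hypothetical stand-in for a recursive expression type; not the real `Expr`.
#[derive(Debug, PartialEq)]
enum ToyExpr {
    Literal(i64),
    // Without the `Box`, `ToyExpr` would contain itself and have infinite size.
    Add(Box<ToyExpr>, Box<ToyExpr>),
}

impl Clone for ToyExpr {
    fn clone(&self) -> Self {
        CLONED_NODES.with(|c| c.set(c.get() + 1));
        match self {
            ToyExpr::Literal(v) => ToyExpr::Literal(*v),
            // `Box::clone` allocates a new box and clones the child into it.
            ToyExpr::Add(l, r) => ToyExpr::Add(l.clone(), r.clone()),
        }
    }
}

/// Builds a left-deep chain of `depth` additions: ((1 + 1) + 1) + ...
fn left_deep(depth: usize) -> ToyExpr {
    let mut expr = ToyExpr::Literal(1);
    for _ in 0..depth {
        expr = ToyExpr::Add(Box::new(expr), Box::new(ToyExpr::Literal(1)));
    }
    expr
}

/// Clones `expr` once and returns how many nodes that single call visited.
fn nodes_cloned_by_one_clone(expr: &ToyExpr) -> usize {
    CLONED_NODES.with(|c| c.set(0));
    let copy = expr.clone();
    assert_eq!(&copy, expr);
    CLONED_NODES.with(|c| c.get())
}

fn main() {
    let expr = left_deep(3);
    println!("nodes cloned: {}", nodes_cloned_by_one_clone(&expr));
}
```

In this toy tree a chain of `depth` additions has `2 * depth + 1` nodes, and a single `clone()` of the root touches all of them.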
It would be an interesting experiment. @jayzhan211, could you please elaborate on how `Box` does a shallow copy? I checked https://github.com/rust-lang/rust/blob/3cbb93223f33024db464a4df27a13c7cce870173/library/alloc/src/boxed.rs#L1306
It has its own `.clone` for `&str` and `[]`, but I didn't see how the same type can be switched between shallow and deep copy.
I don't know how to demonstrate whether it is a shallow copy in the Rust playground.
The summary is inferred from the fact that `Box::clone` calls `T::clone`: see `write_clone_into_raw`, which calls `self.clone()`, i.e. `T::clone()`.
```rust
#[cfg(not(no_global_oom_handling))]
/// Specialize clones into pre-allocated, uninitialized memory.
/// Used by `Box::clone` and `Rc`/`Arc::make_mut`.
pub(crate) trait WriteCloneIntoRaw: Sized {
    unsafe fn write_clone_into_raw(&self, target: *mut Self);
}

#[cfg(not(no_global_oom_handling))]
impl<T: Clone> WriteCloneIntoRaw for T {
    #[inline]
    default unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // Having allocated *first* may allow the optimizer to create
        // the cloned value in-place, skipping the local and move.
        unsafe { target.write(self.clone()) };
    }
}

#[cfg(not(no_global_oom_handling))]
impl<T: Copy> WriteCloneIntoRaw for T {
    #[inline]
    unsafe fn write_clone_into_raw(&self, target: *mut Self) {
        // We can always copy in-place, without ever involving a local value.
        unsafe { target.copy_from_nonoverlapping(self, 1) };
    }
}
```
Based on https://stackoverflow.com/questions/31012923/what-is-the-difference-between-copy-and-clone, the docs for `Clone`, and many scattered comments, `Clone` can do either a shallow copy or a deep copy. From the `Clone` docs:
"Differs from Copy in that Copy is implicit and an inexpensive bit-wise copy, while Clone is always explicit and may or may not be expensive."
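One way to observe this outside the standard library source is to compare pointers after cloning. The sketch below assumes nothing beyond `std`, and the two helper function names are made up for illustration. It clones a `Box<Vec<i32>>`, where `Vec::clone` copies the elements, and a `Box<Rc<Vec<i32>>>`, where `Rc::clone` only bumps a reference count.

```rust
use std::rc::Rc;

/// Returns true if cloning a `Box<Vec<i32>>` produced a separate element buffer.
fn boxed_vec_clone_is_deep() -> bool {
    let original: Box<Vec<i32>> = Box::new(vec![1, 2, 3]);
    let copy = original.clone();
    // `Vec::clone` allocates a new buffer, so the data pointers differ
    // even though the contents compare equal.
    original.as_slice().as_ptr() != copy.as_slice().as_ptr() && *original == *copy
}

/// Returns true if cloning a `Box<Rc<Vec<i32>>>` still shares the inner vector.
fn boxed_rc_clone_is_shallow() -> bool {
    let original: Box<Rc<Vec<i32>>> = Box::new(Rc::new(vec![1, 2, 3]));
    let copy = original.clone();
    // The box itself is a new allocation, but `Rc::clone` only increments
    // the strong count, so both boxes point at the same vector.
    Rc::ptr_eq(&*original, &*copy) && Rc::strong_count(&*original) == 2
}

fn main() {
    println!("Box<Vec<_>> clone copies the data: {}", boxed_vec_clone_is_deep());
    println!("Box<Rc<_>> clone shares the data: {}", boxed_rc_clone_is_shallow());
}
```

In both cases `Box::clone` allocates a new box. What differs is how much work `T::clone` does for the value placed inside it.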
I think types like &T, Rc
I think, replacing `Box<Expr>` usages with `Arc<Expr>` under the `enum Expr` would improve performance.
I basically agree with this assessment, but I have an alternate proposal for how to improve performance.
- There are many calls to `clone()` of `Expr`s (and `Box::clone` does a deep copy)
- `clone`ing of `Expr`s takes a significant amount of planning time in DataFusion (I was looking at `cargo bench --bench sql_planner` the other day and substantial amounts of time are spent cloning)
- The ideal pattern is to take ownership of the `Expr` and then destructure / rewrite it as needed
- In many places `Expr::clone()` is called unnecessarily
- If we used `Arc` instead of `Box` I think we would get a planning speedup, but it would be less performant than the ideal pattern (as rewritten nodes would likely require copying, much like `LogicalPlan`), and it would be a large API change for downstream consumers

Here is an example of what I think is a good pattern (there are no copies except when needed)
Here is an example of where Expr cloning is being used unnecessarily
Thus my suggestion is to go through the planner and remove the calls to `Expr::clone()` as much as possible (likely letting `cargo bench --bench sql_planner` be our guide).
This would avoid any changes required for downstream consumers.
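The trade-off with `Arc` mentioned above can be sketched as follows. `ArcExpr` is a hypothetical toy type, not a proposed DataFusion API. The sketch shows that cloning a node with `Arc` children shares the child rather than copying it, and that rewriting a child that has more than one owner through `Arc::make_mut` copies it first.

```rust
use std::sync::Arc;

// Hypothetical stand-in for an expression whose children sit behind `Arc`.
#[derive(Debug, Clone, PartialEq)]
enum ArcExpr {
    Literal(i64),
    Negate(Arc<ArcExpr>),
}

/// Cloning a node with an `Arc` child shares the child instead of copying it.
fn clone_shares_child() -> bool {
    let expr = ArcExpr::Negate(Arc::new(ArcExpr::Literal(1)));
    let copy = expr.clone();
    match (&expr, &copy) {
        // Both nodes hold a pointer to the same child allocation.
        (ArcExpr::Negate(a), ArcExpr::Negate(b)) => Arc::ptr_eq(a, b),
        _ => false,
    }
}

/// Rewrites a shared node and returns (rewritten value, value seen by the other owner).
fn rewrite_copies_when_shared() -> (ArcExpr, ArcExpr) {
    let mut shared = Arc::new(ArcExpr::Literal(1));
    let other_owner = Arc::clone(&shared);
    // `shared` has two owners, so `make_mut` clones the value before mutating it.
    *Arc::make_mut(&mut shared) = ArcExpr::Literal(2);
    ((*shared).clone(), (*other_owner).clone())
}

fn main() {
    println!("clone shares child: {}", clone_shares_child());
    let (rewritten, untouched) = rewrite_copies_when_shared();
    println!("rewritten = {:?}, other owner still sees {:?}", rewritten, untouched);
}
```

Whether this is a net win for planning would still need to be measured, for example with the `sql_planner` benchmark mentioned above.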
Thanks @alamb for your answer. I also think that removing existing unnecessary `.clone`s and replacing them with owned variants, while keeping the type as is, is the better approach. If you have some preliminary findings such as "Rule x has a lot of `.clone` calls with lots of overhead", we can prioritize those sections.
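As a rough illustration of what an "owned variant" could look like, here is a sketch on a hypothetical `ToyExpr` type. The type and both function names are invented for this example and are not DataFusion code. The same constant-folding rewrite is written twice: once taking ownership, where untouched subtrees are moved, and once taking a reference, where untouched leaves have to be cloned.

```rust
// Hypothetical stand-in for a recursive expression type; not the real `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Literal(i64),
    Column(String),
    Add(Box<ToyExpr>, Box<ToyExpr>),
}

/// Folds `literal + literal` by taking ownership: subtrees are moved, not cloned.
fn fold_owned(expr: ToyExpr) -> ToyExpr {
    match expr {
        ToyExpr::Add(left, right) => {
            // `*left` moves the child out of its box without copying it.
            match (fold_owned(*left), fold_owned(*right)) {
                (ToyExpr::Literal(a), ToyExpr::Literal(b)) => ToyExpr::Literal(a + b),
                (l, r) => ToyExpr::Add(Box::new(l), Box::new(r)),
            }
        }
        // Leaves are returned as-is; a `Column`'s `String` is moved, not copied.
        other => other,
    }
}

/// Same rewrite from a borrowed input: untouched leaves must be cloned.
fn fold_borrowed(expr: &ToyExpr) -> ToyExpr {
    match expr {
        ToyExpr::Add(left, right) => match (fold_borrowed(left), fold_borrowed(right)) {
            (ToyExpr::Literal(a), ToyExpr::Literal(b)) => ToyExpr::Literal(a + b),
            (l, r) => ToyExpr::Add(Box::new(l), Box::new(r)),
        },
        // Every leaf (including each `String` column name) is copied here.
        other => other.clone(),
    }
}

/// Builds the sample expression `(1 + 2) + a`.
fn sample() -> ToyExpr {
    ToyExpr::Add(
        Box::new(ToyExpr::Add(
            Box::new(ToyExpr::Literal(1)),
            Box::new(ToyExpr::Literal(2)),
        )),
        Box::new(ToyExpr::Column("a".to_string())),
    )
}

fn main() {
    let borrowed = fold_borrowed(&sample());
    let owned = fold_owned(sample());
    println!("{:?}\n{:?}", borrowed, owned);
}
```

Both versions produce the same result. The owned version still allocates new boxes when it rebuilds an `Add` node, but it never copies a subtree it did not change.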
I did some profiling locally by running the following
cargo bench --bench sql_planner -- physical_plan_tpch_all
My analysis is that almost 40% of the planning time is spent in SimplifyExprs and CommonSubexprEliminate
I suspect there are many ways we could reduce clones in those passes
I'm interested in optimizing these (avoid clones), btw What is this profiling tool?
❤️
I used Instruments (the CPU profiler), which comes as part of Xcode on macOS.
I have also used hotspot for Linux (https://github.com/KDAB/hotspot), which has similar capabilities.
Maybe I should make a video about "how to profile / interpret stack traces to optimize DataFusion" 🤔
BTW I think https://github.com/apache/arrow-datafusion/issues/9140 would be a good first start as the inlist simplifier both uses clones as well as does (yet another) tree walk
Maybe we could port it over into the main `ExprSimplifier` loop in a few PRs 🤔
Would be great: video or screens, whatever works and we can attach it to DF docs
I will try and do this over the next week or so
I thought about this challenge last night and wrote up my thoughts here: https://github.com/apache/arrow-datafusion/issues/9637
@jayzhan211 and @comphead here is a video showing what I do to profile datafusion: https://youtu.be/P3dXH61Kr5U -- do you think it is worthwhile adding to the docs?
I'd say it's great, thanks @alamb. The font is not always clear, but the video gives an understanding of what should be happening.
Today or tomorrow I'm planning to add a profiling doc for macOS only, covering how to profile and build flamegraphs, and to include this YouTube link. Contributors on Unix and Windows can add their parts later.
@alamb Thanks for your video, I think it is really helpful.
Is your feature request related to a problem or challenge?
No response
Describe the solution you'd like
According to the following Stack Overflow discussion, `Box`es can deep copy when the `.clone()` method is called (according to the `.clone()` implementation of the underlying type). For `Box<Expr>` this is the case. I think this usage might be the reason for some of the deep stack usage seen during planning. See related issues: #9375, #8837.
I think replacing `Box<Expr>` usages with `Arc<Expr>` under the `enum Expr` would improve performance. I am not familiar with the implications of these two approaches in other places. I wonder what the community thinks about this change. Would it be better, unnecessary, etc.?
Describe alternatives you've considered
No response
Additional context
No response