Open Lordworms opened 6 months ago
@alamb I am kinda stuck here, could you please provide some clues about this one? Thanks
probably related: #5942
My current plan for this is to generate vectorization instruction coverage in CI/CD to track the usage of SIMD instructions. I also think Tokio may have some bugs related to this. Maybe we could start adding parallelism to different operators, probably starting with SCAN.
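One rough way such a coverage number could be computed is sketched below. This is only an illustration, not an existing DataFusion tool: the `SimdCoverage` struct and `simd_coverage` function are hypothetical names, the input is assumed to be x86-64 `objdump -d` text with tab-separated address, bytes, and instruction fields, and "SIMD" is approximated by which vector register an instruction mentions. Scalar floating-point code also uses `xmm` registers, so the `ymm` and `zmm` counts are probably the more meaningful signal.

```rust
/// Instruction counts for a disassembly, bucketed by the widest
/// vector register family each instruction mentions.
#[derive(Debug, Default, PartialEq)]
struct SimdCoverage {
    total: usize,
    xmm: usize,
    ymm: usize,
    zmm: usize,
}

/// Heuristic scan of `objdump -d` style text (assumed layout:
/// "address:<TAB>bytes<TAB>mnemonic operands").
fn simd_coverage(disassembly: &str) -> SimdCoverage {
    let mut cov = SimdCoverage::default();
    for line in disassembly.lines() {
        // Lines without a third tab-separated field (section headers,
        // symbol labels, byte continuations) are not counted.
        let Some(insn) = line.split('\t').nth(2) else {
            continue;
        };
        let insn = insn.trim();
        if insn.is_empty() {
            continue;
        }
        cov.total += 1;
        // Classify by register name; this is an approximation, not a decoder.
        if insn.contains("zmm") {
            cov.zmm += 1;
        } else if insn.contains("ymm") {
            cov.ymm += 1;
        } else if insn.contains("xmm") {
            cov.xmm += 1;
        }
    }
    cov
}

fn main() {
    // Hand-written sample lines in the assumed format.
    let sample = "  401000:\t55\tpush   %rbp\n\
                  \x20 401001:\tc5 fd 58 c1\tvaddpd %ymm1,%ymm0,%ymm0\n\
                  \x20 401005:\t66 0f 58 c1\taddpd  %xmm1,%xmm0\n\
                  \x20 401009:\tc3\tret\n";
    let cov = simd_coverage(sample);
    println!("{cov:?}");
}
```

A CI job could run this over the disassembly of a benchmark binary and record the ratios over time, so that a drop in vectorized instructions shows up as a regression.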
Hi @Lordworms -- thank you for this analysis.
(seems like we did not really do parallelism and I really think that's some problem that comes from Tokio)
I do not agree with this statement in general (though it may be that TPCH parallelism could be improved). DataFusion uses a significant amount of CPU / parallelism, and while tokio results in more complicated stack traces for sure, I think overall the benefits are worth it.
We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (https://github.com/apache/arrow-datafusion/issues/6782, DataFusion_Query_Engine___SIGMOD_2024.pdf), where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in.
If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts:
HashJoinExec
) is important for good TPCH
I run DF on a c7i.48xlarge instance type in AWS (192 cores, 384GB RAM), and during my processing I'm seeing almost 100% CPU usage across the board. So parallelism in my use case is essentially perfect, though I can't speak for the efficiency.
Is your feature request related to a problem or challenge?
I was doing a course project on efficiency comparison, and I ran the TPC-H benchmark to compare the efficiency of DataFusion and DuckDB. The results indicate that there might be some efficiency issues. I also noticed that the effective CPU time used by DataFusion is much higher than DuckDB's, yet its runtime on TPC-H is slower (seems like we did not really do parallelism, and I really think that's some problem that comes from Tokio).
This is DuckDB's result
This is DataFusion's result
Also, the flame graph shows that DataFusion has a much deeper stack.
DuckDB
DataFusion
This made me somewhat distrustful of Tokio.
It turns out that DataFusion may use fewer SIMD instructions than DuckDB (that might be a rustc problem).
Describe the solution you'd like
I plan to work on this the week after next, but I have no clues yet.
Describe alternatives you've considered
No response
Additional context
No response