apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.93k stars 1.12k forks source link

Efficiency Problem: Parallelization and vectorization #9547

Open Lordworms opened 6 months ago

Lordworms commented 6 months ago

Is your feature request related to a problem or challenge?

I was doing a course project on efficiency comparison. And I try on TPC-H benchmark to compare the efficiency between datafusion and duckDB. The results indicated that There might be some efficiency issues. I also noticed that the effective CPU use time of datafusion is much higher than DuckDB, but the runtime on TPC-H is slower(seems like we did not really do parallism and I really think that's some problem comes from Tokio) This is DuckDB's result 9a087441e7ecb79d01fc382d14f47ffe This is Datafusion's result 1980a8f73e043fff172c3763114110e3

Also the flame graph shows that datafusion has a much deeper stack. duckDB 1def3e3446638dbd5fd305db09421227

datafusion 26f898b9cbf0352ea76383ae1faf7d88

I kind of generated some distrust towards Tokio.

I doubt whether the slower performance is due to incomplete use of SIMD instruction so I did some statistics on SIMD instructions using PIN(may be the result is not that precise, but I expected the number of SIMD instruction generated should be comparable), the results shows below SIMD instruction datafusion number duckDB number
ADDSD 34 25
CMPSD_XMM 1 6
COMISD - 44
DIVSD 14 32
MAXSD 1 1
MULSD 21 52
PACKUSWB 5 7
PADDB 30 12
PADDD 100 33
PADDQ 291 200
PADDW 8 5
PCMPEQB 548 544
PCMPEQD 58 38
PCMPGTB - 1
PCMPGTD 44 14
PCMPGTW - 6
PMINUB 8 20
PMOVMSKB 1169 278
PMULHUW 1 2
PMULLW 1 2
PMULUDQ - 4
PSHUFD 646 88
PSLLD 6 2
PSLLDQ 72 217
PSLLQ 213 16
PSLLW 30 2
PSRAD 8 -
PSRLD 3 40
PSRLDQ 39 179
PSRLQ 11 7
PSUBB 84 243
PSUBD 4 3
PSUBQ 12 4
PSUBUSB - 6
PSUBW - 6
PUNPCKHBW 41 7
PUNPCKHDQ 45 66
PUNPCKHQDQ 102 14
PUNPCKHWD 42 50
PUNPCKLBW 211 19
PUNPCKLDQ 94 338
PUNPCKLQDQ 353 2713
PUNPCKLWD 73 80
ROUNDSD 1 -
SHUFPD 4 20
SHUFPS - 28
SQRTSD - 2
SUBSD 10 19
UCOMISD 16 39
VPCMPB 56 86
VPCMPUB 206 19
VPMINUB 2 15
Total 4851 5293

Turns out that datafusion may use less SIMD instructions than DuckDB (that might be the rustc problem)

Describe the solution you'd like

I plan to do this week after next after. But got no clues yet

Describe alternatives you've considered

No response

Additional context

No response

Lordworms commented 6 months ago

@alamb I am kinda stuck here, could you please provide some clues about this one? Thanks

yyy1000 commented 6 months ago

probably related: #5942

Lordworms commented 6 months ago

My current plan for this is to generate a vectorization instruction coverage in CI/CD to track the usage of SIMD instructions. Also I think tokio may got some bugs for this. Maybe start to add parallism for different operator. Probably starting with SCAN

alamb commented 6 months ago

Hi @Lordworms -- thank you for this analysis.

(seems like we did not really do parallism and I really think that's some problem comes from Tokio)

I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it.

We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (https://github.com/apache/arrow-datafusion/issues/6782) DataFusion_Query_Engine___SIGMOD_2024.pdf where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in.

If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts:

  1. The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. HashJoinExec) is important for good TPCH
  2. The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see https://github.com/apache/arrow-datafusion/issues/8398 for a list of potential ideas)
Omega359 commented 6 months ago

I run DF on a c7i.48xlarge instance type in aws (192 cores, 384GB RAM) and during my processing I'm seeing almost 100% cpu usage across the board. So parallelism in my usecase is essentially perfect - though I can't speak for the efficiency.

image