Efficiency Problem: Parallelization and vectorization

Lordworms commented 6 months ago

Is your feature request related to a problem or challenge?

I was doing a course project on efficiency comparison. And I try on TPC-H benchmark to compare the efficiency between datafusion and duckDB. The results indicated that There might be some efficiency issues. I also noticed that the effective CPU use time of datafusion is much higher than DuckDB, but the runtime on TPC-H is slower(seems like we did not really do parallism and I really think that's some problem comes from Tokio) This is DuckDB's result This is Datafusion's result

Also the flame graph shows that datafusion has a much deeper stack. duckDB

datafusion

I kind of generated some distrust towards Tokio.

I doubt whether the slower performance is due to incomplete use of SIMD instruction so I did some statistics on SIMD instructions using PIN(may be the result is not that precise, but I expected the number of SIMD instruction generated should be comparable), the results shows below	SIMD instruction	datafusion number
ADDSD	34	25
CMPSD_XMM	1	6
COMISD	-	44
DIVSD	14	32
MAXSD	1	1
MULSD	21	52
PACKUSWB	5	7
PADDB	30	12
PADDD	100	33
PADDQ	291	200
PADDW	8	5
PCMPEQB	548	544
PCMPEQD	58	38
PCMPGTB	-	1
PCMPGTD	44	14
PCMPGTW	-	6
PMINUB	8	20
PMOVMSKB	1169	278
PMULHUW	1	2
PMULLW	1	2
PMULUDQ	-	4
PSHUFD	646	88
PSLLD	6	2
PSLLDQ	72	217
PSLLQ	213	16
PSLLW	30	2
PSRAD	8	-
PSRLD	3	40
PSRLDQ	39	179
PSRLQ	11	7
PSUBB	84	243
PSUBD	4	3
PSUBQ	12	4
PSUBUSB	-	6
PSUBW	-	6
PUNPCKHBW	41	7
PUNPCKHDQ	45	66
PUNPCKHQDQ	102	14
PUNPCKHWD	42	50
PUNPCKLBW	211	19
PUNPCKLDQ	94	338
PUNPCKLQDQ	353	2713
PUNPCKLWD	73	80
ROUNDSD	1	-
SHUFPD	4	20
SHUFPS	-	28
SQRTSD	-	2
SUBSD	10	19
UCOMISD	16	39
VPCMPB	56	86
VPCMPUB	206	19
VPMINUB	2	15
Total	4851	5293

Turns out that datafusion may use less SIMD instructions than DuckDB (that might be the rustc problem)

Describe the solution you'd like

I plan to do this week after next after. But got no clues yet

Describe alternatives you've considered

No response

Additional context

No response

Lordworms commented 6 months ago

@alamb I am kinda stuck here, could you please provide some clues about this one? Thanks

yyy1000 commented 6 months ago

probably related: #5942

Lordworms commented 6 months ago

My current plan for this is to generate a vectorization instruction coverage in CI/CD to track the usage of SIMD instructions. Also I think tokio may got some bugs for this. Maybe start to add parallism for different operator. Probably starting with SCAN

alamb commented 6 months ago

Hi @Lordworms -- thank you for this analysis.

(seems like we did not really do parallism and I really think that's some problem comes from Tokio)

I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it.

We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (https://github.com/apache/arrow-datafusion/issues/6782) DataFusion_Query_Engine___SIGMOD_2024.pdf where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in.

If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts:

The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. HashJoinExec) is important for good TPCH
The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see https://github.com/apache/arrow-datafusion/issues/8398 for a list of potential ideas)

Omega359 commented 6 months ago

I run DF on a c7i.48xlarge instance type in aws (192 cores, 384GB RAM) and during my processing I'm seeing almost 100% cpu usage across the board. So parallelism in my usecase is essentially perfect - though I can't speak for the efficiency.

apache / datafusion