apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] coredump if compiled by g++-11 #7950

Open shuai-xu opened 1 week ago

shuai-xu commented 1 week ago

Backend

VL (Velox)

Bug description

I compiled Gluten with Velox and ran it with Spark. There are three machines in the test cluster. It always core dumps on machines 2 and 3, while running normally on machine 1. The stack trace is: image. I then compiled Gluten on another machine, and that build runs normally on all the machines. After comparing the two compile machines, I found that the main difference is the g++ version: after changing from g++-11 to g++-10, it works. The machine details are listed in the System information section.

Spark version

Spark-3.3.x

Spark configurations

No response

System information

Compile machine 1: image

Compile machine 2: image

Run machine 1: same as compile machine 2.

Run machine 2: image

Run machine 3: image

Relevant logs

No response

PHILO-HE commented 1 week ago

@shuai-xu, thanks for raising this issue!

I think we need to clean up the march setting with native flag. Could you also remove the below code to try again?

https://github.com/apache/incubator-gluten/blob/main/cpp/core/CMakeLists.txt#L27

shuai-xu commented 1 week ago

@PHILO-HE After removing this line, it no longer core dumps.

PHILO-HE commented 6 days ago

@shuai-xu, thanks so much for your feedback!

As @zhouyuan told me, newer gcc versions (e.g., gcc-11) make full use of the native CPU's instructions and optimizations when -march=native is specified. But this can make the binary (which is compiled for your relatively new CPU architecture) unable to run on some older CPU architectures (in your case, an AVX2 CPU). So essentially, this is not a gcc-11 issue. It is caused by our -march setting. It is probably not rare for diverse CPU architectures to coexist in a user's cluster, so a generic compiler setting should perhaps be the default in our code.

cc @zhouyuan, @FelixYBW

FelixYBW commented 6 days ago

@PHILO-HE Let's add -mno-avx512f to Gluten's C++ compile flags. Velox uses it as well. It can solve the issue fundamentally.

@shuai-xu when you compile Gluten and Velox with -march=native, gcc optimizes the binary for the machine you are building Gluten on, not for the worker machines. To get the best performance, you can set -march=ivybridge, or whichever machine type matches your worker machines.

shuai-xu commented 4 days ago

@FelixYBW Thank you for explaining, I learned a lot.

surnaik commented 9 hours ago

However, adding -mno-avx512f doesn't help. Removing -march=native, or setting -march to build for just AVX2, could be the solution.