Open shuai-xu opened 1 week ago
@shuai-xu, thanks for raising this issue!
I think we need to clean up the -march setting that uses the native flag. Could you also remove the code below and try again?
https://github.com/apache/incubator-gluten/blob/main/cpp/core/CMakeLists.txt#L27
@PHILO-HE After removing this line, it does not coredump.
@shuai-xu, thanks so much for your feedback!
As @zhouyuan told me, newer gcc versions (e.g., gcc-11) make full use of the native CPU's instructions and optimizations when -march=native is specified. But this can make the binary (which is compiled for your relatively new CPU architecture) not runnable on some older CPU architectures (in your case, an AVX2 CPU).
So essentially, this is not a gcc-11 issue. It's caused by our -march setting. It is probably not rare for diverse CPU architectures to coexist in a user's cluster. So maybe a generic compiler setting should be used by default in our code.
cc @zhouyuan, @FelixYBW
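To check whether a given worker lacks an instruction set that the build machine has, one option is to compare the feature flags the kernel reports. The following is a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo contains a "flags" line; the function names are hypothetical helpers, not part of Gluten or Velox.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (e.g. non-x86 hosts use a different key).
    return set()


def missing_features(cpuinfo_text, required):
    """Return the subset of `required` flags that this CPU does not report."""
    return set(required) - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # avx2 and avx512f are the two feature levels discussed in this issue.
        missing = missing_features(cpuinfo.read_text(), {"avx2", "avx512f"})
        print("missing features:", sorted(missing) or "none")
```

Running this on each worker shows which machines would hit an illegal instruction if the binary was built with -march=native on an AVX-512 capable machine.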
@PHILO-HE Let's add -mno-avx512f to gluten cpp compile flags. It's used by Velox as well. It can solve the issue fundamentally.
@shuai-xu when you compile gluten and velox using march=native, your gcc optimizes the binary for the machine on which you are building Gluten, not for the worker machine. To get the best performance, you may set march=ivybridge or whatever machine type matches your worker machine.
@FelixYBW Thank you for explaining, I learned a lot.
But adding -mno-avx512f doesn't help; removing -march=native or setting march to build for just avx2 could be the solution.
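When a cluster mixes CPU generations, the safe build target is whatever every machine supports. The sketch below, under the assumption that each host's flag set has already been collected (for example from /proc/cpuinfo), computes the common feature set and lists the hosts that lack a given feature. The host names and flag sets are made up for illustration and are not taken from this issue.

```python
def common_flags(flags_by_host):
    """Return the CPU flags reported by every host (empty set if no hosts)."""
    sets = [set(flags) for flags in flags_by_host.values()]
    return set.intersection(*sets) if sets else set()


def hosts_lacking(flags_by_host, feature):
    """Return the sorted names of hosts that do not report `feature`."""
    return sorted(h for h, flags in flags_by_host.items() if feature not in flags)


if __name__ == "__main__":
    # Hypothetical cluster: one AVX-512 build machine, two AVX2-only workers.
    cluster = {
        "build-1": {"sse4_2", "avx", "avx2", "avx512f"},
        "run-2": {"sse4_2", "avx", "avx2"},
        "run-3": {"sse4_2", "avx", "avx2"},
    }
    print("common flags:", sorted(common_flags(cluster)))
    print("hosts without avx512f:", hosts_lacking(cluster, "avx512f"))
```

If avx512f is absent from the common set, a binary built with -march=native on the AVX-512 machine is not safe to ship to the whole cluster, which matches the behavior reported here.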
Backend
VL (Velox)
Bug description
I compiled gluten with velox and then ran it with spark. There are three machines in the test cluster. It always coredumps on machines 2 and 3 while running normally on machine 1. The stack is:
Then I compiled gluten on another machine, and it runs normally on all the machines. After comparing the two compile machines, I found the main difference is the version of g++. After I changed g++ from g++-11 to g++-10, it works. The machine info is listed in the System information part.
Spark version
Spark-3.3.x
Spark configurations
No response
System information
Compile machine 1:
Compile machine 2:
Run machine 1 is the same as Compile machine 2.
Run machine 2:
Run machine 3:
Relevant logs
No response