cmuparlay / parlaylib

A Toolkit for Programming Parallel Algorithms on Shared-Memory Multicore Machines
MIT License

Portability on Mac ARM builds #42

Closed ldhulipala closed 1 year ago

ldhulipala commented 1 year ago

Building the benchmarks as shown here fails with an error that clang doesn't support -march=native:

cmake -DCMAKE_BUILD_TYPE=Release -DPARLAY_BENCHMARK=On ../..
...
[ 84%] Building CXX object benchmark/CMakeFiles/bench_delayed.dir/bench_delayed.cpp.o
clang: error: the clang compiler does not support '-march=native'
clang: error: the clang compiler does not support '-march=native'
clang: error: the clang compiler does not support '-march=native'
clang: error: the clang compiler does not support '-march=native'

Removing -march=native and passing -mcpu=apple-m1 gets the code to compile. However, I noticed some non-termination and bus errors when running on the newer Macs.

Will update below with more concrete steps to reproduce any actual issues that pop up.

ldhulipala commented 1 year ago

Looks like adding "-arch x86_64" instead of "-mcpu=apple-m1" fixes the issues I was observing. It does add non-trivial overhead, though: about 2x for an uncoarsened reduce implementation, which essentially measures only the scheduler overhead. I get pretty much the same slowdown when changing every std::memory_order_relaxed to std::memory_order_seq_cst in scheduler.h (which also fixes the non-termination and bus errors, but is a hacky "fix").
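For readers who want to repeat this kind of experiment without editing every call site, here is a minimal sketch of routing the memory order through a single constant. The names kRelaxed, DEBUG_SEQ_CST and count_in_parallel are hypothetical and are not part of parlaylib or its scheduler.h; the counter is only a stand-in workload.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical switch (an assumption, not a parlaylib option): compile with
// -DDEBUG_SEQ_CST to turn every "relaxed" site below into seq_cst.
#ifdef DEBUG_SEQ_CST
constexpr std::memory_order kRelaxed = std::memory_order_seq_cst;
#else
constexpr std::memory_order kRelaxed = std::memory_order_relaxed;
#endif

// Each of num_threads threads performs per_thread atomic increments.
// A plain counter is correct under either ordering, so only the cost changes.
std::size_t count_in_parallel(std::size_t num_threads, std::size_t per_thread) {
  std::atomic<std::size_t> counter{0};
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, per_thread] {
      for (std::size_t i = 0; i < per_thread; ++i) {
        counter.fetch_add(1, kRelaxed);
      }
    });
  }
  for (auto& w : workers) w.join();
  // join() already synchronizes with each worker, so this load sees all adds.
  return counter.load(kRelaxed);
}
```

Timing a call such as count_in_parallel(4, 1000000) with and without the flag gives a rough feel for the cost of the stronger ordering on a given machine. If a bug disappears under the flag, that suggests an ordering problem but does not say which site is wrong.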

DanielLiamAnderson commented 1 year ago

I'm a little confused. You say it's an ARM machine, but using x86_64 fixes it?

The performance difference is not super surprising since atomic instructions on ARM are heavier in general than x86.

The bus errors might indicate that the memory orders in the scheduler are not correct on ARM (which technically means they're not correct at all; they may only work on x86 by luck because of its stronger memory model). It's possible that fixing them would also help performance.
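To make the point about memory orders concrete, here is a minimal sketch of the usual release/acquire publication pattern. It is an illustration only, not the parlaylib scheduler code, and publish_and_read is a hypothetical name. If both atomic operations were relaxed, the read of payload would be a data race: x86 hardware does not reorder the two stores or the two loads, so such code often appears to work there, while ARM is allowed to reorder them.

```cpp
#include <atomic>
#include <thread>

// Hands a plain (non-atomic) int from a producer thread to a consumer thread
// through an atomic flag, and returns the value the consumer observed.
int publish_and_read(int value) {
  int payload = 0;                  // ordinary data, protected only by the flag
  std::atomic<bool> ready{false};
  int seen = -1;

  std::thread consumer([&] {
    // Acquire load: once it reads true, the write to payload is visible.
    while (!ready.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    seen = payload;
  });

  std::thread producer([&] {
    payload = value;
    // Release store: keeps the payload write ordered before the flag write.
    ready.store(true, std::memory_order_release);
  });

  producer.join();
  consumer.join();
  return seen;
}
```

Release/acquire is typically cheaper than std::memory_order_seq_cst on ARM, which is one reason correcting the orders individually might recover some of the performance lost by switching everything to seq_cst.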

ldhulipala commented 1 year ago

I think M1 has an x86_64 emulation layer, and the flags above generate an x86_64 binary. Some overhead could be coming from this emulation, but I couldn't find much documentation online about this and so this is pure speculation.

I agree that this points to some of the memory ordering instructions in the scheduler not being correct. Since the non-termination is very easy to trigger on M1 it would be a good platform to debug the scheduler.

gblelloch commented 1 year ago

I assume you tried with none of -arch x86_64, -march=native, -mcpu=apple-m1?

ldhulipala commented 1 year ago

Guy, thanks! It looks like that also works and yields an x86_64 binary. You can check this by running file binary_name, which prints:

reduce: Mach-O 64-bit executable x86_64
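The same check can be made from inside the program. The sketch below reports the architecture the code was compiled for, using compiler-predefined macros; target_arch is a hypothetical helper and not part of parlaylib. Because the decision is made at compile time, a binary built for x86_64 reports x86_64 even when it runs under emulation on an ARM Mac.

```cpp
#include <string>

// Returns the architecture this translation unit was compiled for.
// __aarch64__ / __arm64__ are predefined by GCC and Clang on 64-bit ARM,
// __x86_64__ on 64-bit x86; _M_ARM64 / _M_X64 are the MSVC equivalents.
std::string target_arch() {
#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64)
  return "arm64";
#elif defined(__x86_64__) || defined(_M_X64)
  return "x86_64";
#else
  return "other";
#endif
}
```

Printing target_arch() at the start of a benchmark run makes it easy to see whether the timings come from a native or an emulated build.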

DanielLiamAnderson commented 1 year ago

Could you please check whether the latest commit fixes the issue?

ldhulipala commented 1 year ago

The latest changes to the Deque struct fix the issues on my end. I'll reopen this if I run into further issues on M1.