Closed ldhulipala closed 1 year ago
Looks like adding "-arch x86_64" instead of "-mcpu=apple-m1" fixes the issues I was observing. This does seem to add some non-trivial overhead, though: about 2x for an uncoarsened reduce implementation, which essentially just measures overhead in the scheduler. I get pretty much the same slowdown when changing every std::memory_order_relaxed to std::memory_order_seq_cst in scheduler.h (which also fixes the non-termination and bus errors, but is a hacky "fix").
I'm a little confused. You say it's an ARM machine, but using x86_64 fixes it?
The performance difference is not super surprising, since atomic instructions are generally heavier on ARM than on x86.
The bus errors might indicate that the memory orders in the scheduler are not correct on ARM (which technically means they're not correct at all, but they might work on x86 out of pure luck because of its stronger memory model). It's possible that fixing them might also help the performance.
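To make the memory-order point concrete, here is a minimal sketch of a flag-based handoff between two threads. It is a hypothetical illustration, not code from scheduler.h; the names payload, ready, produce, consume, and run_handoff are made up for this example. The release store paired with the acquire load is what guarantees the consumer sees the payload. If both were std::memory_order_relaxed, the program would have a data race: x86 hardware happens to keep stores in order with other stores and loads in order with other loads, so such code often appears to work there, while ARM is free to reorder them.

```cpp
#include <atomic>
#include <thread>

// Hypothetical illustration; not taken from parlaylib's scheduler.h.
int payload = 0;                 // plain (non-atomic) data handed between threads
std::atomic<bool> ready{false};  // flag that publishes the payload

// Producer: write the data, then publish it with a release store.
// The release ordering keeps the payload write from moving after the flag write.
void produce(int value) {
  payload = value;
  ready.store(true, std::memory_order_release);
}

// Consumer: spin until an acquire load observes the flag, then read the data.
// The acquire ordering keeps the payload read from moving before the flag read.
int consume() {
  while (!ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return payload;
}

// Runs one producer and one consumer and returns what the consumer observed.
int run_handoff(int value) {
  ready.store(false, std::memory_order_relaxed);  // reset before any thread starts
  int seen = 0;
  std::thread consumer([&] { seen = consume(); });
  std::thread producer([=] { produce(value); });
  producer.join();
  consumer.join();
  return seen;
}
```

Calling run_handoff(42) should return 42 on any conforming platform. Replacing every order with std::memory_order_seq_cst would also be correct, but it typically costs more on ARM than the release/acquire pair, which may be part of the slowdown reported in this thread.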
I think M1 has an x86_64 emulation layer, and the flags above generate an x86_64 binary. Some overhead could be coming from this emulation, but I couldn't find much documentation online about this and so this is pure speculation.
I agree that this points to some of the memory ordering instructions in the scheduler not being correct. Since the non-termination is very easy to trigger on M1 it would be a good platform to debug the scheduler.
I assume you tried with none of -arch x86_64, -march=native, -mcpu=apple-m1?
Guy, thanks! It looks like that also works and yields an x86_64 binary (you can check this by running "file binary_name"):
reduce: Mach-O 64-bit executable x86_64
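As a complement to inspecting the binary from outside, the target architecture can also be checked from inside the program. The sketch below relies on the predefined macros __x86_64__, __aarch64__, and __arm64__ that Clang and GCC set according to the compilation target; the function name target_arch is hypothetical and not part of parlaylib.

```cpp
#include <string>

// Returns the architecture this translation unit was compiled for,
// based on macros predefined by Clang and GCC for the compilation target.
std::string target_arch() {
#if defined(__x86_64__)
  return "x86_64";   // e.g. built with -arch x86_64 on an M1 Mac
#elif defined(__aarch64__) || defined(__arm64__)
  return "arm64";    // native Apple silicon / 64-bit ARM build
#else
  return "unknown";  // some other target
#endif
}
```

Printing target_arch() at the start of a benchmark run makes it easy to tell whether a measurement came from a native build or from an x86_64 build running under emulation.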
Could you please check whether the latest commit fixes the issue?
The latest changes to the Deque struct fix the issues on my end. I'll reopen this if I run into further issues on M1.
Running the benchmark using gives an error that clang doesn't support -march=native
Removing -march=native and passing -mcpu=apple-m1 gets the code to compile. However, I noticed some non-termination and bus errors when running on the newer Macs.
Will update below with more concrete steps to reproduce any actual issues that pop up.