crossbeam-rs / crossbeam

Tools for concurrent programming in Rust
Apache License 2.0
7.49k stars 470 forks source link

CachePadded doesn't work correctly (Mac M1) #1138

Closed xpepermint closed 2 months ago

xpepermint commented 2 months ago

I noticed that CachePadded doesn't work correctly on my MacBook M1, and this applies not just to AtomicBool but also to any atomic type (e.g., AtomicUsize, AtomicBool, AtomicU8, etc.). The issue arises in some strange multi-threaded cases, leading to unexpected behavior.

Initially, I thought the problem was caused by something in my code - possibly due to incorrect atomic ordering or a logic bug. After investigating further, I replaced CachePadded with the atomic type directly (e.g., AtomicBool or AtomicUsize), and false reading (load returning some strange number) effects disappeared. This suggests that the issue might not be with atomic operations themselves but with how CachePadded interacts with them on certain architectures.

I also experimented with the alignment of CachePadded. When I used #[repr(align(64))] to explicitly align CachePadded to a 64-byte boundary, the issue disappeared. This leads me to believe that the problem might be related to overalignment (e.g., aligning to 128 bytes instead of 64 bytes on certain architectures like ARM or Apple Silicon).

https://github.com/crossbeam-rs/crossbeam/blob/abf24e1f31a76ef2590688d0c2bb55f82c7410a9/crossbeam-utils/src/cache_padded.rs#L83

I'd appreciate any clarification on whether CachePadded should be aligned to 128 bytes for specific architectures like aarch64 (Apple M1), or if this alignment is unnecessary and potentially problematic. If padding atomic types like AtomicBool is required or recommended for preventing false sharing, I'd also be interested in understanding why aligning to 64 bytes seems to fix the issue.

Thanks for your help!

xpepermint commented 2 months ago

Btw, for x86_64 and powerpc64, I think we should align to their actual cache line size as well, which is typically 64 bytes for both architectures. The use of 128 bytes for alignment in certain contexts (like prefetching or spatial prefetcher behavior on Intel's Sandy Bridge) is an optimization for specific CPU behaviors, but this does not mean the cache line size itself is 128 bytes. So ... Any thoughts?

taiki-e commented 2 months ago

Thanks for the report! Could you tell us more specific information? (e.g., how to reproduce the problem or on what code exactly was tried and what happened.)

I replaced CachePadded with the atomic type directly (e.g., AtomicBool or AtomicUsize), and false reading (load returning some strange number) effects disappeared.

I don't think this necessarily indicates a problem with the CachePadded (or #[repr(align(128))]) itself, as I know of an instance where bug in alignment/size calculation (resulting loads returning strange value) has been detected by using CachePadded. (At first glance, the problem you have encountered looks similar to that...)

And, if we can really write unsound code with just #[repr(align(128))], then it needs to be reported as a bug of the compiler. (I've seen reports of strange things happening with thousands of alignment specified, but I've never seen a case for values that are not that large.)


Btw, for x86_64 and powerpc64, I think we should align to their actual cache line size as well, which is typically 64 bytes for both architectures. The use of 128 bytes for alignment in certain contexts (like prefetching or spatial prefetcher behavior on Intel's Sandy Bridge) is an optimization for specific CPU behaviors, but this does not mean the cache line size itself is 128 bytes. So ... Any thoughts?

My understanding is that as for the purpose of preventing false sharing (hardware_destructive_interference_size, CachePadded), there should be no problem with the alignment being larger than the actual cache line size. See also https://stackoverflow.com/questions/72126606/should-the-cache-padding-size-of-x86-64-be-128-bytes.

Another problem is that there is no way to reflect “actual size”. Dynamic alignment specification is not currently supported by Rust and the target-cpu based way is too complex.

workingjubilee commented 2 months ago

@xpepermint Was the code in question safe?

If it was unsafe, we need to see it, because we need to judge whether it is sound.

What do you mean by "false reading" returning a "strange number"?

This sounds like you were reading uninitialized memory as anything but MaybeUninit. That is unsound.

xpepermint commented 2 months ago

Ah, after more than a day of searching and dealing with the confusing results my (unsafe) code was throwing out, I finally figured out what I had completely overlooked. Removing CachePadded from the code or aligning to 64 fixed the problem, but it was so strange that I wasn't satisfied and went down the rabbit hole - the wrong one! :)

There's no point in explaining all the details, but due to the complexity of the problem, it remained hidden even after a full day of searching.

Anyway, I apologize for this, and thank you for the helpful links.