Poor cache performance for large buffers because allocations are all page-aligned.

BrucePerens commented 3 years ago

This belongs in any programmer warnings document there might be for Crystal.

The Boehm garbage collector page-aligns every allocation larger than one hardware page. This potentially causes cache contention.

Assume the CPU has a 2-way set-associative cache, like the L1 cache in Intel i7 processors, and that it indexes on the low address bits. Some CPUs do; Intel Sandy Bridge hashes on high bits, which had to be found out by reverse engineering. The programmer has an image processing algorithm with 3 operands, each in a large allocated buffer. Because all three buffers are page-aligned, corresponding pixels map to the same cache set, so every pixel will cause a cache spill: the set has 2 available cache lines and there are 3 memory references.

The solution is to stagger allocations by a cache line width (64 bytes on i7). The programmer will need to do this by over-allocating and offsetting each buffer's start by a random (or periodic) multiple of 64 bytes within the first page.
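
A minimal sketch of that staggering in C, not taken from the thread. It assumes a 64-byte cache line and a 4096-byte page, and uses a periodic (round-robin) offset rather than a random one. The names `staggered_buf`, `staggered_alloc` and `staggered_free` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define STAGGER_LINE 64u    /* assumed cache line width in bytes */
#define STAGGER_PAGE 4096u  /* assumed hardware page size in bytes */

typedef struct {
    void *base;  /* pointer returned by the allocator; needed for free() */
    void *data;  /* staggered start address handed to the caller */
} staggered_buf;

static unsigned next_slot = 0;  /* periodic counter; not thread-safe */

/* Over-allocate by one page and start the buffer at a multiple of the
 * cache line width inside the first page. */
staggered_buf staggered_alloc(size_t size) {
    staggered_buf b = { NULL, NULL };
    size_t slots = STAGGER_PAGE / STAGGER_LINE;            /* 64 offsets */
    size_t offset = (next_slot++ % slots) * STAGGER_LINE;
    /* aligned_alloc wants a size that is a multiple of the alignment */
    size_t total = size + STAGGER_PAGE;
    total = (total + STAGGER_PAGE - 1) / STAGGER_PAGE * STAGGER_PAGE;
    b.base = aligned_alloc(STAGGER_PAGE, total);
    if (b.base != NULL)
        b.data = (char *)b.base + offset;
    return b;
}

void staggered_free(staggered_buf *b) {
    free(b->base);
    b->base = NULL;
    b->data = NULL;
}
```

Three consecutive calls return buffers that start 0, 64 and 128 bytes into their first page, so pixels at the same index no longer compete for the same cache set.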

rdp commented 3 years ago

How big is a hardware page?

BrucePerens commented 2 years ago

> How big is a hardware page?

All of this requires tuning per architecture and even per CPU version. A hardware page is not even one number on architectures that support hugepages, like modern Intel CPUs. I decided the best solution would be to write a large buffer allocation facility so that users could do the right thing without an education in memory handling by modern CPUs. This would:

- Not use the GC or subject the buffer to GC scanning. Explicit deallocation would be required.
- Use hugepages where available, to conserve translation-lookaside buffer (TLB) slots.
- Provide cache-line staggering of buffer start addresses, to prevent cache thrashing.
- Allow tuning per CPU through a configuration file.
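
A rough sketch of what such a facility could look like in C on Linux. It is not the shard mentioned in this thread. It assumes 2 MiB hugepages and 64-byte cache lines, falls back to regular pages when `MAP_HUGETLB` is unavailable or the mapping fails, and leaves out the configuration file. The names `big_buf`, `big_buf_alloc` and `big_buf_free` are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define BB_CACHE_LINE 64u                  /* assumed cache line width */
#define BB_SLOTS      64u                  /* offsets within first 4 KiB */
#define BB_HUGE_PAGE  (2u * 1024u * 1024u) /* assumed hugepage size */

typedef struct {
    void  *map;      /* start of the mapping, needed for munmap */
    size_t map_len;  /* length of the mapping */
    void  *data;     /* staggered start address for the caller */
} big_buf;

/* Map memory outside the GC heap, so it is never scanned.
 * `slot` selects the cache-line offset of the start address. */
int big_buf_alloc(big_buf *b, size_t size, unsigned slot) {
    size_t offset = (size_t)(slot % BB_SLOTS) * BB_CACHE_LINE;
    size_t len = (size + offset + BB_HUGE_PAGE - 1) / BB_HUGE_PAGE * BB_HUGE_PAGE;
    void *p = MAP_FAILED;
#ifdef MAP_HUGETLB
    /* Try hugepages first; this fails if none are reserved. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#endif
    if (p == MAP_FAILED)  /* fall back to regular pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    b->map = p;
    b->map_len = len;
    b->data = (char *)p + offset;
    return 0;
}

/* Explicit deallocation, as the list above requires. */
void big_buf_free(big_buf *b) {
    munmap(b->map, b->map_len);
    b->map = NULL;
    b->data = NULL;
    b->map_len = 0;
}
```

The caller picks a different `slot` for each buffer used in the same operation, for example 0, 1 and 2 for two inputs and an output.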

rdp commented 2 years ago

Is there an example showing the negative behavior? I wonder if we could tune the GC...

BrucePerens commented 2 years ago

My examples come from my history at Pixar. The issue is mostly not something you can do anything about with the GC, because GC pages are regular pages. Using hugepages means that your CPU will have far fewer TLB misses, because one TLB entry covers a very large extent of memory rather than a small hardware page. Without hugepages, you potentially evict every TLB entry while traversing an image buffer. The Intel data sheet doesn't say how many entries there are, but they are a scarce resource.
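
To put rough numbers on the TLB argument, here is a small C helper that counts how many pages, and so at worst how many TLB entries, a buffer spans. The buffer size in the note below is an illustrative assumption, not a figure from this thread, and `pages_needed` is a hypothetical name.

```c
#include <stddef.h>

/* Number of pages of `page_size` bytes needed to cover `bytes`,
 * rounding up. Each page needs its own TLB entry while in use. */
static size_t pages_needed(size_t bytes, size_t page_size) {
    return (bytes + page_size - 1) / page_size;
}
```

For a 256 MiB buffer (for instance a 4096 x 4096 image with four 32-bit float channels), `pages_needed` gives 65536 pages at 4 KiB but only 128 pages at 2 MiB.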

The Intel CPU reference here: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/10th-gen-core-families-datasheet-vol-1-datasheet.pdf says there are 16 ways for the last level cache, and 8 ways for the first and second level caches. So the cache staggering issue is not as big as it was when CPUs had only three ways, and any three-input, one-output operation, like compositing with a mask, would continually spill the cache. But 3D rendering gets above 8 operands. So your best defense against thrashing the cache is to not have image buffers start at addresses that map to the same cache set.
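
A small C illustration of why page-aligned buffers collide. It assumes plain modulo indexing on the low address bits, 64-byte lines, and a 32 KiB 8-way cache, which gives 64 sets. The cache size is an assumption, real CPUs may hash differently, and `cache_set` is a hypothetical name.

```c
#include <stdint.h>

#define SET_LINE_BYTES 64u  /* assumed cache line width */
#define SET_COUNT      64u  /* assumed: 32 KiB / (8 ways * 64 bytes) */

/* Cache set selected by an address under simple modulo indexing. */
static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / SET_LINE_BYTES) % SET_COUNT);
}
```

Under these assumptions every page-aligned address lands in set 0, because 4096 / 64 is exactly 64 lines. Nine page-aligned buffers walked in lockstep therefore fight over the 8 ways of one set at every step. Offsetting each start by a different multiple of 64 bytes spreads them over different sets.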

Finally, it takes time to traverse big image buffers with the GC scanner, and it's never going to hit anything in them. So, might as well not scan them at all, and you get that by not allocating them through the Boehm allocator.

konovod commented 2 years ago

> Finally, it takes time to traverse big image buffers with the GC scanner, and it's never going to hit anything in them. So, might as well not scan them at all, and you get that by not allocating them through the Boehm allocator.

This part is solved by GC.malloc_atomic: buffers allocated with it aren't scanned. Actually, Crystal already uses it when allocating types that don't contain pointers, so it shouldn't cause problems with images.

BrucePerens commented 2 years ago

Thanks for pointing that out. Unfortunately, I don't presently have a way to use it with hugepages. I think hugepages are allocated through a separate kernel facility. I've not implemented my shard for this yet, so I have yet to see.

crystal-lang / crystal

Poor cache performance for large buffers because allocations are all page-aligned. #10444