Kobzol / hardware-effects

Demonstration of various hardware effects.
MIT License

It's not write combining #1

Closed · travisdowns closed this issue 5 years ago

travisdowns commented 5 years ago

Despite what the Mechanical Sympathy article says, it's not write combining that produces the effects you see in the write-combining tests. Neither Intel nor AMD uses WC buffers for normal writes, only for "WC protocol" writes, which basically means writes to WC memory and writes to WB ("normal") memory using NT stores.

You could probe the WC behavior using NT stores. It still leaves open the question of what produces the effect you see.
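
For reference, here's a minimal sketch of what probing with NT stores could look like (illustrative only, not the repo's benchmark; the buffer size and fill value are arbitrary):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill a buffer using non-temporal (streaming) stores, which use the WC
// protocol instead of the usual RFO + L1 path taken by regular WB stores.
void fill_nt(uint32_t* dst, std::size_t count, uint32_t value) {
    for (std::size_t i = 0; i < count; ++i) {
        _mm_stream_si32(reinterpret_cast<int*>(dst + i), static_cast<int>(value));
    }
    _mm_sfence(); // order the streaming stores before any later loads/stores
}

int main() {
    std::vector<uint32_t> buffer(1u << 20);
    fill_nt(buffer.data(), buffer.size(), 42);
}
```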

Kobzol commented 5 years ago

Yeah, I thought this one was a bit inconclusive, thanks for confirming it. I actually tried it with NT stores (it's commented out in the code) and then the behavior was reversed.

I think it might be caused by the number of parallel streams that the hardware prefetcher can handle? Just a wild guess though.

Do you have a source for the WC buffer usage? I suppose it's in the Intel manual, but a page would be nice :-)

travisdowns commented 5 years ago

For Intel, I don't actually have a good reference. In fact, if you read the optimization manual, you will find information on WC buffers that seems to indicate they are still used for regular writes to WB regions, but I believe this information is out of date (if it was ever correct in the first place).

The path to memory for regular writes is pretty well known: a store first enters the store buffer, and at some point before it leaves the store buffer, the associated cache line is read into L1 if it is not already there. This is known as an RFO (request for ownership). When the line is present in L1, stores drain one by one from the store buffer into the L1. The process of reading lines into the L1 so they can be stored uses the so-called line fill buffers, which operate between the L1 and the rest of the memory subsystem.

So no "write-combining buffers" in the traditional sense are used there. However, there is the complication that the buffers that do serve as write-combining buffers for WC memory and NT stores to WB memory are probably the same line fill buffers mentioned above, but used in a different way when WC is needed.

This is a good thread to read, and you can find similar threads on that forum.

All in all, I'll admit I'm only 90% sure here, not 100%.

For AMD, you can find the text in their optimization manual:

2.13.3 Write-Combining Operations

To improve system performance, the AMD Family 17h processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte write buffer that is aligned to a cache-line boundary. The processor continues to combine writes to this buffer without writing the data to the system, as long as certain rules apply (see Table 2 for more information). The data sizes can be bytes, words, doublewords, or quadwords.

• WC memory type writes can be combined in any order up to a full 64-byte write buffer.
• All other memory types for stores that go through the write buffer (UC, WP, WT and WB) cannot be combined, except when the WB memory type is overridden for streaming store instructions such as the MOVNTQ and MOVNTI instructions, etc. These instructions use the write buffers and will be write-combined in the same way as address spaces mapped by the MTRR registers and PAT extensions. When the WCB is used for streaming store instructions, the buffers are subject to the same flushing events as write-combined address spaces.

So there at least it is clear that write combining applies only to WC memory and to WB memory when NT stores are used.

Kobzol commented 5 years ago

Thanks, that makes it clearer. I also think that this effect in general might be skewed by the compiler if it decides to do loop fission/fusion. I will change the example so that it uses non-temporal stores, and then I hope it can be used to measure the number of WC buffers.

If I'm interpreting this correctly, this shows that on my CPU there are probably 8 WC buffers. When I write to 10 separate arrays, as long as I only write to 8 arrays at once, the time stays the same. However, when I try to write to 9 arrays at once, the performance drops significantly. This is using NT stores.

[chart: wc]
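
Roughly, the loop I have in mind looks like this (a sketch, not the exact code in the repo; the array length and element type are placeholders). With `streams` arrays written in an interleaved fashion, `streams` partially filled cache lines are live at the same time, so the runtime should jump once `streams` exceeds the number of WC buffers:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave NT stores across `streams` separate arrays so that `streams`
// write-combining buffers are needed simultaneously.
void write_streams(std::vector<std::vector<uint32_t>>& arrays, std::size_t streams) {
    const std::size_t len = arrays[0].size();
    for (std::size_t i = 0; i < len; ++i) {
        for (std::size_t s = 0; s < streams; ++s) {
            _mm_stream_si32(reinterpret_cast<int*>(&arrays[s][i]),
                            static_cast<int>(i));
        }
    }
    _mm_sfence(); // order the streaming stores at the end
}
```

Timing something like this for `streams` = 1 through 10 is what the chart above shows.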

Kobzol commented 5 years ago

When I tried a larger array with more increments, it starts to explode at 13, but that may also be caused by bandwidth issues. It's also weird that increment 11 is consistently much faster than increments 9 and 10 (those two are slow for some reason). Any ideas?

It happens even if I manually write the loops, so I hope that it's not caused by compiler loop transformations or by the additional loop overhead when the write regions were used. VTune reports that the fill buffer can't keep up with stride 9, but it's fine with stride 11.

[chart: wc]

travisdowns commented 5 years ago

> If I'm interpreting this correctly, this shows that on my CPU there are probably 8 WC buffers. When I write to 10 separate arrays, as long as I only write to 8 arrays at once, the time stays the same. However, when I try to write to 9 arrays at once, the performance drops significantly. This is using NT stores.

It's a reasonable assumption: there are 10 "fill buffers" on modern Intel, and, as above, the conventional wisdom is that these fill buffers do double duty as write-combining buffers. For older CPUs you can find a note that 2 of the WC buffers are "special" in that they will be used by other memory accesses and not be dedicated to WC, so 10 - 2 = 8. At least, it is feasible that the CPU sets aside a couple of fill buffers for guaranteed non-WC use.

travisdowns commented 5 years ago

> When I tried a larger array with more increments, it starts to explode at 13

Is this chart with NT stores or regular stores? With regular stores, it is possible you are blowing out the associativity of the cache (say 8 ways in L1 plus 4 in L2 = 12), so you get thrashing in the cache.

Kobzol commented 5 years ago

Both charts are NT stores.

travisdowns commented 5 years ago

> Both charts are NT stores.

I see. The magnitude of the change in performance is quite dramatic!

On my Skylake i7-6700HQ, I see a similar jump. Here's my output:

for i in {1..20}; do echo -n "$i "; ./write-combining 20 $i; done
1 12633
2 6665
3 4772
4 3565
5 3354
6 3501
7 3130
8 3169
9 6550
10 7592
11 5246
12 3424
13 35142
14 38449
15 40425
16 43727
17 48078
18 50478
19 53045
20 56264

I see a jump between 8 and 9, and also 12 is much faster than both 11 and 13, similar to the effect you saw with 11.

NT stores are a bit mysterious, but yeah I think this bears more investigation.

Kobzol commented 5 years ago

I profiled it using VTune and it doesn't come close to maximum bandwidth, so I guess it's not caused by that. My guess would be that with 13 we are hitting the real limit of the WC buffers and for numbers below that either the CPU is good enough to cover for it or the compiler reorders the loop so that it doesn't write to that many arrays at once.

However I checked the code on Godbolt and the loop isn't even unrolled, so I don't think it's the compiler that's causing this. And with classic WB writes the increment 13 doesn't explode, so it actually could be caused by the number of WC buffers.

travisdowns commented 5 years ago

> or the compiler reorders the loop so that it doesn't write to that many arrays at once.

I checked the assembly and the loop is pretty much faithful to what you wrote.

I don't think there are 13 WC buffers though, that seems like too many to me. It could be that below 13 the out-of-order window is large enough to look far enough ahead that you still completely fill up the WC buffers, but at 13 and above you can't do that.

The test also includes the time for paging in the arrays in the first place, which will involve a bunch of page faults and so on, which is probably not something you want to include (although I tried populating the arrays before the timing loop locally and it doesn't change these results).
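
Something like the following (just a sketch, assuming the arrays are plain heap buffers, which is not necessarily how the repo allocates them) would move the page faults out of the timed region:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Touch (write) every byte of each buffer before the timed loop so the page
// faults and kernel zero-filling happen up front, not during the measurement.
void prefault(const std::vector<uint32_t*>& buffers, std::size_t bytes_per_buffer) {
    for (uint32_t* buf : buffers) {
        std::memset(buf, 0, bytes_per_buffer);
    }
}
```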

Kobzol commented 5 years ago

You're right, this program doesn't have many repetitions, so the setup cost won't be amortized that much. I modified it to zero the arrays on allocation.

It's still pretty weird that increments 9 and 10 are slightly slower, but I have no idea how to explain that. I'm pretty happy with this chart:

[chart: wc]

I'm gonna change the example so that it demonstrates the number of WC buffers with this chart (hopefully).

travisdowns commented 5 years ago

Earlier I wrote:

> I don't think there are 13 WC buffers though

but of course it should say "12 WC buffers" since 13 is the first value at which the timing explodes.

In light of that, I'm going to take back my statement as well: I was pretty sure there were only 10 fill buffers on Intel, but over the last half year I've seen a fair amount of evidence that there may actually be 12 fill buffers on Skylake. For example, the fill buffer occupancy counter l1d_pend_miss.pending shows values up to 12, although it tends to average around 10. Also, when you estimate the maximum speedup from many parallel accesses like this, you end up peaking around 12 parallel streams, not 10, and the max speedup is actually slightly above 10, which should be impossible if there were only 10 buffers.

I'll try running this on Haswell.

travisdowns commented 5 years ago

For Haswell, the "blowup" happens at 11, not 13:

 1  2242
 2  1174
 3   911
 4   764
 5   736
 6   760
 7   708
 8  1480
 9   922
10   867
11  6408 << boom!
12  7447
13  7335
14  7855
15  8674
16  8881
17  9753
18  9919
19 10457
20 10894

So I would say this probably suggests that Skylake has 12 WC buffers, two more than Haswell's 10. By extension, it also suggests that Skylake has 12 line fill buffers, per the theory that LFBs and WC buffers are the same thing.

Kobzol commented 5 years ago

Awesome, thanks for trying that out :) We can't be sure I guess, but it definitely seems plausible.

travisdowns commented 5 years ago

For what it's worth, and partly for my own bookkeeping, here are my results with non-NT (i.e., "regular") writes on an i7-6700HQ, with prefetch on:

[chart: regular stores, prefetch on]

and with prefetch off:

[chart: regular stores, prefetch off]

It is surprising to me that these illustrate a large jump at ~13 concurrent write streams, similar to the NT writes (except that the magnitude of the jump is ~3x rather than > 10x).

alexisfrjp commented 4 years ago

As an update, Skylake has only 10 LFBs: https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake

travisdowns commented 4 years ago

> As an update, Skylake has only 10 LFBs: https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake

I've seen that, but I don't think it is correct. It is subject to error like any other source.

alexisfrjp commented 4 years ago

Compiled with RedHat/Centos devtoolset-9: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2):

Using CMAKE_BUILD_TYPE=Release is mandatory.

i5-9600K

[chart omitted]

i9-9900K

[chart omitted]

AMD Threadripper 3960x

[chart omitted]

Interesting to see the Threadripper with far fewer LFBs than the much cheaper Intel CPUs.

travisdowns commented 4 years ago

> Roger. How confident are you with the numbers you found?

Fairly confident, say in the 80% to 90% range (that is, to take a bet against it, I would need somewhere between 4:1 and 9:1 odds).

travisdowns commented 4 years ago

> I have an off-topic (maybe) question:
>
> Writing data sequentially, I experience a delay between each flush.

I am not sure. I can imagine a lot of reasons for a stall when different memory types are accessed, but it would be very strange for the performance to be so different when executed from kernel mode as opposed to user mode. One possible reason could be TLB effects: the kernel uses a totally different large-page mapping to cover virtual address space than userspace processes do, so maybe there are page walks in the user case that don't occur in the kernel case. Are you 100% sure the memory types are configured the same way in both cases?

I think I recall a similar question recently on SO or the Intel software forums, maybe that was you? That's probably a better place to discuss it than this GitHub issue.