
DynamoRIO mixes Intel AVX and Intel SSE code #3935

Open andrea-brunato-ARM opened 4 years ago

andrea-brunato-ARM commented 4 years ago

As mentioned in the Intel optimization manual, in the "Mixing AVX code with SSE code" section:

If software inter-mixes AVX and SSE instructions without using VZEROUPPER properly, it can experience an AVX/SSE transition penalty
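For concreteness, the pattern the manual warns about looks roughly like this (a minimal sketch of mine, assuming GCC/Clang with -mavx; the explicit asm keeps addpd in its legacy SSE encoding, since under -mavx the compiler would otherwise VEX-encode the intrinsics):

#include <immintrin.h>

void transition_demo(const double* p, double* out) {
        __m256d v = _mm256_loadu_pd(p);   // VEX-256 load: upper YMM state becomes dirty
        _mm256_storeu_pd(out, v);
        // Without the next line, the legacy SSE instruction below executes in
        // Dirty Upper State and pays the mixing penalty:
        // _mm256_zeroupper();            // compiles to a single vzeroupper
        __asm__ volatile("addpd %%xmm0, %%xmm0" ::: "xmm0");  // legacy SSE encoding
}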

I have noticed that this causes a significant performance regression when running under DynamoRIO on Intel Skylake (model 85).

For instance, consider an application that performs a simple for loop, compiled with -ftree-vectorize -O3:

int N=10000;

__attribute__((noinline)) void vector_triad(const double* __restrict  a, const double* __restrict b, double* __restrict c, double scale){
        //__asm__("vzeroupper");
        for(int i=0; i<N; i++){
                c[i] = a[i] + scale * b[i];
        }
        return;
}

int main(int argc, char* argv[]){
        double* a = new double[N];
        double* b = new double[N];
        double* c = new double[N];
        for(int j=0; j<1000000; j++){
                vector_triad(a,b,c,2.0);
        }
        return 0;
}

I observe the following execution times:

time app --> 0m2.844s

time drrun -- app --> 0m14.376s

When instead running the same application with the "vzeroupper" instruction added (shown commented out in the code above):

time drrun -- app --> 0m3.225s

Is there a fix that could be implemented inside DynamoRIO in order to prevent such a slowdown?

derekbruening commented 4 years ago

Thank you for the report. DR's code is not interleaved with the app's in steady state, but the false dependence must be permanently there in this case? Is the only point where DR could insert a vzeroupper with any impact right after the SIMD restores in fcache_enter, where it is unsafe? It's not clear whether there is a practical solution: even with analysis of the app code, there could be entrances from DR at any point. If the entire app never uses the top 128 bits you could imagine a solution, with a big flush if the app ever does: and a mixed app would itself use vzeroupper? Though maybe not in enough places. Hmm.

@hgreving2304 could you take a closer look and see what's feasible?

hgreving2304 commented 4 years ago

Previously we came to the conclusion that it is not feasible. The application above should not incur any AVX/SSE mix introduced by DR if it runs in steady state in the code cache most of the time. As @derekbruening has pointed out, AVX is only introduced upon context switches when leaving/entering the code cache. It is also possible for clean calls and signal handling, but that doesn't seem to be the cause here? Some places mention that Skylake does not incur any expensive state transition penalty, by the way. Something doesn't make sense with these numbers above, in particular on Skylake. But even if they were correct, there is likely no way for us to add vzeroupper instructions on context switches, since as a binary translation system we don't know much about the liveness of the registers at an arbitrary bb. We could mitigate by adding a lazy AVX detection as we already do with AVX-512. I'll have a closer look.

hgreving2304 commented 4 years ago

We have found the likely reason: see the Intel manual, section 14.3, Figure 14.2. It looks like on Skylake we incur a bigger penalty even though DR adds only a one-time context switch before the app is in steady state.

algrant-arm commented 4 years ago

For future reference, the symptom that pinpointed this as a problem was a huge increase in UOPS_ISSUED.VECTOR_WIDTH_MISMATCH: "Counts the number of Blend Uops issued by the Resource Allocation Table (RAT) to the reservation station (RS) in order to preserve upper bits of vector registers. Starting with the Skylake microarchitecture, these Blend uops are needed since every Intel SSE instruction executed in Dirty Upper State needs to preserve bits 128-255 of the destination register."

The problem is that this is not a transition penalty; it is a penalty incurred every time you execute an SSE instruction in Dirty Upper State. If the code cache contains SSE (which DynamoRIO ought to know), then it should be entered in Clean Upper State.
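In other words (a hedged sketch of the suggestion with hypothetical names, not DR's real entry code; _mm256_zeroupper() compiles to a single vzeroupper and requires -mavx):

#include <immintrin.h>

// Hypothetical sketch: enter an SSE-only code cache in Clean Upper State.
void enter_fragment(void (*fragment)(), bool cache_is_sse_only) {
        if (cache_is_sse_only)
                _mm256_zeroupper();   // clear YMM upper halves -> Clean Upper State
        fragment();                   // translated SSE code now runs penalty-free
}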

hgreving2304 commented 4 years ago

Yes, it's a curiosity of Skylake. Architectures before it (a "one time" per-transition cost) and after it (almost no cost) don't have this problem. We decided to mitigate this by adding a vzeroupper that will lazily be turned off once AVX has been encountered. This will at least fix SSE-only applications running on Skylake. Thanks for bringing this to our attention.
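A hedged sketch of what such a lazy scheme could look like (the bookkeeping names are hypothetical, not DR's internals; vzeroupper encodes as c5 f8 77, three bytes, which a three-byte nop can overwrite in place):

#include <cstring>
#include <vector>

static std::vector<unsigned char*> g_vzeroupper_sites;  // hypothetical bookkeeping

void record_vzeroupper_site(unsigned char* pc) {
        g_vzeroupper_sites.push_back(pc);  // called when emitting vzeroupper at a cache entry
}

void on_first_avx_instruction_seen() {
        static const unsigned char nop3[3] = {0x0f, 0x1f, 0x00};  // 3-byte nop
        for (unsigned char* pc : g_vzeroupper_sites)
                std::memcpy(pc, nop3, 3);  // assumes the code cache is writable here
        g_vzeroupper_sites.clear();
}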

hgreving2304 commented 4 years ago

@derekbruening , I have a prototype that inserts vzeroupper lazily, patching it with a nop once AVX has been detected. The patch has no effect on the example above, or on any other SSE-only app, as long as you're running on a machine with support for AVX, because it looks like even libc introduces AVX. Even -mno-avx doesn't seem to prevent this. You'd probably have to fake cpuid in order to avoid libc's runtime dispatch to AVX code. The libc code uses proper vzeroupper instructions, so it can do that. Considering this, should we even bother to add the lazy detection? It seems like it will almost always have no effect.

hgreving2304 commented 4 years ago

I think the only mitigation I could think of (discussed offline) is to

derekbruening commented 4 years ago

Agreed, there does not seem to be any good solution. I'm curious whether other DBT systems have the same problem and whether any have a solution.

algrant-arm commented 4 years ago

Have you looked at using XSAVEC / XRSTOR? That way if the top halves XMM0-15_H are in 'init' state when saved, they are restored to 'init' state because XRSTOR will see that XSTATE_BV[2] is 0. But if they have been used by application code then they will be restored.

So in the unusual case where the top 128 bits do have live non-zero data on execution of an SSE instruction, and the application is relying on them being preserved (with a performance penalty), you wouldn't lose them the way you would by executing VZEROUPPER.
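A minimal sketch of that idea using compiler intrinsics (assumes GCC/Clang with -mxsave -mxsavec; the save area must be 64-byte aligned, and zero-initializing it first is the conservative choice):

#include <immintrin.h>
#include <cstdint>
#include <cstring>

alignas(64) static unsigned char xarea[4096];   // generously sized save area
static const uint64_t RFBM_SSE_AVX = 0x6;       // bit 1 = XMM, bit 2 = YMM_H

void save_simd_context() {
        std::memset(xarea, 0, sizeof(xarea));
        _xsavec64(xarea, RFBM_SSE_AVX);   // init optimization: skips YMM_H if clean
}

void restore_simd_context() {
        _xrstor64(xarea, RFBM_SSE_AVX);   // re-initializes YMM_H if XSTATE_BV[2]==0,
                                          // i.e. restores Clean Upper State
}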

hgreving2304 commented 4 years ago

Thanks, the instructions are interesting. But I don't see how this helps if there is application code, as in the libc example above, that executes AVX. How do we know whether this state is live or not when entering the code cache? Could you give an example of how this helps the case above?

algrant-arm commented 4 years ago

I think you don't need to know whether the top-128 state is live or not. If the contents are non-zero then XSAVEC/XRSTOR will preserve them. If SSE code is executed next then it will take the performance hit from Dirty Upper State but it would have done that anyway if the top-128 contents were non-zero. On the other hand, if it's in Clean Upper State XSAVEC/XRSTOR will preserve Clean Upper State - unlike saving and restoring the registers explicitly with VMOVUPD, which even if the top halves are zero, will take you from Clean Upper State to Dirty Upper State. With XSAVE/XRSTOR you don't have to worry about liveness the way you do if you are zeroing the registers with VZEROUPPER.

The other option would be to inspect the top 128-bits of all the XMM registers and if (and only if) all zero, then issue VZEROUPPER. It's safe because you aren't changing the values (so it doesn't matter if they are live) and it puts you back into Clean Upper State if that's how you were on the way in.
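A hedged sketch of this second option, done on DR's saved copy of the registers rather than on live register state (the context layout here is hypothetical, not DR's mcontext):

#include <immintrin.h>
#include <cstdint>

struct AppSimdContext {              // hypothetical
        uint64_t ymm[16][4];         // full 256-bit values saved at cache exit
};

static bool upper_halves_all_zero(const AppSimdContext& mc) {
        for (int r = 0; r < 16; ++r)
                if (mc.ymm[r][2] != 0 || mc.ymm[r][3] != 0)  // bits 128-255
                        return false;
        return true;
}

void finish_restore(const AppSimdContext& mc) {
        // ... restore the registers from mc (e.g. with vmovupd loads) ...
        if (upper_halves_all_zero(mc))
                _mm256_zeroupper();  // no values change (they were zero), but the
                                     // hardware returns to Clean Upper State
}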

hgreving2304 commented 4 years ago

I think you don't need to know whether [..]

Ok, sounds like this could work. We need to check whether e.g. xsavec actually writes ymmh zeroes when in "clean state", such that the mcontext reflects this.

The other option would be to inspect the top 128-bits of all the XMM registers [..]

I think this would theoretically work, too.

Thanks for the ideas!

hgreving2304 commented 4 years ago

Offline talk:

Option 1) XSAVE/XRSTOR:

Option 2) inspecting regs:

derekbruening commented 4 years ago

Offline talk:

Option 1) XSAVE/XRSTOR:

  • can xrstor restore into a "clean state", e.g. if DR or client are built with AVX code (in addition to ctx switching) making the state dirty?

Right, this is my confusion here: DR and the client are going to use AVX code, so unless XSAVEC is saving the state bits describing the ymmh dirty/clean state and XRSTOR completely ignores the actual hardware state and looks at the dirty/clean state saved by XSAVEC, I don't see how this works (unless DR and all clients are built without any AVX code whatsoever, which I don't think is reasonable).

algrant-arm commented 4 years ago

so unless XSAVEC is saving the state bits describing the ymmh dirty/clean state

I believe that's effectively what it does. You ask XSAVE to save XMM and YMM_H. It spots that YMM_H is in Clean Upper State and sets XSTATE_BV[1] but not XSTATE_BV[2], i.e. it is asserting that YMM_H is not present in the save area. Then when you do XRSTOR, asking it to restore XMM and YMM_H, it looks at XSTATE_BV[2], sees that it wasn't saved, and initializes YMM_H, restoring it to Clean Upper State.

Conversely, if YMM_H was in use, the exact same XSAVE would save YMM_H and set XSTATE_BV[2], and then the data would be restored by XRSTOR.

As far as I know, this is what the kernel does on a timeslice context switch - on Skylake we don't see a process using SSE suddenly run much slower just because it's been timesliced with a process using AVX, and that's because the kernel uses XSAVE / XRSTOR to manage contexts.

hgreving2304 commented 4 years ago

Makes sense. The only potential problem might be that if the state is clean and xsave does not save it, we could be left with invalid, outdated state in DynamoRIO's structures, i.e. we would need to save XSTATE_BV[2] in order to know that the ymmh state is zero.

Other than that, this sounds like the way to go.

derekbruening commented 4 years ago

Back when we added initial AVX support, we considered using XSAVE. We found that it was 5x (!) slower than individually saving the ymm registers. That was a long time ago, but we would need to measure the perf here to figure out in which scenarios it's a win.

derekbruening commented 4 years ago

Re: modeling our switches on what the kernel does: we have many cases of very frequent context switches compared to the kernel, so we can't necessarily afford to do things the same way. A client with non-inlined clean calls may be saving and restoring SIMD registers very frequently.

algrant-arm commented 4 years ago

The only potential problem might be that if the state is clean and xsave does not save it, we could be left with invalid, outdated state in DynamoRIO's structures, i.e. we would need to save XSTATE_BV[2] in order to know that the ymmh state is zero.

XSAVE writes the zeroes anyway (I experimented with an XSAVE block initialized to 0xCC's), so the state is correct for clients that want to inspect the application register values, but it will not set XSTATE_BV[2] because YMM_H is not in use. What this means, though, is that if a client wants to update a YMM_H register as seen by the application, by writing a non-zero value into DR's context, then it will need to set XSTATE_BV[2], otherwise XRSTOR will initialize YMM_H to all zeroes.
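A hedged sketch of that caveat (the helper is hypothetical; the standard-format offset of the YMM_H component comes from CPUID leaf 0xD, subleaf 2, and XSTATE_BV lives at byte offset 512 of the save area):

#include <cpuid.h>
#include <cstdint>
#include <cstring>

// Write a non-zero upper half for register `reg` into a saved XSAVE area
// and mark the YMM_H component as present so XRSTOR won't re-init it.
void poke_ymmh(unsigned char* xarea, int reg, const uint64_t value[2]) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0xD, 2, &eax, &ebx, &ecx, &edx))
                return;                                     // component 2 = YMM_H
        std::memcpy(xarea + ebx + reg * 16, value, 16);     // ebx = standard-format offset
        uint64_t xstate_bv;
        std::memcpy(&xstate_bv, xarea + 512, 8);
        xstate_bv |= 1ull << 2;                             // set XSTATE_BV[2]
        std::memcpy(xarea + 512, &xstate_bv, 8);
}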

There's an XSAVEOPT, which is even more lazy and doesn't store values if it thinks they haven't changed since it last saved them to the same save area; but even with that you might be safe, as it would store them the first time round.

hgreving2304 commented 4 years ago

By the way, slightly related topic: have you tried to get the extended AVX-512 state with xsave in 32-bit mode? While it is straightforward to query the component offsets with cpuid for 64-bit (we do this in DynamoRIO, see xstate_area_zmm_hi256_offs, xstate_area_hi16_zmm_offs, xstate_area_kmask_offs), the same doesn't seem to be true for 32-bit.

hgreving2304 commented 4 years ago

FTR: understanding this a bit better: we had to use xsavec (the compacted version). This was pointed out above but wasn't clear to me until now. This implies that there may be an issue with older processors that don't support the compacted format, i.e. is there a processor that supports AVX but not xsavec?

XSAVE writes the zeroes anyway

I can't confirm this. I ran some experiments (easy to get wrong, because libc++ and even printf seem to introduce vzeroupper) and it looks to me that xsavec clearly does not write the ymmh area if XINUSE[2] == 0. XSTATE_BV[2] is zero in this case in the xstate header. The context switch code could read this bit and re-write ymmh. The header has to be saved anyway for the xrstor later. So it is still doable functionally.

Question is performance.
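A hedged sketch of the "read this bit and re-write ymmh" idea from above (the context structure is hypothetical, not DR's mcontext):

#include <cstdint>
#include <cstring>

struct ToolSimdContext {              // hypothetical
        uint64_t ymmh[16][2];         // upper 128 bits of each ymm register
};

void sync_ymmh_after_xsavec(ToolSimdContext* mc, const unsigned char* xarea) {
        uint64_t xstate_bv;
        std::memcpy(&xstate_bv, xarea + 512, 8);             // XSAVE header: XSTATE_BV
        if ((xstate_bv & (1ull << 2)) == 0)                  // YMM_H left in init state
                std::memset(mc->ymmh, 0, sizeof(mc->ymmh));  // record the zeros explicitly
        // else: copy the ymmh component out of the save area as usual
}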

derekbruening commented 4 years ago

This implies that there may be an issue with older processors that don't support the compacted format. i.e. is there a processor that supports AVX but no xsavec.

According to DR's decode_table.c https://github.com/DynamoRIO/dynamorio/blob/master/core/arch/x86/decode_table.c#L1269, XSAVEC was added in Skylake, so the answer is definitely yes, there are many processors with AVX but not XSAVEC. It sounds like we would have to have a split approach with special code just for Skylake+.

algrant-arm commented 4 years ago

Sorry for not being clear, I mentioned XSAVEC originally, but the one I tested as saving zeroes even in Clean Upper State is XSAVE. XSAVEC is more lazy, and wouldn't write back the zeroes, as well as using a more compact format.

If you have XSAVEC (as indicated by CPUID) then you could use either one. I think the only need to differentiate on model number would be for performance reasons. The specific performance hit from executing SSE in Dirty Upper State is apparently only a problem on Skylake models so you might choose to only use XSAVE/XSAVEC there - and use whichever one is more convenient.
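A small sketch of that runtime check (CPUID leaf 0xD, subleaf 1: EAX bit 1 indicates XSAVEC support):

#include <cpuid.h>

bool cpu_has_xsavec() {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0xD, 1, &eax, &ebx, &ecx, &edx))
                return false;
        return (eax & (1u << 1)) != 0;    // bit 0 = XSAVEOPT, bit 1 = XSAVEC
}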

hgreving2304 commented 4 years ago

Ok, can confirm both. Apparently xsave(64) does not do the "init optimization" that xsavec does, i.e. it always writes. There's also xsaveopt, which on top of xsavec applies a "modified optimization". We may even be able to utilize the latter. So xsave(64) would work, and it does always write the component, even if in init state, i.e. it writes zeroes. We could use xsavec as described above on newer processors as an optimization.

I ran a quick perf experiment with the vector_triad loop above, but with more iterations, and with simulated DynamoRIO fcache_enter/fcache_exit before/after every call.

xsave64 + xrstor: VM (not sure what processor) 40.3s, Skylake 26.0s

xsavec64 + xrstor: VM (not sure what processor) 40.1s, Skylake 25.7s

Save/restore xmm + ymm state w/ (v)mov (on Skylake, this penalty scales with the number of SSE instructions that incur the dirty-state blending costs): VM (not sure what processor) 2m48.8s, Skylake 1m42.6s

Save/restore xmm + ymm state w/ (v)mov + vzeroupper: VM (not sure what processor) 43.1s, Skylake 25.6s

For reference, w/o save/restore of xmm + ymm state: VM (not sure what processor) 37.9s, Skylake 23.1s

Looks like xsave is about the same performance as manually saving xmm + ymm (with vzeroupper) on my machines. I could only measure a tiny improvement w/ xsavec.

algrant-arm commented 4 years ago

Thanks, that's useful data.

Using XSAVEOPT might need some caution. It avoids writing data, even when non-zero and possibly live, if it thinks that it used the same save area last time and the data is up to date, i.e. XSAVEOPT avoids saving when registers are "clean" in the sense that a cache line is clean. Crucially, "the same save area" appears to be the one with the same linear address, so if the OS shuffles virtual-to-linear mappings around under the process's feet, it might be that the same linear address happens to get used for two logically different XSTATEs.

For XSAVEC this isn't an issue because the "lazy" optimization is only about not saving or loading clean (zero) data, and the clean state is always represented by XSTATE_BV in the save area.

hgreving2304 commented 4 years ago

I think we should confirm that xsavec is not much faster than xsave for our use cases. Then we should just use xsave and consider breaking binary compatibility by re-arranging the mcontext, so we can write directly into it with xsave. Otherwise we will need to shuffle the data around every time.
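A hedged sketch of the layout idea (this is not DR's real mcontext; the point is only that xsave/xrstor require a 64-byte-aligned target, so the SIMD slots would have to match the XSAVE area layout exactly):

#include <cstdint>

struct Mcontext {                                      // hypothetical layout
        uint64_t gprs[16];
        uint64_t pc;
        alignas(64) unsigned char xsave_area[832];     // 512B legacy + 64B header + 256B YMM_H
};
static_assert(alignof(Mcontext) == 64, "xsave target must be 64-byte aligned");

// With this layout, an xsave intrinsic could target mc.xsave_area directly
// instead of saving to a scratch buffer and copying afterwards.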

derekbruening commented 4 years ago

The xsave* performance should be measured on other processor models, because I suspect it is slower on pre-Skylake based on our past measurements. Let's not put something in place focusing on Skylake that slows down everything else.

hgreving2304 commented 4 years ago

Agreed.