StanfordPL / stoke

STOKE: A stochastic superoptimizer and program synthesizer
http://stoke.stanford.edu
Apache License 2.0

Build broken #713

Closed stefanheule closed 9 years ago

stefanheule commented 9 years ago

It seems that the Nehalem build is broken; we should fix this. Example failure:

Instruction: vpmaxsd -0x7(%r11d,%ebx,4), %ymm15, %ymm7
  Maybe read set: { %ebx %r11d %ymm15 }
  Must write set: { %ymm7 }

Bits 128..192 of %ymm7 differ.
  In state 1: 0x64e3f48aa1a862a9
  In state 2: 0x1b4f3e7450dd852d
Bits 192..256 of %ymm7 differ.
  In state 1: 0x741527e7435279f3
  In state 2: 0x93053165396199e1

To me it looks like this instruction should never have been proposed by the fuzzer in the first place, because it's not available on Nehalem. The culprit seems to be line 57 in tests/fuzzer.h:

tp.set_flags(CpuInfo::get_flags());

Instead, this should depend on whatever architecture the test was built for. Do we have a way to do that? Or should we pass -DNEHALEM_BUILD or similar to the compiler and have an #ifdef guard?
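Something along these lines, for illustration (just a sketch; nehalem_flag_mask is a hypothetical constant that would have to be defined somewhere, and this assumes the flag-set type supports a bitwise AND):

#ifdef NEHALEM_BUILD
  // Only let the fuzzer propose instructions a Nehalem core can execute.
  tp.set_flags(CpuInfo::get_flags() & nehalem_flag_mask);
#else
  tp.set_flags(CpuInfo::get_flags());
#endif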

Or am I completely wrong?

bchurchill commented 9 years ago

That sounds about right. I think this is something we don't have a standard solution for. In some places we have checks using built-in defines like #ifdef AVX or #ifdef AVX2, but I've found things like this to be flaky.

Maybe we could build some custom logic into CpuInfo so that get_flags() does a logical AND with a set of flags that we specify for each build? Then we would only need to worry about this in one place. I think your suggestion of -DNEHALEM_BUILD or similar is good.
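As a rough sketch of what I mean (BUILD_FLAG_MASK, the per-build constants, and read_hardware_flags() are made-up names here, not existing code):

// In CpuInfo: pick a mask at compile time based on the build target...
#if defined(NEHALEM_BUILD)
#define BUILD_FLAG_MASK nehalem_flags     // hypothetical per-build constant
#elif defined(SANDYBRIDGE_BUILD)
#define BUILD_FLAG_MASK sandybridge_flags // hypothetical per-build constant
#else
#define BUILD_FLAG_MASK all_flags         // default: no restriction
#endif

// ...and AND it into the flags read from the hardware, so every caller
// of get_flags() is restricted automatically.
static FlagSet get_flags() {
  // read_hardware_flags() stands in for the existing CPUID-based query.
  return read_hardware_flags() & BUILD_FLAG_MASK;
}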

stefanheule commented 9 years ago

OK, glad that I'm understanding this correctly. AVX should be fine where we are concerned about the STOKE binary itself (i.e., what instructions the binary uses), but we should use something like -DNEHALEM_BUILD to decide what STOKE can propose, etc. It's not totally clear to me why the build is suddenly failing on this when it wasn't before, but this seems like the right fix in any case.
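In other words, two different knobs (sketch only; __AVX__ is the compiler's predefined macro under -mavx, which may or may not be what the existing AVX checks use, and NEHALEM_BUILD is the define proposed above):

#include <cstdio>

int main() {
#ifdef __AVX__
  // Controls which instructions the STOKE binary itself is compiled to use.
  std::printf("binary compiled with AVX code paths\n");
#endif
#ifdef NEHALEM_BUILD
  // Controls which instructions the search/fuzzer is allowed to propose.
  std::printf("proposals restricted to the Nehalem instruction set\n");
#endif
  return 0;
}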

bchurchill commented 9 years ago

It's probably my fault, to be honest. I probably introduced this when I refactored all the fuzzers to use some common code. I think there was previously some special logic (I forget what) to handle this situation.

stefanheule commented 9 years ago

I can implement this, but I don't quite know which flags are not available on which architecture. Do you know (or know where to look)? These are the flags we have:

FPU =       0x0000000000000001, // On-board FPU
TSC =       0x0000000000000002, // RDTSC
MSR =       0x0000000000000004, // RDMSR WRMSR
CX8 =       0x0000000000000008, // CMPXCHG8
SEP =       0x0000000000000010, // SYSENTER SYSEXIT
CMOV =      0x0000000000000020, // CMOV FCMOV
CLFLUSH =   0x0000000000000040, // CLFLUSH
MMX =       0x0000000000000080, // Multimedia Extensions
FXSR =      0x0000000000000100, // FXSAVE FXRSTOR
SSE =       0x0000000000000200, // Intel SSE Vector Instructions
SSE2 =      0x0000000000000400, // SSE2
SYSCALL =   0x0000000000000800, // SYSCALL SYSRET
RDTSCP =    0x0000000000001000, // RDTSCP
REP_GOOD =  0x0000000000002000, // REP Microcode works well
NOPL =      0x0000000000004000, // NOPL (0f 1f) Instructions
PNI =       0x0000000000008000, // SSE3 (Prescott New Instructions)
PCLMULQDQ = 0x0000000000010000, // PCLMULQDQ
MONITOR =   0x0000000000020000, // MONITOR MWAIT
SSSE3 =     0x0000000000040000, // Supplemental SSE-3
FMA =       0x0000000000080000, // Fused Multiply-Add
CX16 =      0x0000000000100000, // CMPXCHG16B
SSE4_1 =    0x0000000000200000, // SSE-4.1
SSE4_2 =    0x0000000000400000, // SSE-4.2
MOVBE =     0x0000000000800000, // MOVBE
POPCNT =    0x0000000001000000, // POPCNT
AES =       0x0000000002000000, // AES instructions
XSAVE =     0x0000000004000000, // XSAVE XRSTOR XSETBV XGETBV
AVX =       0x0000000008000000, // Advanced Vector Extensions
F16C =      0x0000000010000000, // 16-bit FP Conversions
RDRAND =    0x0000000020000000, // RDRAND
LAHF_LM =   0x0000000040000000, // LAHF SAHF (in long mode)
ABM =       0x0000000080000000, // Advanced Bit Manipulation
XSAVEOPT =  0x0000000100000000, // Optimized XSave
FSGSBASE =  0x0000000200000000, // {RD/WR}{FS/GS}BASE
BMI1 =      0x0000000400000000, // 1st Group Bit-manipulation Extensions
HLE =       0x0000000800000000, // Hardware Lock Elision
AVX2 =      0x0000001000000000, // AVX2 Instructions
BMI2 =      0x0000002000000000, // 2nd Group Bit-manipulation Extensions
ERMS =      0x0000004000000000, // Enhanced REP MOVSB/STOSB
INVPCID =   0x0000008000000000, // Invalidate Processor Context ID
RTM =       0x0000010000000000  // Restricted Transactional Memory

And for the two non-Haswell architectures, I'd need to know which ones are not supported.

bchurchill commented 9 years ago

The important ones are AVX/AVX2/BMI1/BMI2.

AVX2, BMI1, and BMI2 are only on Haswell and later; AVX is on Sandy Bridge and Haswell.

I think FMA is only on Haswell as well.

For the others, it would be best to find a Sandy Bridge or Nehalem system and check. For maximum compatibility, we could include all of the SSE* flags, POPCNT, and CMOV, and throw out the rest (it's not like we support the floating-point stack anyway, for example).
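To make that concrete against the list above (a sketch only; whether this is the right membership for each build is exactly what needs checking, and I'm treating the flag names as plain integer constants):

// Conservative baseline per the suggestion above: all SSE*, POPCNT, CMOV.
constexpr uint64_t baseline_mask =
    CMOV | SSE | SSE2 | PNI | SSSE3 | SSE4_1 | SSE4_2 | POPCNT;

// Extensions that are Haswell-only (or nearly so) per the discussion.
constexpr uint64_t haswell_only = AVX2 | BMI1 | BMI2 | FMA;

// AVX is available on Sandy Bridge and Haswell, but not Nehalem.
constexpr uint64_t sandybridge_extra = AVX;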

I'm beginning to realize there's something funky about our approach. Why have different builds for different processors at all? Why not have it all determined dynamically? We could then be compatible with any 64-bit x86 Linux platform. I think the only reason is that there are places where Eric wrote high-performance code using GCC intrinsics that emit vector instructions. However, I would say that this should be the compiler's job, not ours, and we should just write the C++ code. Then we would only have a single build that works across more platforms, and it would mean less testing time, etc.
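If we went fully dynamic, something like GCC's __builtin_cpu_supports (a real GCC/Clang builtin backed by CPUID) could replace the per-build defines entirely. A sketch, not current code:

#include <cstdio>

int main() {
  // Query CPU features at runtime rather than baking them into the build.
  const bool avx2 = __builtin_cpu_supports("avx2");
  const bool bmi2 = __builtin_cpu_supports("bmi2");
  std::printf("avx2=%d bmi2=%d\n", avx2, bmi2);
  return 0;
}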

We could even re-purpose one of the ancient boxes lying in our office to be a second Jenkins machine, just for compatibility testing. I'd be willing to set that up, after PLDI of course.

stefanheule commented 9 years ago

Thanks.

I mostly agree, but there is one important advantage to having different builds (though I'm sure we could do even this in a better way): it allows us to do some testing of the different builds on a Haswell machine. Clearly, testing on a proper Nehalem machine would be better, but building STOKE with some compile options turned off and making sure it still works on a Haswell machine is still much better than no testing at all.

So I do think having a second Jenkins instance on an older machine would be nice, but I also think there's value in running some tests for other architectures on a Haswell machine when we don't have a machine of that type lying around.

bchurchill commented 9 years ago

I think, though, that testing the Nehalem build on a Haswell machine isn't ideal, for two reasons. First, there are false negatives: everything seems fine, but the build is actually using some Haswell-only instructions, which makes it useless to people who actually have a Nehalem machine; I've experienced this in the STOKE codebase in the past. Second, there are false positives, like the one we're encountering now. If we really want to test on a Haswell system we could use a virtual machine, but I'd rather not burden mrwhite with that anyway.

stefanheule commented 9 years ago

Yeah, I agree, it's by no means ideal. But it's better than nothing: when we first built the Nehalem support, we had many genuine failures where it was useful to have testing on Haswell machines. But clearly a real Nehalem machine would be much better.

bchurchill commented 9 years ago

That makes sense. What do you want to do right now? I'm fine with whatever you decide.

stefanheule commented 9 years ago

I fixed the build in the way I described, using your suggestion of changing CpuInfo appropriately with #ifdefs. After this I had to disable a few other tests in the Nehalem and Sandy Bridge builds (because they use instructions that aren't available there). This should solve the issue, and I'll open a new ticket to set up proper Nehalem/Sandy Bridge testing on an actual machine.

stefanheule commented 9 years ago

There are some typos in this commit; reopening.