stoke bails on registers written by bsr bsf instructions

travisdowns commented 7 years ago

I'm trying to optimize a simple function with the following disassembly:

._Z6div_32jj:             #        0x400570  0      OPC=<label>            
  leal -0x1(%rsi), %eax   #  1     0x400570  3      OPC=leal_r32_m16       
  testl %edi, %eax        #  2     0x400573  2      OPC=testl_r32_r32      
  setne %al               #  3     0x400575  3      OPC=setne_r8           
  bsrl %esi, %esi         #  4     0x400578  3      OPC=bsrl_r32_r32       
  movzbl %al, %eax        #  5     0x40057b  3      OPC=movzbl_r32_r8      
  shrxl %esi, %edi, %edi  #  6     0x40057e  5      OPC=shrxl_r32_r32_r32  
  addl %edi, %eax         #  7     0x400583  2      OPC=addl_r32_r32       
  retq                    #  8     0x400585  1      OPC=retq

stoke synthesize fails almost immediately with:

FATAL ERROR: (_Z6div_32jj) Reads from an undefined location: Instruction 'shrxl %esi, %edi, %edi' reads { %esi %edi } but only { %rax %rdi %zf %mxcsr::rc[0] } are defined.

However, esi is written by te earlier bsrl instruction. I'm guessing the complaint is something to do with the fact that bsrl has undefined behavior when the input reg is 0, and the current intel behavior is to leave the reg unchanged. That doesn't really explain it all though because actually both src and dest for the bsrl are esi which is in my def in: --def_in "{ %rax %rdi %rsi }".

travisdowns commented 7 years ago

I should add that a small tweak to the C++ code results in a swap to using the tzcnt instruction:


  xorl %eax, %eax           #  1     0x400570  2      OPC=xorl_r32_r32       
  xorl %edx, %edx           #  2     0x400572  2      OPC=xorl_r32_r32       
  tzcntl %esi, %eax         #  3     0x400574  4      OPC=tzcntl_r32_r32     
  subl $0x1, %esi           #  4     0x400578  3      OPC=subl_r32_imm8      
  testl %edi, %esi          #  5     0x40057b  2      OPC=testl_r32_r32      
  shrxl %eax, %edi, %edi    #  6     0x40057d  5      OPC=shrxl_r32_r32_r32  
  setne %dl                 #  7     0x400582  3      OPC=setne_r8           
  leal (%rdx,%rdi,1), %eax  #  8     0x400585  3      OPC=leal_r32_m16       
  retq                      #  9     0x400588  1      OPC=retq

This code doesn't have the same issue (tzcnt doesn't have the undefined behavior and in any case the regs are zeroed with xor first), and stoke doens't complain.

bchurchill commented 7 years ago

I'm guessing the complaint is something to do with the fact that bsrl has undefined behavior when the input reg is 0, and the current intel behavior is to leave the reg unchanged.

I haven't looked at the code, but this sounds like the right explanation to me. In general, STOKE needs to make sure that the programs it generates do not perform reads on potentially undefined data. In the most general situation, this is an undecidable problem. So, we have a very simple analysis that approximates the solution, but sometimes it gives false positives, especially when whether something is defined or undefined depends on instruction semantics or the processor implementation.

You can try the --live_dangerously option, which will skip the startup check, but you'll have the problem that STOKE won't accept any rewrite which it thinks does an undefined read, so this probably won't help you. For your specific problem, I would recommend modifying the x86.csv in x64asm so that bsrl is considered as doing a write rather than leaving a register undefined. Could that work for you?

travisdowns commented 7 years ago

It could work, but I'm also confused as to why stoke needs the program at all in the synthesize step? I would have thought it works only on the test case input (which never contains any 0 input, so should never trigger the undefined behavior). Understanding what stoke actually uses the program for, when I'm doing synthesis and using "holdback" validation would help me understand better how to work around it.

The problem with modifying the x86.csv is that then the guarantees offered by the instruction would be stronger than the CPU implementation (i.e., I need to choose some behavior for input = 0 - say output = 0?), so stoke may get stuck finding solutions that rely on this behavior (e.g., using the fact that bsf(0) == 0). I suppose one workaround to that could be setting the value for bsf(0) to be something very random that is not likely to be useful in a solution and then verifying that this isn't used by the solution.

travisdowns commented 7 years ago

I suppose another workaround is just to zero out the destination register before the bsr call - can I freely modify the assembly in the bins/ dir (created by extract)? That is, are the other columns: # 1 0x400570 3 OPC=leal_r32_m16 used by stoke or are they informational only?

stefanheule commented 7 years ago

It could work, but I'm also confused as to why stoke needs the program at all in the synthesize step?

As @bchurchill pointed out, STOKE uses a simple data-flow analysis that ensures we only ever deal with programs that are guaranteed to not read undefined values. Getting a precise and fast analysis is difficult, so ours is an over-approximation that works reasonably well in most cases, but means STOKE can essentially NOT use the bsr instruction at all (because it can NEVER prove that the program doesn't read an undefined value, unless it just doesn't read the main output, which would not be so useful). This is unfortunate, but for now we don't have any real work-around.

The closest to a work-around is changing the x86.csv file, but as you note, that means STOKE will now use bsr even for 0 inputs.

stefanheule commented 7 years ago

I suppose another workaround is just to zero out the destination register before the bsr call - can I freely modify the assembly in the bins/ dir (created by extract)? That is, are the other columns: # 1 0x400570 3 OPC=leal_r32_m16 used by stoke or are they informational only?

Yes, you can modify those. The columns are not purely informational, they are used by the parser in some special cases, but if you remove them and modify the assembly, everything should work. Just make sure when you modify the code that you remove the comments (the parser doesn't NEED them, but will use them if they are present).

Essentially the parser uses the comments to dissambiguate between instructions that have the same string but can be representated in multiple ways (e.g. an instruction that adds the constant $0x1 to %rax can be represented by an instruction that uses an 8, 16 or 32 bit constant).

travisdowns commented 7 years ago

Can I leave the comments intact for the existing instructions, when I add my xor eax, eax (which won't have any comments).

stefanheule commented 7 years ago

Yes.

travisdowns commented 7 years ago

Thanks. I'm still curious what stoke uses the target program when synth/optimizing with "holdback", with --init zero. Is it only to calculate the cost to beat?

stefanheule commented 7 years ago

It is used for calculating the cost, yes (in particular to calculate the correct output, so the "correctness" term in the cost).

travisdowns commented 7 years ago

... but I thought that only the existing test cases only were used for the correctness calculation. I.e., that the correctness score is the sum (or other reduction) of the errors across all the applicable test cases, and it doesn't know more about the function than that (when using "holdback" validation strategy).

Does the synthesize step actually re-run the target function as part of the correctness calculation? This difference is important, because I may craft my test cases to avoid certain invalid inputs, and want a rewritten function that works for the remaining inputs, but I don't care what happens when invalid inputs are used. Furthermore, I may modify the target function (e.g., adding the xors or something else) and I want to know if stoke is using it.

I get that it will use it if you use the other proof-based validations, etc.

stefanheule commented 7 years ago

Test cases are only test inputs, so you need the target to determine what the correct output is. The "correctness" part of the cost function takes the test cases, runs both the target and rewrite on those inputs and compares the outputs (only live out).

travisdowns commented 7 years ago

Ah yes, I had already noted at some point that the test cases didn't define any output, but somehow forgot that fact!

On Nov 9, 2016 2:10 PM, "Stefan Heule" notifications@github.com wrote:

Test cases are only test inputs, so you need the target to determine what the correct output is. The "correctness" part of the cost function takes the test cases, runs both the target and rewrite on those inputs and compares the outputs (only live out).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/stoke/issues/939#issuecomment-259513379, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSswViUPtkLhgKxRsOIsDxGzqEJscW1ks5q8iiXgaJpZM4KtNaX .

bchurchill commented 7 years ago

I'll go ahead and close this -- feel free to open a new issue if you're having other problems.

travisdowns commented 7 years ago

Well I am still having this problem...

StanfordPL / stoke

stoke bails on registers written by bsr bsf instructions #939