StanfordPL / x64asm

x86-64 assembler library
Apache License 2.0
470 stars 60 forks source link

Large number of bugs in the pretty-printer, parser, spreadsheet and other components. #240

Closed stefanheule closed 7 years ago

stefanheule commented 7 years ago

Note: The scope of this issue has been significantly expanded, see messages further below.

For instance cmpq $0x80, %rdi # SIZE=4 gets parsed incorrectly into CMP_R64_IMM32. That opcode doesn't even have size 4.

bchurchill commented 7 years ago

Nah, that's correct. It's possible to encode the immediate as one byte and it will be sign extended to 64 bits. The SIZE is there to disambiguate between the encodings that take different immediate sizes. SIZE refers to total encoding size, not the opcode size.

On Aug 21, 2017 5:17 PM, "Stefan Heule" notifications@github.com wrote:

For instance cmpq $0x80, %rdi # SIZE=4 gets parsed incorrectly into CMP_R64_IMM32. That opcode doesn't even have size 4.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6aMv1c1x8MJJIgNGn4d3joI6Y7c8ks5sah4AgaJpZM4O993Y .

stefanheule commented 7 years ago

Sorry, that was a very short description, but I'm pretty sure this is a real bug. Let me try to be more explicit:

I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).

bchurchill commented 7 years ago

Ah, yes, you're right. I guess I was just surprised that doesn't work right! I thought that the parser actually assembles what it thinks it has parsed to make sure the length is right before accepting its input?

On Aug 21, 2017 11:48 PM, "Stefan Heule" notifications@github.com wrote:

Sorry, that was a very short description, but I'm pretty sure this is a real bug. Let me try to be more explicit:

I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323935242, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6eIuTtFYHlKXYGHpGbJSSkLuQR2Kks5sannBgaJpZM4O993Y .

stefanheule commented 7 years ago

Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240

bchurchill commented 7 years ago

aha, I see. Just glancing at the code that fix looks right.

On Aug 21, 2017 11:58 PM, "Stefan Heule" notifications@github.com wrote:

Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323937119, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6THvyb5VNvXG-12rNKyWnlAcw3zOks5sanwugaJpZM4O993Y .

bchurchill commented 7 years ago

I wonder how many hours I've wasted because of this bug and never known? Looks pretty bad...

On Aug 22, 2017 12:15 AM, "Berkeley Churchill" berkeleychurchill@gmail.com wrote:

aha, I see. Just glancing at the code that fix looks right.

On Aug 21, 2017 11:58 PM, "Stefan Heule" notifications@github.com wrote:

Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323937119, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6THvyb5VNvXG-12rNKyWnlAcw3zOks5sanwugaJpZM4O993Y .

stefanheule commented 7 years ago

Okay, I'm trying to dig into this issue a bit more, and there are more problems.

  1. We can't parse orw $0xffff, %r8w # SIZE=5 (i.e. as orw_r16_imm8), even though we should.
  2. We pretty print orw_r16_imm8 with the constant 0xff as orw $0xff, %r8w. This is not correct, or at the very least not consistent with other assemblers. Other assemblers will interpret this as or-ing r8w with 0x00ff, not 0xffff which is correct. We should be printing it as orw $0xffff, %r8w.

To fix this, we will have to take into account the following the operand size. For the orw examples above, it would be 16-bits. With this, we can realize that 0xffff can fit in the 8-bit immediate of orw_r16_imm8, but 0xff cannot and we need to use orw_r16_imm16 instead. For printing, we need to do something similar, where we can't print the 8-bit constant 0xff just as 0xff, but instead print it as 0xffff, if it would cause confusion otherwise.

Doing this properly should also allow us to completely get rid of the weird "poor" encoding thing we are doing right now in the parser, I believe.

Does this sound right? Am I missing something? @bchurchill, it would definitely good to hear your thoughts, as you know much more about this than I do.

stefanheule commented 7 years ago

Okay, I've re-implemented the handling of immediates for both printing and parsing, and fixed several bugs in the process.

Changes are here: https://github.com/StanfordPL/x64asm/compare/issue-240 Together with some fixes over in STOKE: https://github.com/StanfordPL/stoke/compare/issue-974

I'll do some more testing, but so far all the cases that didn't use to work or work incorrectly seem to be okay.

jrkoenig commented 7 years ago

I added parts to the haskell code that generates alternate instructions with the .s suffix. This should work for parsing, in that addb %cl, %dl should decode to the regular opcode whereas addb.s should be the alternate one.

Right now addb.s (%ecx), %dl won't work but the .s suffix should only be present if there is some ambiguity. The pretty printer currently does not add these suffixes, but the OPCODE= comment will disambiguate properly because the alternate coding is still in the list of possibilities for addb (no suffix), but won't be selected unless an OPCODE forces it to be.

stefanheule commented 7 years ago

Okay, I fixed all these problems. If you use STOKE, you may want to update to the latest develop version. However, it is not fully backwards compatible in the following sense: x86 programs that STOKE found or disassembled (all the .s files produced by STOKE) may no longer parse correctly (because they were printed incorrectly by the old STOKE).

Here is a description of the issues that were fixed:

Printing and parsing immediates did not work correctly. Some examples:

Under some circumstances, we would parse instructions with the wrong opcode. For instance: I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).

Could not disassemble the instructions shl, shr, sal, sar, rcl, rcr, rol, ror if they use a memory operand.

Could not disassemble cmp/vcmp pseudo-opcodes like cmpeq_sd or cmpneq_oq_ss.

Spreadsheet bugs

Disassembler had a memory and file descriptor leak. Disassembling too many programs will cause the program to run out of file descriptors.

Disassembler could parse a n-byte sequence into an instruction that takes more than n bytes.

Disassembler could parse a single instruction into several instructions (padding with no-ops). That doesn't seem right and may not be what a user expects. This is no longer the case with one or two exceptions.

We didn't support the .s suffix of some instruction variants. Thanks to Jason for fixing this one.

bchurchill commented 7 years ago

Thank you for finding/fixing all of these Stefan!

On Fri, Aug 25, 2017 at 1:55 PM, Stefan Heule notifications@github.com wrote:

Okay, I fixed all these problems. If you use STOKE, you may want to update to the latest develop version. However, it is not fully backwards compatible in the following sense: x86 programs that STOKE found or disassembled (all the .s files produced by STOKE) may no longer parse correctly (because they were printed incorrectly by the old STOKE).

Here is a description of the issues that were fixed:

Printing and parsing immediates did not work correctly. Some examples:

  • We can't parse orw $0xffff, %r8w # SIZE=5 (i.e. as orw_r16_imm8), even though we should.
  • We pretty print orw_r16_imm8 with the constant 0xff as orw $0xff, %r8w. This is not correct, or at the very least not consistent with other assemblers. Other assemblers will interpret this as or-ing r8w with 0x00ff, not 0xffff which is correct (imms are sign-extended). We should be printing it as orw $0xffff, %r8w.

Under some circumstances, we would parse instructions with the wrong opcode. For instance: I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).

Could not disassemble the instructions shl, shr, sal, sar, rcl, rcr, rol, ror if they use a memory operand.

Could not disassemble cmp/vcmp pseudo-opcodes like cmpeq_sd or cmpneq_oq_ss.

Spreadsheet bugs

  • VCVTDQ2PD had the wrong operands
  • Suffix problems for VCVTPD2DQ, CVTSD2SI, VCVTSD2SI, CVTSS2SI, VCVTSS2SI, CVTTSD2SI, VCVTTSD2SI, CVTTSS2SI, VCVTTSS2SI, VMOVNTDQ, POPF, PUSHF, MOVNTI
  • Incorrect opcode for VMOVNTDQ
  • Typo in opcode for VPCMPGTQ
  • Typo in opcode for VBROADCASTI128

Disassembler had a memory and file descriptor leak. Disassembling too many programs will cause the program to run out of file descriptors.

Disassembler could parse a n-byte sequence into an instruction that takes more than n bytes.

Disassembler could parse a single instruction into several instructions (padding with no-ops). That doesn't seem right and may not be what a user expects. This is no longer the case with one or two exceptions.

We didn't support the .s suffix of some instruction variants. Thanks to Jason for fixing this one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-325032338, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6ZhiosR607aiwVY2hCKZ7j8Y4clXks5sbzTTgaJpZM4O993Y .

-- Berkeley