Closed stefanheule closed 7 years ago
Nah, that's correct. It's possible to encode the immediate as one byte and it will be sign extended to 64 bits. The SIZE is there to disambiguate between the encodings that take different immediate sizes. SIZE refers to total encoding size, not the opcode size.
On Aug 21, 2017 5:17 PM, "Stefan Heule" notifications@github.com wrote:
For instance cmpq $0x80, %rdi # SIZE=4 gets parsed incorrectly into CMP_R64_IMM32. That opcode doesn't even have size 4.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6aMv1c1x8MJJIgNGn4d3joI6Y7c8ks5sah4AgaJpZM4O993Y .
Sorry, that was a very short description, but I'm pretty sure this is a real bug. Let me try to be more explicit:
I'm trying to parse cmpq $0x80, %rdi # SIZE=4
, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8
(with the encoding 48 83 ff 80
), but it gets parsed as CMP_R64_IMM32
instead. This is incorrect, because the CMP_R64_IMM32
opcode requires 7 bytes (48 81 ff 80 ff ff ff
).
Ah, yes, you're right. I guess I was just surprised that doesn't work right! I thought that the parser actually assembles what it thinks it has parsed to make sure the length is right before accepting its input?
On Aug 21, 2017 11:48 PM, "Stefan Heule" notifications@github.com wrote:
Sorry, that was a very short description, but I'm pretty sure this is a real bug. Let me try to be more explicit:
I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323935242, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6eIuTtFYHlKXYGHpGbJSSkLuQR2Kks5sannBgaJpZM4O993Y .
Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240
aha, I see. Just glancing at the code that fix looks right.
On Aug 21, 2017 11:58 PM, "Stefan Heule" notifications@github.com wrote:
Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323937119, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6THvyb5VNvXG-12rNKyWnlAcw3zOks5sanwugaJpZM4O993Y .
I wonder how many hours I've wasted because of this bug and never known? Looks pretty bad...
On Aug 22, 2017 12:15 AM, "Berkeley Churchill" berkeleychurchill@gmail.com wrote:
aha, I see. Just glancing at the code that fix looks right.
On Aug 21, 2017 11:58 PM, "Stefan Heule" notifications@github.com wrote:
Yeah, that's what it is supposed to do, but we got it wrong. My proposed fix for this is in https://github.com/StanfordPL/x64asm/tree/issue-240
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-323937119, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6THvyb5VNvXG-12rNKyWnlAcw3zOks5sanwugaJpZM4O993Y .
Okay, I'm trying to dig into this issue a bit more, and there are more problems.
orw $0xffff, %r8w # SIZE=5
(i.e. as orw_r16_imm8
), even though we should.orw_r16_imm8
with the constant 0xff
as orw $0xff, %r8w
. This is not correct, or at the very least not consistent with other assemblers. Other assemblers will interpret this as or-ing r8w with 0x00ff, not 0xffff which is correct. We should be printing it as orw $0xffff, %r8w
.To fix this, we will have to take into account the following the operand size. For the orw examples above, it would be 16-bits. With this, we can realize that 0xffff can fit in the 8-bit immediate of orw_r16_imm8
, but 0xff cannot and we need to use orw_r16_imm16
instead. For printing, we need to do something similar, where we can't print the 8-bit constant 0xff just as 0xff, but instead print it as 0xffff, if it would cause confusion otherwise.
Doing this properly should also allow us to completely get rid of the weird "poor" encoding thing we are doing right now in the parser, I believe.
Does this sound right? Am I missing something? @bchurchill, it would definitely good to hear your thoughts, as you know much more about this than I do.
Okay, I've re-implemented the handling of immediates for both printing and parsing, and fixed several bugs in the process.
Changes are here: https://github.com/StanfordPL/x64asm/compare/issue-240 Together with some fixes over in STOKE: https://github.com/StanfordPL/stoke/compare/issue-974
I'll do some more testing, but so far all the cases that didn't use to work or work incorrectly seem to be okay.
I added parts to the haskell code that generates alternate instructions with the .s suffix. This should work for parsing, in that addb %cl, %dl should decode to the regular opcode whereas addb.s should be the alternate one.
Right now addb.s (%ecx), %dl won't work but the .s suffix should only be present if there is some ambiguity. The pretty printer currently does not add these suffixes, but the OPCODE= comment will disambiguate properly because the alternate coding is still in the list of possibilities for addb (no suffix), but won't be selected unless an OPCODE forces it to be.
Okay, I fixed all these problems. If you use STOKE, you may want to update to the latest develop version. However, it is not fully backwards compatible in the following sense: x86 programs that STOKE found or disassembled (all the .s files produced by STOKE) may no longer parse correctly (because they were printed incorrectly by the old STOKE).
Here is a description of the issues that were fixed:
Printing and parsing immediates did not work correctly. Some examples:
Under some circumstances, we would parse instructions with the wrong opcode. For instance: I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).
Could not disassemble the instructions shl, shr, sal, sar, rcl, rcr, rol, ror if they use a memory operand.
Could not disassemble cmp/vcmp pseudo-opcodes like cmpeq_sd or cmpneq_oq_ss.
Spreadsheet bugs
Disassembler had a memory and file descriptor leak. Disassembling too many programs will cause the program to run out of file descriptors.
Disassembler could parse a n-byte sequence into an instruction that takes more than n bytes.
Disassembler could parse a single instruction into several instructions (padding with no-ops). That doesn't seem right and may not be what a user expects. This is no longer the case with one or two exceptions.
We didn't support the .s suffix of some instruction variants. Thanks to Jason for fixing this one.
Thank you for finding/fixing all of these Stefan!
On Fri, Aug 25, 2017 at 1:55 PM, Stefan Heule notifications@github.com wrote:
Okay, I fixed all these problems. If you use STOKE, you may want to update to the latest develop version. However, it is not fully backwards compatible in the following sense: x86 programs that STOKE found or disassembled (all the .s files produced by STOKE) may no longer parse correctly (because they were printed incorrectly by the old STOKE).
Here is a description of the issues that were fixed:
Printing and parsing immediates did not work correctly. Some examples:
- We can't parse orw $0xffff, %r8w # SIZE=5 (i.e. as orw_r16_imm8), even though we should.
- We pretty print orw_r16_imm8 with the constant 0xff as orw $0xff, %r8w. This is not correct, or at the very least not consistent with other assemblers. Other assemblers will interpret this as or-ing r8w with 0x00ff, not 0xffff which is correct (imms are sign-extended). We should be printing it as orw $0xffff, %r8w.
Under some circumstances, we would parse instructions with the wrong opcode. For instance: I'm trying to parse cmpq $0x80, %rdi # SIZE=4, which should only succeed if the instruction fits in at most 4 bytes (total encoding size, as you correctly say). For that, the only option is to parse as opcode CMP_R64_IMM8 (with the encoding 48 83 ff 80), but it gets parsed as CMP_R64_IMM32 instead. This is incorrect, because the CMP_R64_IMM32 opcode requires 7 bytes (48 81 ff 80 ff ff ff).
Could not disassemble the instructions shl, shr, sal, sar, rcl, rcr, rol, ror if they use a memory operand.
Could not disassemble cmp/vcmp pseudo-opcodes like cmpeq_sd or cmpneq_oq_ss.
Spreadsheet bugs
- VCVTDQ2PD had the wrong operands
- Suffix problems for VCVTPD2DQ, CVTSD2SI, VCVTSD2SI, CVTSS2SI, VCVTSS2SI, CVTTSD2SI, VCVTTSD2SI, CVTTSS2SI, VCVTTSS2SI, VMOVNTDQ, POPF, PUSHF, MOVNTI
- Incorrect opcode for VMOVNTDQ
- Typo in opcode for VPCMPGTQ
- Typo in opcode for VBROADCASTI128
Disassembler had a memory and file descriptor leak. Disassembling too many programs will cause the program to run out of file descriptors.
Disassembler could parse a n-byte sequence into an instruction that takes more than n bytes.
Disassembler could parse a single instruction into several instructions (padding with no-ops). That doesn't seem right and may not be what a user expects. This is no longer the case with one or two exceptions.
We didn't support the .s suffix of some instruction variants. Thanks to Jason for fixing this one.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/StanfordPL/x64asm/issues/240#issuecomment-325032338, or mute the thread https://github.com/notifications/unsubscribe-auth/AAnG6ZhiosR607aiwVY2hCKZ7j8Y4clXks5sbzTTgaJpZM4O993Y .
-- Berkeley
Note: The scope of this issue has been significantly expanded, see messages further below.
For instance
cmpq $0x80, %rdi # SIZE=4
gets parsed incorrectly into CMP_R64_IMM32. That opcode doesn't even have size 4.