The generated parse table consumes too much memory

The parse tables occupy about 400MB in memory after they are constructed, as can be seen in this profile collected by @RyanGlScott: verify-RSA.saw.pdf. There are two factors to this memory consumption:

The parse tables include all permutations of prefix bytes https://github.com/GaloisInc/flexdis86/blob/c19b55e3bfe3147484c8b26bb12b56b29678d4b1/src/Flexdis86/Disassembler.hs#L415-L420
The parser is encoded as a graph of Haskell objects https://github.com/GaloisInc/flexdis86/blob/c19b55e3bfe3147484c8b26bb12b56b29678d4b1/src/Flexdis86/Disassembler.hs#L171-L180

Addressing the former is tricky. One could use a simple DFA to parse prefix bytes separately to save an enormous amount of space. However, not all prefixes are valid for all instructions; those restrictions are currently properly encoded in the fully elaborated tables. To separate out prefix parsing, it would be necessary to add a post-parsing check to see if the parse was valid or not.

Addressing the latter might be less tricky, as we could change the representation of the tables. Another disassembler uses a mostly unboxed structure: https://github.com/travitch/dismantle/blob/48433e7ccb02924b2f4695c8c9f09fb9cfccdfc4/dismantle-tablegen/src/Dismantle/Tablegen/LinearizedTrie.hs#L34. The x86 case is a bit trickier as the parser has more states than the parsers generated by dismantle. However, we might be able to take inspiration from the more compact parse table representation and adapt it for flexdis.

I've been working on a fix for (1)—that is, de-duplicating all of the different combinations of prefixes from the opcode table—for a while. I've pushed my current WIP branch here in case you'd like to follow along. That branch is a bit messy at the moment, since I'm adding a separate disassembler, named defaultX64Disassembler', that de-duplicates prefixes alongside the existing defaultX64Disassembler. I've also copied a fair bit of related functionality, giving them apostrophe suffixes as well to distinguish them from the existing functionality. Once I'm confident that the new disassembler is bug-free, I'll consolidate all of these functions.

First, here is the general approach that I'm taking in this branch:

The structure of the opcode table has changed from this:
```
data OpcodeTable
  = OpcodeTable !NextOpcodeTable
  | SkipModRM !Prefixes !Def
  | ReadModRMTable !(V.Vector ModTable)
  | ReadModRMUnchecked !ModTable
```
To this:
```
data OpcodeTable'
  = OpcodeTable' !NextOpcodeTable'
  | OpcodeTableEntry ![Def] -- Defs expecting a ModR/M byte
                     ![Def] -- Defs not expecting a ModR/M byte
```
Previously, the OpcodeTable data constructor was used any time there was an instruction opcode or a prefix byte. In the new design, the OpcodeTable' data constructor is only used for instruction opcodes—prefixes are not encoded into the structure of the table at all.

One consequence of this choice is that there can sometimes be multiple instruction definitions that use a particular set of opcodes. For instance, both the add and the vpshufb instructions use 00 as their opcode. For this reason, the OpcodeTableEntry data constructor, which represents a leaf entry in the opcode table, must contain a list of Defs rather than just a single one. In practice, these lists are quite small, and we may want to consider using a SmallArray to represent them.

Once we have an OpcodeTableEntry, we then use the list of parsed prefixes to disambiguate among all of the potential Defs in the list. (Actually, there are two lists, since Defs with a ModR/M byte have to be disassembled in a slightly different way. In principle, we could combine these into a single list, however.) Implementing the function that performs this disambiguation is one of the main challenges of this patch—more on this in a bit.

Instead of encoding all possible combinations of prefixes in the branching structure of the opcode table, we instead perform a parsing pass upfront that parses as many prefixes as possible before parsing instruction opcodes. Here is what this parsing pass looks like at the moment:

loopPrefixBytes :: Seq.Seq Word8 -> m InstructionInstance
loopPrefixBytes prefixBytes = do
 b <- readByte
 if |  b `elem` simplePrefixBytes
    -> loopPrefixBytes (prefixBytes Seq.|> b)

    |  b `elem` segPrefixBytes
    -> loopPrefixBytes (prefixBytes Seq.|> b)

    |  b `elem` rexPrefixBytes
    -> loopPrefixBytes (prefixBytes Seq.|> b)

       -- Two-byte VEX prefix
    |  b == 0xc5
    -> do b2 <- readByte
          loopPrefixBytes (prefixBytes Seq.>< Seq.fromList [b, b2])

       -- Three-byte VEX prefix
    |  b == 0xc4
    -> do b2 <- readByte
          b3 <- readByte
          loopPrefixBytes (prefixBytes Seq.>< Seq.fromList [b, b2, b3])

    |  otherwise
    -> loopOpcodes prefixBytes tr0 b -- Disassemble instruction based on opcodes

If this looks too simplistic to work, that's because that is. More on this later.

Once we have all of the prefixes and instruction opcodes, we have to use the prefixes to disambiguate among all of the possible Defs. My branch contains a validatePrefixBytes function that performs this disambiguation. The implementation is a bit too long to go into here, but among other things, it checks that:
- An instruction actually accepts all of the parsed prefixes
- If the instruction has a required prefix, then that prefix has been parsed
- The instruction has appropriate address and operand sizes as dictated by the prefixes (see the validPrefix funtion)
- etc.
There are an enormous number of possible checks that we could put into this function—see this page for some ideas. It might be best to only implement the checks we need to disambiguate instructions and add more later if they become necessary.
Remove some ugly hacks in the XML file representing all x86_64 instructions (see data/optable.xml). For instance, there are quite a few nop definitions that have dummy 66 prefixes to accommodate the way opcode table parsing currently works, accompanied by this comment:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/data/optable.xml#L5249-L5254

Since we now parse prefixes upfront, I believe this hack should no longer be necessary, so we can remove these silly 66-prefixed definitions.

Another similar 66 prefix hack can be found in one of the definitions of xchg:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/data/optable.xml#L8391

Similarly, I think we can just remove the 66 here now that we handle prefixes differently.

This approach was enough to get nearly the entire flexdis86 test suite to pass, save for one exception:

      pause:                                    FAIL
        Exception: TODO RGS: No parse
        CallStack (from HasCallStack):
          error, called at src/Flexdis86/Disassembler.hs:1115:10 in flexdis86-0.1.5-inplace:Flexdis86.Disassembler
        Use -p '/pause/' to rerun this test only.

Let's talk about why this happens. The prefix parsing approach that I took in step (2) makes a key assumption: bytes that can be used as prefixes will never be used as the first byte in an instruction's opcodes. After all, the only way we know how to stop disassembling prefixes is to encounter a byte that isn't in the set of known prefix bytes, which we interpret as an instruction's first opcode byte. Unfortunately, this assumption turns out not to be true. In the case of the pause instruction, we have:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/data/optable.xml#L5640-L5645

But the 0xf3 byte is also used for the repz prefix:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/src/Flexdis86/Disassembler.hs#L315

As a result, we mistakenly parse pause's first instruction opcode as a prefix, which prevents us from finding pause in the opcode table later. One idea for working around this is to check if an instruction's opcodes start with bytes that could be interpreted as prefixes, and if so, "backtrack" through the list of parsed prefixes to remove the prefixes that were mistakenly classified as prefixes. This is essentially how the Haskell disassembler library handles parsing pause.

VEX prefixes (which the disassembler library does not handle, AFAICT) complicate matters further. The lds and les instructions have opcodes which clash with 0xc5 and 0xc4, the two-byte and three-byte VEX prefixes, respectively:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/data/optable.xml#L4256-L4263 https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/data/optable.xml#L4274-L4281

What's more, these prefixes are expected to be followed by some number of additional bytes, so not only would need to "backtrack" over parsing the VEX prefix, we would also need to backtrack over parsing the additional bytes that follow the prefix. Moreover, since lds and les' opcodes are each one byte, the additional bytes that we mistakenly parse upfront would correspond to the operands to the instruction! Hoo boy. It's a bit scary that the flexdis86 test suite doesn't catch this.

In short, I think we need some kind of way to perform this backtracking. I have some ideas for how to do this, but before I set out on implementing this idea, I wanted to do a quick sanity check with @travitch to make sure I'm on the right path. Can you see a simpler way to solve these problems?

I wonder if, instead of "backtracking", we could just remember the last byte in the prefixes we parsed, and incorporate those "last bytes" into the validity/opcode checks.

I also think we might need to be a bit liberal in accepting prefixes that are invalid if the parse is otherwise unambiguous (e.g., see https://repzret.org/p/repzret/)

I wonder if, instead of "backtracking", we could just remember the last byte in the prefixes we parsed, and incorporate those "last bytes" into the validity/opcode checks.

I considered this, but one complication with this idea is that the opcode table encodes the invariant that every instruction is reachable by a path containing at least one OpcodeTable data constructor. If this invariant is violated, you will reach one of these error cases:

https://github.com/GaloisInc/flexdis86/blob/7109bdc9990a3e756eb7fb07419737d15ad41da0/src/Flexdis86/Disassembler.hs#L994-L996

(In my branch, there is a corresponding error case for OpcodeTableEntry.)

For instructions like lds and les, the opcode consists of only a single byte. If we parse this byte as a prefix, then there are no bytes remaining to use for the OpcodeTable path, so we would not be able to store these instructions in the table. The backtracking idea is the only thing I can think of for making this work. That is, encode lds and les' opcodes into the the table as normal, parse its opcode as prefix bytes, and then when we try to check that we have the lds or les instruction, backtrack over the prefix bytes so that we can look them up in the opcode table as intended.

I also think we might need to be a bit liberal in accepting prefixes that are invalid if the parse is otherwise unambiguous (e.g., see https://repzret.org/p/repzret/)

Huh, that's an interesting read. I don't think this particular example poses an issue for my branch, as it can successfully disassemble f3 c3 using the set of validity checks that I have currently. But this is a worthwhile cautionary tale to not go overboard with adding too many additional validity checks, lest we reject examples like this one.

Regarding repz ret, I bet the data file specifies repz as an allowed prefix when it really isn't (to actually parse this correctly). I could be wrong - that is just a guess.

Regarding repz ret, I bet the data file specifies repz as an allowed prefix when it really isn't (to actually parse this correctly).

This seems likely, as this prefix was added for... reasons in b625205347ea5e9a459565c2c3e3f7c38b3ae60d. Nevertheless, I'm not too bothered by this, as this is a hack that would be needed in both the current and new designs. (If it were a hack that was only needed in one particular design, that would be a bit more eyebrow-raising.)

This might be effectively backtracking, but there could be a PseudoOpcodeTable constructor that acts mostly like OpcodeTable, except it means "Dispatch on the last byte decoded from a prefix". I'm browsing through to see how to implement that

This might be effectively backtracking, but there could be a PseudoOpcodeTable constructor that acts mostly like OpcodeTable, except it means "Dispatch on the last byte decoded from a prefix".

That might work, although things would get complicated for instructions whose operands are parsed eagerly as prefixes due to VEX. I was thinking of instead having a newtype like this:

newtype BacktrackingByteReader m a = BacktrackingByteReader (StateT [Word8] m a)

And giving it a ByteReader instance such that when readByte is called, it will read it from the [Word8] if it is non-empty (and pop off the byte afterward), and disassemble from bytes otherwise. This would avoid needing to change any of the operand-dissembling code to be aware of backtracking, as we could continue to program against the polymorphic ByteReader interface.

Just to record it, we talked about having a separate set of parse tables for VEX instructions that would be consulted iff the VEX prefix is in the set of parsed prefixes. That would avoid a need for backtracking in those cases.

It would be nice to know if there are cases besides pause that exist in the rest of the instruction space. pause is truly unfortunate, because it really isn't an instruction - it is just repz nop, which is treated as a special hint named pause. If there are a small set of such instructions (where they are just aliases), we may want to consider instead separating those out from the data table and just using them to guide pretty printing (rather than decoding).

Operationally, that would mean parsing f3 90 as a NOP, but noticing that it can be rendered as pause and fixing it up after the fact.

Just to record it, we talked about having a separate set of parse tables for VEX instructions that would be consulted iff the VEX prefix is in the set of parsed prefixes. That would avoid a need for backtracking in those cases.

As an experiment, I pushed a branch here that encodes VEX prefix bytes into the opcode table alongside the instruction opcodes to avoid any conflicts with instructions like lds and les. Here is how large the opcode table is on that branch:

λ> nextOpcodeSize' defaultX64Disassembler'
13143

For comparison, here is how large it is on the main branch:

λ> nextOpcodeSize defaultX64Disassembler
5869432

And here is how large it is after removing VEX prefixes:

λ> nextOpcodeSize' defaultX64Disassembler'
1409

That being said, one thing about lds and les that I did not realize until recently is that they only work in 32-bit mode, and since defaultX64Disassembler filters out instructions that don't work in 64-bit mode, they aren't even included in the table to begin with. As a result, I'm forced to partially retract my "hoo boy" comment from https://github.com/GaloisInc/flexdis86/issues/40#issuecomment-1161782312.

Speaking of which, are there other instructions whose opcodes conflict with bytes used for prefixes? Thankfully, les and lds are the only ones that conflict with VEX prefixes. What about the remaining 16 bytes used for prefixes? I audited data/optable.xml, and here are the results:

Segment prefixes

No conflicts in any of 0x26, 0x2e, 0x36, 0x3e, 0x64, or 0x65, thankfully enough.

"Simple" prefixes

0x66: As noted in https://github.com/GaloisInc/flexdis86/issues/40#issuecomment-1161782312, there were several hacky occurrences of nop and xchg that had 0x66 placed in front of their opcodes to work around limitations of the current approach to disassembling, but these hacks can be removed with the new approach.

Another conflict is vpcmpgtd:
```
<instruction>
  <mnemonic>vpcmpgtd</mnemonic>
  <class>avx</class>
  <def>
      <opc>/vex=NDS.128.66.0F.WIG 66</opc>
      <opr>Vx Hx Ux</opr>
  </def>
</instruction>
```
0x67: No conflicts
0xf0: No conflicts

0xf2: The vpslld instruction has some conflicts:

<instruction>
<mnemonic>vpslld</mnemonic>
<class>avx</class>
<def>
  <opc>/vex=NDS.128.66.0F.WIG F2</opc>
  <opr>Vx Hx Wx</opr>
</def>
...
<def>
  <opc>/vex=NDS.256.66.0F.WIG F2</opc>
  <opr>Vx Hx Wx</opr>
</def>
...
</instruction>

0xf3: endbr32 and endbr64 conflict:

<instruction>
<mnemonic>endbr32</mnemonic>
<def>
  <opc>F3 0F 1E FB</opc>
</def>
</instruction>

<instruction>
<mnemonic>endbr64</mnemonic>
<def>
  <opc>F3 0F 1E FA</opc>
</def>
</instruction>

As does vpsllq:

<instruction>
  <mnemonic>vpsllq</mnemonic>
  <class>avx</class>
  <def>
      <opc>/vex=NDS.128.66.0F.WIG F3</opc>
      <opr>Vx Ux Wx</opr>
  </def>
  ...
</instruction>

And our old friend pause:

<instruction>
  <mnemonic>pause</mnemonic>
  <def>
      <opc>f3 90</opc>
  </def>
</instruction>

REX prefixes (64-bit mode only)

The inc instruction can have opcodes 0x40 through 0x47 and the dec instruction can have opcodes 0x48 through 0x4f, all of which are REX prefixes. Each instruction has different opcodes in 64-bit mode, however, so this issue shouldn't show up in the flexdis86 test suite for the reasons described above.

The vpclmulqdq also conflicts with the 0x44 REX prefix:

<instruction>
  <mnemonic>vpclmulqdq</mnemonic>
  <class>avx</class>
  <def>
    <opc>/vex=NDS.128.66.0F3A.WIG 44</opc>
    <opr>Vdq Hdq Wdq Ib</opr>
  </def>
</instruction>

Several of these issues would be avoided by encoding VEX prefixes into the opcode table. Moreover, the inc and dec instructions wouldn't pose issues in 64-bit mode. By my count, that only leaves endbr32, endbr64, and pause as potential sources of conflict.

endbr32 and endbr64 are also special no-ops

At long last, #43 takes care of part (1). Part (2) may also be worth doing, but I imagine (1) alone will be enough to knock out most of the egregious memory usage, especially in light of https://github.com/GaloisInc/flexdis86/pull/43#issuecomment-1172601965.

GaloisInc / flexdis86