`m68k` header very unorthogonal

ibabushkin commented 8 years ago

I am currently working on updated Haskell bindings for the upcoming release, and binding to the new header include/capstone/m68k.h proved more difficult than assumed. The issue isn't just the fact that the architecture in question is CISC and thus more complicated than the RISC instruction sets we support, but also the fact that the new header is different in coding style, poorly documented and in some cases highly unorthogonal.

Examples: For most architectures, we have a tagged union in the cs_<arch>_op struct, that holds information on an instruction's operands. The tags are elements of the <arch>_op_type enum. If there is not one-to-one correspondence between enum values and union members, it's noted in the comments, otherwise the connection is trivially clear. For instance, on MIPS:

//> Operand type for instruction's operands
typedef enum mips_op_type {
    MIPS_OP_INVALID = 0, // = CS_OP_INVALID (Uninitialized).
    MIPS_OP_REG, // = CS_OP_REG (Register operand).
    MIPS_OP_IMM, // = CS_OP_IMM (Immediate operand).
    MIPS_OP_MEM, // = CS_OP_MEM (Memory operand).
} mips_op_type;

// ...

// Instruction operand
typedef struct cs_mips_op {
    mips_op_type type;  // operand type
    union {
        mips_reg reg;       // register value for REG operand
        int64_t imm;        // immediate value for IMM operand
        mips_op_mem mem;    // base/index/scale/disp value for MEM operand
    };
} cs_mips_op;

Obviously, providing robust marshalling code for such a structure is easy. In this case, however, such a clear path can't be taken:

//> Operand type for instruction's operands
typedef enum m68k_op_type {
    M68K_OP_INVALID = 0, // = CS_OP_INVALID (Uninitialized).
    M68K_OP_REG,         // = CS_OP_REG (Register operand).
    M68K_OP_IMM,         // = CS_OP_IMM (Immediate operand).
    M68K_OP_MEM,         // = CS_OP_MEM (Memory operand).
    M68K_OP_FP,          // = CS_OP_FP  (Floating-Point operand)
    M68K_OP_REG_BITS,    // Register bits move
    M68K_OP_REG_PAIR,    // Register pair in the same op (upper 4 bits for first reg, lower for second) 
} m68k_op_type;

// ....

// Instruction operand
typedef struct cs_m68k_op {
    union {
        // everything clear here
        uint64_t imm;           // immediate value for IMM operand
        // how should I decide, what floating point value to use? I can't just take 32 bits from a double, obviously
        double dimm;            // double imm
        float simm;             // float imm
        m68k_reg reg;           // register value for REG operand
        m68k_op_mem mem;        // data when operand is targeting memory
        uint32_t register_bits; // register bits for movem/cas2/etc (always in d0-d7, a0-a7, fp0 - fp7 order)
    };
    m68k_op_type type;
    m68k_address_mode address_mode; // M68K addressing mode for this op
} cs_m68k_op;

My assumption is that a lot of the missing data is context-sensitive and essentially stored in a different structure. This approach has two problematic edges for me:

porting these structures to the bindings is pretty complicated and could be even impossible in some cases
such gotchas should probably be documented

Thus, I suggest that this issue gets adressed in the following way:

make clear how to read the union properly, and if possible, provide a way to do it without access to data outside the cs_m68k_op struct (otherwise I can't provide Haskell code to deal with this, sadly) - I'd love to help here, but I'd need some pointers regarding the architecture, as the reference didn't provide quick insight
document the result properly (I'd be able to help here as well, if I participate in the part above)

Also, in some cases, I'd use bool instead of uint8_t, but that's trivial to fix and not really an issue either.

That's it for now, everything else hapstone-related works fine :)

aquynh commented 8 years ago

@emoon, do you have any comments?

emoon commented 8 years ago

This is how the code in the C writer works

    case M68K_AM_IMMIDIATE:
             if (inst->op_size.type == M68K_SIZE_TYPE_FPU) {
                 if (inst->op_size.fpu_size == M68K_FPU_SIZE_SINGLE)
                     SStream_concat(O, "#%f", op->simm);
                 else if (inst->op_size.fpu_size == M68K_FPU_SIZE_DOUBLE)
                     SStream_concat(O, "#%f", op->dimm);
                 else
                     SStream_concat(O, "#<unsupported>");
                 break;
             }
             SStream_concat(O, "#$%x", op->imm);
             break;

The problem is that on 68K you can have different types of immediate values of the same size but different data.

For example

fmove.s #1.0,fp0

Immediate here is a float (float)

fmove.l #1,fp0

Immediate here is 32-bit (integer)

This make the deal with immediate values quite more complicated and thus I store

// Type of size that is being used for the current instruction
typedef enum m68k_size_type {
    M68K_SIZE_TYPE_INVALID = 0,
    M68K_SIZE_TYPE_CPU,
    M68K_SIZE_TYPE_FPU,
} m68k_size_type;

// Operation size of the current instruction (NOT the actually size of instruction)
typedef struct m68k_op_size {
    m68k_size_type type;
    union {
        m68k_cpu_size cpu_size;
        m68k_fpu_size fpu_size;
    };
} m68k_op_size;

// The M68K instruction and it's operands
typedef struct cs_m68k {
    // Number of operands of this instruction or 0 when instruction has no operand.
    cs_m68k_op operands[M68K_OPERAND_COUNT]; // operands for this instruction.
    m68k_op_size op_size;   // size of data operand works on in bytes (.b, .w, .l, etc)
    uint8_t op_count; // number of operands for the instruction
} cs_m68k;

So in order to deal with this I store a op_size which can be either for the FPU and CPU. (because as written above LONG and FLOAT is the same size)

Relevant discussion also https://github.com/aquynh/capstone/pull/487#discussion_r41082933

ibabushkin commented 8 years ago

But if we extend the m68k_op_type union to reflect on this, we'd get a few benefits:

we have the orthogonality needed (I'll elaborate on why this is needed in my usecase below)
we don't impose any overhead anywhere and don't break much because all structure and enum sizes stay the same
that way the m68k architecture is in sync with the others in terms of usage and complexity

As for the orthogonality thing: In Haskell, I provide marshalling code for each structure in the public headers, which works independently for each instance I store or read from a location. I could just provide incomplete code to do this and make things context-sensitive, but this adds a few rough edges, as people expect the marshalling routines to work differently.

Based on that, I suggest I'll try to implement this proposal (without deleting actual code), if you guys are interested in this. I also think we should do this before the release to avoid incompatibilities with third-party software later on.

Followup question: I think I understand the problem with immediate values better now, but why are operands with a size of 12 bytes (M68K_FPU_SIZE_EXTENDED) nowhere to be found? Also, why do we store two registers in one byte for register pairs? We have enough space in the union to put a small inline struct there without increasing overall struct size.

emoon commented 8 years ago

Based on that, I suggest I'll try to implement this proposal (without deleting actual code), if you guys are interested in this. I also think we should do this before the release to avoid incompatibilities with third-party software later on.

Sure. I don't mind at all as long as the code has the same behaviour.

Followup question: I think I understand the problem with immediate values better now, but why are operands with a size of 12 bytes (M68K_FPU_SIZE_EXTENDED) nowhere to be found?

I never implemented support for it. I have never seen any code actually use it in the wild (for immediate values) and it would require doing a conversion from 96 bits to 64 to actually print the value, while this might not be that hard to do it's currently un-implemented.

Also, why do we store two registers in one byte for register pairs? We have enough space in the union to put a small inline struct there without increasing overall struct size.

Sure that can be changed if wanted.

ibabushkin commented 8 years ago

Alright, I'll send a pull request these days.

emoon commented 8 years ago

👍

ibabushkin commented 8 years ago

Hm, this is weird: grepping for M68_OP_FP didn't reveal anything. Did you mean by this:

I never implemented support for it.

that floating point isn't used at all so far? That's not what I understood from your original comment, that's why I'm asking.

emoon commented 8 years ago

No, FPU instructions are implemented https://github.com/aquynh/capstone/blob/next/arch/M68K/M68KDisassembler.c#L1991 but I guess M68K_OP_FP isn't being set.

I was only referring to 96-bit floating point immediate values when saying I never added support for it.

ibabushkin commented 8 years ago

Alright, that's what I assumed. I will investigate further on why the operand type constant isn't used.

emoon commented 8 years ago

Alright, that's what I assumed. I will investigate further on why the operand type constant isn't used

I would guess that everything works because the instruction printer doesn't actually care what kind of opcode it is.

As this code

https://github.com/aquynh/capstone/blob/next/arch/M68K/M68KDisassembler.c#L1928

Deals with the FPU stuff it will set the appropriate used FPn register and the instruction printer just refers to a table where all the registers are present.

ibabushkin commented 8 years ago

Hm, it seems as if the function you linked does set the type to M68K_OP_IMM unconditionally. I assume the situation would be fixed if I inserted an update to the field here:

https://github.com/aquynh/capstone/blob/next/arch/M68K/M68KDisassembler.c#L2076

and in the adjacent lines. I will do this now, so I think the fix will be complete that way.

capstone-engine / capstone

`m68k` header very unorthogonal #676