Encoding Information for ARM

AngelDev06 commented 1 year ago

To my knowledge, the API exposes encoding-specific information (such as where immediates and displacements are located) for the x86 architecture via the cs_x86_encoding struct. I checked the arm.h header file but could not find any public struct that holds such information for ARM. Since my goal is to change the registers of some instructions, I would like the API to provide the offsets of those register fields if possible.
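To make the request concrete: once the bit offset and width of a register field are known, changing the register is a mask-and-shift on the instruction word. The following is a minimal sketch, assuming the A32 `MOV Rd, Rm` layout (Rd in bits 15-12, Rm in bits 3-0). `patch_field` is a hypothetical helper for illustration, not a Capstone function.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the `width`-bit field starting at bit `lsb` of a 32-bit
 * instruction word with `value`.
 * Hypothetical helper for illustration; not part of the Capstone API. */
static uint32_t patch_field(uint32_t insn, unsigned lsb, unsigned width,
			    uint32_t value)
{
	assert(width >= 1 && width <= 32 && lsb + width <= 32);
	uint32_t mask = (width == 32) ? UINT32_MAX
				      : ((UINT32_C(1) << width) - 1);
	assert((value & ~mask) == 0); /* value must fit into the field */
	/* Clear the old field, then OR in the new value. */
	return (insn & ~(mask << lsb)) | (value << lsb);
}
```

With these assumptions, `patch_field(0xE1A00001, 12, 4, 2)` turns `mov r0, r1` into `mov r2, r1`. The offset and width arguments are exactly the information the issue asks Capstone to expose.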

Rot127 commented 1 year ago

Now, for anyone who is interested in implementing this:

The auto-sync feature must be merged first before this can be done. So the minimum is https://github.com/capstone-engine/capstone/pull/1949 for ARM refactor.

If anyone plans to work on this before the Capstone release that introduces auto-sync (v5.1 or v6), the work on CS should be based on https://github.com/capstone-engine/capstone/pull/2026 (because it makes some more general changes to the auto-sync archs).

But in general, the operand information is easy to come by because the operand bits are defined in the target definition files (see for ARM: https://github.com/capstone-engine/llvm-capstone/blob/auto-sync/llvm/lib/Target/ARM/ARMInstrInfo.td).

The tables for operand bit position and length should be generated in PrinterCapstone::asmMatcherEmitMatchTable() with a new function. Inspiration can be taken from this commit, which generates the instruction formats for PPC instructions.

Work on the PrinterCapstone should be based on https://github.com/capstone-engine/llvm-capstone/pull/10 if it wasn't merged before.

AngelDev06 commented 1 year ago

So if I want to implement this myself for my own project, do I need to modify the updater, generate a new table consisting of info about the instructions, and then modify the source?

Rot127 commented 1 year ago

Yes. It's probably better to extend the ARMGenCSMappingOps.inc (table with mapping structs: Mapping.h::map_insn_ops). Please refer to the documentation for details on how the updater works: https://github.com/Rot127/capstone/blob/auto-sync-aarch64/docs/AutoSync.md and here https://github.com/Rot127/llvm-capstone/tree/tblgen_capstone_backends_aarch64

So if I want to implement this myself for my own project,

It would be nice though if you implemented it for all of us ;)

AngelDev06 commented 1 year ago

Hey, so I started implementing it and I am stuck on how I should approach this. Your first comment mentions that instruction encoding information is provided in the TableGen file named ARMInstrInfo. I assume that information is given by the number of bits declared for each operand, as shown here (Rd, Rn, lsb and width in this case):

[Screenshot 2023-06-05 234952]

My first attempt was to get the sizes by iterating over the fields whose names match the tags of the input and output operands, and reading the number of bits declared for each. While that would work for this specific instruction, it doesn't work for some others (most notably pseudo instructions, as they don't define such fields), as shown here:

[Screenshot 2023-06-06 005834]

I also want to get the bit positions from the let statements that override bits of the Inst field, such as let Inst{3-0} = Rn; in my example, but I am unable to find an API function that exposes those let statements as fields (so that I could get the bit range). Any ideas? (My intention is to achieve this without having to refactor the TableGen files.)

Rot127 commented 1 year ago

Whenever you work with Records the dump() function is your friend. You can print the fields of any Record and their types with Record->dump().

So for the CodeGenInstruction (this is the class you should use to get the encoding info, because it is the same across all targets) it gives:

ABSWr { // InstructionEncoding Instruction AArch64Inst EncodedI I Sched BaseOneOperandData Requires
  field bits<32> Inst = { 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, Rn{4}, Rn{3}, Rn{2}, Rn{1}, Rn{0}, Rd{4}, Rd{3}, Rd{2}, Rd{1}, Rd{0} };
  field bits<32> Unpredictable = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  field bits<32> SoftFail = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  int Size = 4;
  string DecoderNamespace = "";
  list<Predicate> Predicates = [HasCSSC];
  string DecoderMethod = "";
  bit hasCompleteDecoder = 1;
  string Namespace = "AArch64";
  dag OutOperandList = (outs GPR32:$Rd);
  dag InOperandList = (ins GPR32:$Rn);
  string AsmString = "abs       $Rd, $Rn";
  EncodingByHwMode EncodingInfos = ?;
  list<dag> Pattern = [(set GPR32:$Rd, (abs GPR32:$Rn))];
  list<Register> Uses = [];
  list<Register> Defs = [];
  int CodeSize = 0;
  int AddedComplexity = 0;
  bit isPreISelOpcode = 0;
  bit isReturn = 0;
  bit isBranch = 0;
  bit isEHScopeReturn = 0;
  bit isIndirectBranch = 0;
  bit isCompare = 0;
  bit isMoveImm = 0;
  bit isMoveReg = 0;
  bit isBitcast = 0;
  bit isSelect = 0;
  bit isBarrier = 0;
  bit isCall = 0;
  bit isAdd = 0;
  bit isTrap = 0;
  bit canFoldAsLoad = 0;
...

As you can see, the Inst field has each bit defined (right at the top).

I am not sure how best to implement this in Capstone currently. But off the top of my head, I would say it is useful to have the encoding in Capstone's Mapping.c::insn_map as an array of uint8_t, with one entry for each bit.

The upper 4 bits encode the type of the bit (opcode, reg op, imm op, predicate, invalid/not used etc.) and the lower 4 bits encode the index of the operand.

By index I do not mean the index in a cs_insn (this is not known at generation time), but the index within the MCInst (the one dumped above).

The MCOperands indices are counted from left to right if you concatenate outs + ins. So in the example above Inst.operands[0] = Rd and Inst.operands[1] = Rn.

For this scenario you would need to extend PrinterCapstone::printInsnMapEntry() to emit the encoding.

The actual bit width calculation of each operand within the instruction can then be done in Capstone.
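The per-bit scheme described above can be sketched as follows for the ABSWr record from the dump (bits 31-10 opcode, bits 9-5 Rn, bits 4-0 Rd). This is only an illustration under assumptions: the kind values, the helper names, and the choice to index the array by bit position (entry 0 = bit 0) are hypothetical and not part of Capstone or the generator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit kinds; names and values are illustrative only. */
enum { BIT_OPCODE = 0, BIT_REG_OP = 1, BIT_IMM_OP = 2, BIT_PRED = 3,
       BIT_UNUSED = 15 };

/* Upper 4 bits: kind of the bit. Lower 4 bits: MCInst operand index. */
static uint8_t enc_bit(unsigned kind, unsigned op_idx)
{
	assert(kind <= 0xF && op_idx <= 0xF);
	return (uint8_t)((kind << 4) | op_idx);
}

static unsigned enc_kind(uint8_t d) { return d >> 4; }
static unsigned enc_op_index(uint8_t d) { return d & 0xF; }

/* Fill `desc` (indexed by bit position, desc[0] = bit 0) for ABSWr.
 * MCInst operand 0 is Rd and operand 1 is Rn (outs followed by ins). */
static void describe_abswr(uint8_t desc[32])
{
	for (unsigned bit = 0; bit < 32; bit++) {
		if (bit <= 4)
			desc[bit] = enc_bit(BIT_REG_OP, 0); /* Rd */
		else if (bit <= 9)
			desc[bit] = enc_bit(BIT_REG_OP, 1); /* Rn */
		else
			desc[bit] = enc_bit(BIT_OPCODE, 0); /* fixed bits */
	}
}
```

In a real implementation the table would be emitted by the generator rather than filled at run time. The function here only shows what the 32 entries for this one instruction would contain.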

Hope this helps and gives a few ideas :)

AngelDev06 commented 1 year ago

Indeed, this helps a lot (which proves that the TableGen docs lack some useful information). However, I should mention that just providing the index of the operand may not be enough to get the actual bit width of the operand. That's because operands may not always be next to each other, such as in this example:

[Screenshot 2023-06-07 215433]

where imm5 (an immediate operand) is separated by 3 hardcoded bits from the last register operand. So subtracting the bit index of the immediate operand from that of the last register operand wouldn't give you the bit width of the immediate operand. And it gets even worse: some operands don't have their own bits next to each other. That means that a 4-bit register operand, for example, may have 1 bit at one location and the remaining 3 at another.

Also, what about ARM-specific operands? Should I add them too? For example, there is the condition field shown in my example and the reglist operand in instructions like LDM:

[Screenshot 2023-06-07 220640]
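The split-operand case can be illustrated with a small sketch. It assumes the A32 VFP convention for single-precision registers, where the register number is Vd:D with Vd in bits 15-12 and D in bit 22. The `bit_run` struct and `gather_operand` are hypothetical names, not an existing API.

```c
#include <assert.h>
#include <stdint.h>

/* One contiguous run of encoding bits that belongs to an operand.
 * Hypothetical type for illustration. */
struct bit_run {
	uint8_t lsb;   /* position of the lowest bit of the run */
	uint8_t width; /* number of bits in the run */
};

/* Concatenate the runs, most significant run first, into the operand
 * value. Runs of a full 32 bits are not supported in this sketch. */
static uint32_t gather_operand(uint32_t insn, const struct bit_run *runs,
			       unsigned n)
{
	uint32_t value = 0;
	for (unsigned i = 0; i < n; i++) {
		assert(runs[i].width >= 1 && runs[i].width < 32);
		assert(runs[i].lsb + runs[i].width <= 32);
		uint32_t mask = (UINT32_C(1) << runs[i].width) - 1;
		value = (value << runs[i].width) |
			((insn >> runs[i].lsb) & mask);
	}
	return value;
}
```

For an Sd operand the runs would be `{12, 4}` followed by `{22, 1}`: a 5-bit operand that no single offset/width pair can describe.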

Rot127 commented 1 year ago

providing the index of the operand to the instruction may not be enough to get the actual bit width of the operand.

But if you have an array which describes every bit, couldn't you just iterate over it and count the bits which belong to each operand?
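A minimal sketch of that counting loop, assuming the per-bit layout proposed earlier (upper 4 bits = kind, lower 4 bits = operand index, kind 0 = opcode bit). The function name and the kind value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KIND_OPCODE 0u /* assumed kind value for fixed opcode bits */

/* Count the bits tagged as operand `op_idx` in a per-bit descriptor
 * array. Because every bit is inspected individually, the result is
 * also correct for operands whose bits are not contiguous. */
static unsigned operand_bit_width(const uint8_t *desc, unsigned nbits,
				  unsigned op_idx)
{
	assert(desc != NULL && op_idx <= 0xF);
	unsigned width = 0;
	for (unsigned bit = 0; bit < nbits; bit++) {
		unsigned kind = desc[bit] >> 4;
		if (kind != KIND_OPCODE && (desc[bit] & 0xFu) == op_idx)
			width++;
	}
	return width;
}
```

The price is one pass over all encoding bits per query, plus one byte per bit in the table.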

Also what about ARM specific operands. Should I add them too?

Yes. Those are still operands and someone else might need them.

Generally: Have you already read the docs on how the architecture modules are designed? If not, please do so and step through this process with a debugger.

The new encoding information should reside in the Mapping component I think. And whenever the detail for an operand is added via ARM_add_cs_detail() function, the bit width within the encoding can be added as well.

AngelDev06 commented 1 year ago

But if you have an array which describes every bit, couldn't you just iterate over it and count the bits which belong to each operand?

Well, the way I have set it up right now isn't an array that describes each bit. In my opinion that would take way more space than it should. Also, I just read in your previous response that I should encode both the type of the operand and its bit position in one byte. That really wouldn't work, since ARM instructions are 32 bits and 4 bits aren't enough to encode the bit position. The structure I have currently created is the following:

[Screenshot 2023-06-08 230647]

I just have an array of that structure that is large enough to hold every operand encoding the instruction may have (operands are ordered left to right, starting from the output operands, like the PrinterCapstone::printInsnOpMapEntry function does). I have also already written the function that generates a string with all this information, and I will embed it in PrinterCapstone::printInsnMapEntry when I am done fixing the few bugs that are left.

The new encoding information should reside in the Mapping component I think. And whenever the detail for an operand is added via ARM_add_cs_detail() function, the bit width within the encoding can be added as well.

Isn't this function invoked only when detail is on? I might be wrong, but the cs_x86_encoding struct is provided even when detail is off. Therefore, what I was planning to do is extend the C function ARM_set_instr_map_data (which is invoked in ARM_getInstruction right after decoding the instruction) to map the generated encoding and fill the struct in cs_insn. If this should be detail-only, then I should instead generate the operands' encodings in ARMGenCSMappingInsnOp.inc and map them from the insn_operands table.

Rot127 commented 1 year ago

I like your struct. It has way more detail. Please implement it in Mapping.* though. It should be usable by all auto-sync archs. Not just ARM.

In my opinion that would take way more space than it should.

No need to be concerned about space. We should wrap this into CAPSTONE_DIET guards, so this code can be excluded from the binaries completely, if space is of any concern.
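A minimal sketch of such a guard, reusing the ABSWr layout from the dump above. CAPSTONE_DIET is the existing build macro; the table, the sentinel value and the accessor are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ENC_UNAVAILABLE 0xFF /* hypothetical sentinel for diet builds */

#ifndef CAPSTONE_DIET
/* Hypothetical per-bit table for ABSWr, indexed by bit position:
 * 0x10 = register operand 0 (Rd), 0x11 = register operand 1 (Rn),
 * 0x00 = opcode bit. */
static const uint8_t abswr_bits[32] = {
	0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, /* bits 0-7 */
	0x11, 0x11, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* bits 8-15 */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* bits 16-23 */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, /* bits 24-31 */
};
#endif

/* Return the descriptor of one encoding bit, or ENC_UNAVAILABLE when
 * the table was compiled out. */
static uint8_t encoding_bit_info(unsigned bit)
{
	assert(bit < 32);
#ifndef CAPSTONE_DIET
	return abswr_bits[bit];
#else
	return ENC_UNAVAILABLE;
#endif
}
```

Building with `-DCAPSTONE_DIET` drops the table from the binary entirely, and callers only ever see the sentinel.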

And I just have an array of that structure that is large enough to hold every operand encoding

I would say it is useful to also have the opcode bits in this array, not just the operand bits. If we add this kind of info, we can also add everything that is known.

Isn't this function invoked only when detail is on? [...] I was planning to do is extend the C function ARM_set_instr_map_data.

Yes, this is the better idea.

But I do think that this definitely belongs in cs_detail, since its use case is pretty narrow.

@XVilka @kabeor Any opinions? Or do you know someone who might have an opinion?

Also cc @gogo2464.

XVilka commented 1 year ago

I think it definitely belongs to the cs_detail.

Rot127 commented 5 months ago

Closed due to https://github.com/capstone-engine/capstone/pull/2045#issuecomment-2068848678

capstone-engine / capstone

Encoding Information for ARM #2031