Open Rot127 opened 1 year ago
@kabeor @aquynh @Rot127 I propose to make the next release, with auto-sync changes a 6.0, not 5.1 because:
https://github.com/orgs/capstone-engine/projects/1 - then it would need to be updated too.
please can you summarize the API changes here?
@aquynh
ARM
ARM_CC_
-> ARMCC
VPUSH
, VPOP
).GROUP_INT
)
RET
, INT
should be added via Mapper separately.ARM_GRP_CRC
are renamed to match LLVM nameing: ARM_FEATURE_HasCRC
V8
, MCLASS
, ARM
, THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features.[]
brackets were added.writeback
is part of detail
and no longer of detail.arm
.r15 = pc
etc.) are no longer printed as default. Must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS
or -a
for the cstool
.uint32_t
, no longer int32_t
.PPC
PPC_BC_NU_PLUS
-> PPC_PRED_NU_PLUS
).cs_ppc.bc
.reg = r0
. This is fixed now.ppc_ops_crx
is removed (wasn't used).AArch64
ARM64
-> AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM.SME
operands changed (contin more detail, terminology is closer to the docs).This list is also part of the PR.
@aquynh
ARM
Enum changes:
ARM_CC_
->ARMCC
- System registers are renamed to match C++ namespaces. Also group Banked and system registers into different groups.
- Some instr. enum entries no longer exist (e.g.
VPUSH
,VPOP
).Some instruction groups which are not part of LLVM were removed (e.g.
GROUP_INT
)
- Groups like
RET
,INT
should be added via Mapper separately.- Feature groups like
ARM_GRP_CRC
are renamed to match LLVM nameing:ARM_FEATURE_HasCRC
- Features are now checked more strictly (
V8
,MCLASS
,ARM
,THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features.- The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added.writeback
is part ofdetail
and no longer ofdetail.arm
.- Register alias not defined in LLVM (
r15 = pc
etc.) are no longer printed as default. Must be enabled viaCS_OPT_SYNTAX_CS_REG_ALIAS
or-a
for thecstool
.- The immediate value of operands is no of type
uint32_t
, no longerint32_t
.PPC
- Predicate enums members are renamed. They now use the LLVM name (e.g.
PPC_BC_NU_PLUS
->PPC_PRED_NU_PLUS
).- Branch conditions are now saved in more detail in
cs_ppc.bc
.- The base register of an PPC memory operand was not present if
reg = r0
. This is fixed now.ppc_ops_crx
is removed (wasn't used).AArch64
- Renamed all
ARM64
->AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM.SME
operands changed (contin more detail, terminology is closer to the docs).- System operands change (now categorized into SysAlias, SysImm, SysReg).
This list is also part of the PR.
cant we avoid breaking compatibility?
@aquynh The short answer is no.
But let me go into more details also for others:
The problem with automatic Capstone updates is that due to the C++ and C difference we have many cases to handle when C is not equivalent to C++.
To reduce those cases we need to be as close to the original C++ syntax and semantic as possible. Because every renaming (i.e. enum values), semantic and overall design changes, almost always add manual work during an update.
This is why those breaking changes are needed. Each of them moves Capstone code semantically or syntactically closer to the LLVM definitions.
This is of cause a pain for compatibility, but it is definitely worth it in the long run. Because:
auto-sync
archs are semantically pretty much equivalent to LLVM.
llvm-objdump
.auto-sync
archs without much trouble).Mapping.*
files).Here more detail to each breaking change.
Enum changes
:
Done to match LLVM naming. It saves us to change enum names over several files whenever we update.
ARM
Some instruction groups which are not part of LLVM were removed (e.g.
GROUP_INT
)
Capstone unique instr. groups (like RET
, INT
) are added via Mapper separately (is on the toddo list). Because they are not defined in LLVM, we can not generate them without adding exceptions again.
Features are now checked more strictly (
V8
,MCLASS
,ARM
,THUMB
) because instruction aliases are supported now. And those alias might change depending on enabled features.
Simple necessity, because with the new instructions the same bytes have a valid decoding depending on the enabled features.
The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added.
Move closer to the LLVM logic. The disponent of a memory access doesn't need to be within the []
brackets (e.g. strt fp, [sp], 4
). But the disponent is defined as part of the memory operand. This was incorrectly represented in CS before.
writeback
is part ofdetail
and no longer ofdetail.arm
.
We support now the concept of Tied
operands (the way LLVM describes writeback registers). So writeback information is now known for all auto-sync
archs.
Register alias not defined in LLVM (
r15 = pc
etc.) are no longer printed as default. Must be enabled viaCS_OPT_SYNTAX_CS_REG_ALIAS
or-a
for thecstool
.
As said before, modules will be more equivalent to the llvm-objdump
results. Also the register naming and the decoded asm string.
The immediate value of operands is not of type
uint32_t
, no longerint32_t
.
See: https://github.com/capstone-engine/capstone/issues/2056
PPC
Branch conditions are now saved in more detail in
cs_ppc.bc
.
Just a nice feature we are now able to provide.
The base register of an PPC memory operand was not present if
reg = r0
. This is fixed now.
A semantical fix. The base register should have been set.
ppc_ops_crx
is removed (wasn't used).
Wasn't used.
AArch64
Renamed all
ARM64
->AArch64
(for filenames, enums variable names). Necessary to be consistent with LLVM.
This is a big one. But having two names for the same architecture in the code is a nightmare for generation. Also it just doesn't bring any value. Being closer to LLVM is the choice here.
SME
operands changed (contain more detail, terminology is closer to the docs).
Again a nice feature addition because we save more detail. Being closer to the official docs when it comes to naming eases integrating Capstone in other projects.
System operands change (now categorized into SysAlias, SysImm, SysReg).
Again, move to LLVM semantic because:
Personally, I think that Capstone will become more and more irrelevant as disassembler engine if we:
PPC
) and others).If we do not go through the pain of breaking compatibility once, to gain log term improvements, Capstone just won't be competitive to other disassemblers in the future.
I understand that you want to move towards C++ because llvm is C++, but i dont see why that is important.
I dont mind adding new things so we can gradually move to better naming in the next versions, but at the same time lets try not to rename or remove anything so we have backward compatibility. I think it can be tricky, but it is doable.
I very much appreciate your effort so far. Thanks a lot for your great work 🙏
I never say it is not merged, Anton.
On Mon, Jun 26, 2023, 23:12 Anton Kochkov @.***> wrote:
If auto-sync work is not merged, I am afraid we have to fork the capstone. It's your choice - you want updated architectures or not.
— Reply to this email directly, view it on GitHub https://github.com/capstone-engine/capstone/issues/2015#issuecomment-1607692581, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNQNYGJCQLEGSSBCAUYCGTXNGRFNANCNFSM6AAAAAAX3POTSU . You are receiving this because you were mentioned.Message ID: @.***>
I am taking about having backward compatibility, and try to see how much compromise we can have.
Is there any miscommunication?
On Mon, Jun 26, 2023, 23:23 Nguyen Anh Quynh @.***> wrote:
I never say it is not merged, Anton.
On Mon, Jun 26, 2023, 23:12 Anton Kochkov @.***> wrote:
If auto-sync work is not merged, I am afraid we have to fork the capstone. It's your choice - you want updated architectures or not.
— Reply to this email directly, view it on GitHub https://github.com/capstone-engine/capstone/issues/2015#issuecomment-1607692581, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNQNYGJCQLEGSSBCAUYCGTXNGRFNANCNFSM6AAAAAAX3POTSU . You are receiving this because you were mentioned.Message ID: @.***>
lets take one example: we want to rename ARM_CC
to ARMCC
.
can we have compatibility by keeping ARM_CC
, and add (new) ARMCC
, so everyone is happy?
I answer your suggestions later. On the phone it is difficult to write well.
As a general note though:
I second @XVilkas here.
I work on this now for half a year and opened the ARM PR as draft after two-three months. Exactly to ask for this kind of feedback, suggestions on design and other choices.
This is also why I added the list of breaking changes to the PR description, updated it continuously and asked for feedback on the big ones. So no one needed to read through the code and can save time.
Also, I think I made clear that I am more than happy to provide detailed answers and provide more details if requested.
As I stated in the ARM PR, my time I can spend on this is limited (until end of July). And we want and need to start building on it in Rizin.
With all due respect, but there were at least four and a half months to discuss a big decision like this. And we really need to carry on.
26 Jun 2023 17:23:51 Nguyen Anh Quynh @.***>:
I never say it is not merged, Anton.
On Mon, Jun 26, 2023, 23:12 Anton Kochkov @.***> wrote:
If auto-sync work is not merged, I am afraid we have to fork the capstone. It's your choice - you want updated architectures or not.
— Reply to this email directly, view it on GitHub https://github.com/capstone-engine/capstone/issues/2015#issuecomment-1607692581, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNQNYGJCQLEGSSBCAUYCGTXNGRFNANCNFSM6AAAAAAX3POTSU . You are receiving this because you were mentioned.Message ID: @.***>
— Reply to this email directly, view it on GitHub[https://github.com/capstone-engine/capstone/issues/2015#issuecomment-1607714654], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AK5ET6CWWG5MAD4QVHRZWLTXNGSQLANCNFSM6AAAAAAX3POTSU]. You are receiving this because you were mentioned.[Tracking image][https://github.com/notifications/beacon/AK5ET6B4GEQVBVWGEALNKVLXNGSQLA5CNFSM6AAAAAAX3POTSWWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTS72PDV4.gif]
I answer your suggestions later. On the phone it is difficult to write well. As a general note though: I second @XVilkas here. I work on this now for half a year and opened the ARM PR as draft after two-three months. Exactly to ask for this kind of feedback, suggestions on design and other choices. This is also why I added the list of breaking changes to the PR description, updated it continuously and asked for feedback on the big ones. So no one needed to read through the code and can save time. Also, I think I made clear that I am more than happy to provide detailed answers and provide more details if requested. As I stated in the ARM PR, my time I can spend on this is limited (until end of July). And we want and need to start building on it in Rizin. With all due respect, but there were at least four and a half months to discuss a big decision like this. And we really need to carry on. 26 Jun 2023 17:23:51 Nguyen Anh Quynh @.>: … I never say it is not merged, Anton. On Mon, Jun 26, 2023, 23:12 Anton Kochkov @.> wrote: > If auto-sync work is not merged, I am afraid we have to fork the capstone. > It's your choice - you want updated architectures or not. > > — > Reply to this email directly, view it on GitHub > <#2015 (comment)>, > or unsubscribe > https://github.com/notifications/unsubscribe-auth/ABNQNYGJCQLEGSSBCAUYCGTXNGRFNANCNFSM6AAAAAAX3POTSU > . > You are receiving this because you were mentioned.Message ID: > @.***> > — Reply to this email directly, view it on GitHub[#2015 (comment)], or unsubscribe[https://github.com/notifications/unsubscribe-auth/AK5ET6CWWG5MAD4QVHRZWLTXNGSQLANCNFSM6AAAAAAX3POTSU]. You are receiving this because you were mentioned.[Tracking image][https://github.com/notifications/beacon/AK5ET6B4GEQVBVWGEALNKVLXNGSQLA5CNFSM6AAAAAAX3POTSWWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTS72PDV4.gif]
understood - we are all short of time, especially those who are maintaining this project without paying, in spare time.
we will try to merge this in July.
@aquynh I agree that the compatibility concern is very valid. But since I asked in the PRs and got no push back I assumed modernizing is more important.
I would propose to finish up the v5
release and add a note that v6
will bring big change.
If people try out the next branch and figure they rely desperately on some of those old stuff, we can think about how to make it compatible for them in a different branch. Or guide them to do this on their own.
can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?
This specifically is not just a syntax change, but also the values change (ARM_CC_invalid = 0
is removed with ARMCC_UNDEF = 15
). Reversing this also means to:
CC
enums (for CS and LLVM)But there is little meaning in keeping this complexity (other then compatibility reasons of cause).
Because it takes longer than I expected, I suggest targeting upcoming LLVM 17.0 release with a few nice updates in ARMv9 and RISC-V extensions: https://discourse.llvm.org/t/llvm-17-0-0-release-planning-and-update/71762
https://llvm.org/docs/ReleaseNotes.html#non-comprehensive-list-of-changes-in-this-release
@XVilka In that way, should we merge #1949 after 17.x release?
Suggest to continue this topic at https://github.com/capstone-engine/llvm-capstone/pull/11
It would probably also make sense to remove now obsolete suite/synctools
as well, especially after AArch64 PR is merged.
@xen0n As we finished most of the planned architectures and refactorings, with instruction details being the major missing piece, it's a good time to start implementing other architectures as well e.g., LoongArch.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
Thanks, I will work on it.
@jiegec, as I noticed your PR in the Ghidra repository, you might also be interested in this particular project as Capstone is a popular library already used by many different projects, including QEMU; adding LoongArch support in it might be quite handy for many.
Got error running ASUpdater.py -a AArch64
:
INFO - Clean build directory
INFO - Generating Disassembler tables...
INFO - Generating AsmWriter tables...
INFO - Generating RegisterInfo tables...
INFO - Generating InstrInfo tables...
INFO - Generating SubtargetInfo tables...
INFO - Generating Mapping tables...
INFO - Generating SystemOperand tables...
INFO - Compile Cpp language
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp
INFO - creating /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src
INFO - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o
INFO - cc -fPIC -std=c99 -I/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src -c /home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.c -o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o
INFO - cc -shared /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/parser.o /tmp/tmpe_8kge8xtree_sitter_language/home/jiegec/capstone/capstone/suite/auto-sync/vendor/tree-sitter-cpp/src/scanner.o -o /home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so
INFO - Load language '/home/jiegec/capstone/capstone/suite/auto-sync/vendor/ts_cpp.so'
INFO - Unresolved template calls: dict_keys([b'unsigned UnscaledVal = MI->getOperand(OpNum).getImm();']). Patch them by hand!
INFO - Translate '/home/jiegec/capstone/capstone/suite/auto-sync/llvm-capstone/llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp'
Traceback (most recent call last):
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 228, in <module>
Updater.update()
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 141, in update
self.translate()
File "/home/jiegec/capstone/capstone/suite/auto-sync/./Updater/ASUpdater.py", line 118, in translate
translator.translate()
File "/home/jiegec/capstone/capstone/suite/auto-sync/Updater/CppTranslator/CppTranslator.py", line 390, in translate
query: Query = self.ts_cpp_lang.query(pattern)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jiegec/capstone/capstone/venv/lib/python3.11/site-packages/tree_sitter/__init__.py", line 93, in query
return _language_query(self.language_id, source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Invalid syntax at offset 43
@jiegec I think this one might help: https://github.com/capstone-engine/capstone/pull/2255
@jiegec Added you in the list above for LoongArch. Let me know if you want to be removed again.
@jiegec Added you in the list above for LoongArch. Let me know if you want to be removed again.
@Rot127 As discussed in the rizin Mattermost I have some free time and would like to update the BPF support to auto-sync :raised_hands:
Great! Please go ahead. Start with the documentation and let me know if you need help with something. If something is not clearly written or needs clarification in you opinion please let me know as well. The docs haven't been read by many people yet. So any fresh looks at it are welcome.
Also notify me when you have a fork pushed and a draft PR opened. So we can link it here. Draft PR is preferred, because we can comment on it more easily.
@jiegec Just wanted to ask quickly, how it is going with LoongArch? Please be aware of https://github.com/capstone-engine/llvm-capstone/pull/45 when developing further. My plan is to merge it after ARM
, AArch64
and PPC
are updated here in the Capstone repo. But better use this https://github.com/capstone-engine/llvm-capstone/pull/45 for further development, since there were almost certainly changes to LoongArch since LLVM 16.
@jiegec Just wanted to ask quickly, how it is going with LoongArch? Please be aware of capstone-engine/llvm-capstone#45 when developing further. My plan is to merge it after
ARM
,AArch64
andPPC
are updated here in the Capstone repo. But better use this capstone-engine/llvm-capstone#45 for further development, since there were almost certainly changes to LoongArch since LLVM 16.
Sorry, I have forgotten this thing after a long spring vacation.. I will continue to work on it.
- The memory offset register or immediate are now always part of the memory operand. Offsets or index operands are no longer separated. Before, only offset ops which were within the
[]
brackets were added.
It's a great change, but it isn't coherent with the AArch64 architecture in the current state of the next
branch:
$ cstool -d aarch64 "01 a4 40 f8"
0 01 a4 40 f8 ldr x1, [x0], #0xa
ID: 621 (ldr)
op_count: 3
operands[0].type: REG = x1
operands[0].access: WRITE
operands[1].type: MEM
operands[1].mem.base: REG = x0
post-indexed: true
operands[1].access: READ
operands[2].type: IMM = 0xa
operands[2].access: READ
Write-back: True
Registers read: x0
Registers modified: x0 x1
Some time ago, at the commit 77710a8100a5b7a6d9a0bf4cbfb681b50c8356b4 (the commit is unrelated, it's just the commit I tested with previously), there were only two operands like in ARM with the immediate being a displacement on the memory operand.
I didn't know if I should open an issue about that, but being coherent for post-index incrementation is probably best.
@Quentin01 Yes please open an issue about these in the future. This one I fix now like this.
Note to x86:
x86
is not part of this list, because we can not generate all tables in C. Refer to https://github.com/capstone-engine/llvm-capstone/issues/13 for details.Note about changes introduced with
auto-sync
: For a preview what changes will come inv6
, please take a look at the WIP release guide.This issue tracks the
auto-sync
refactoring and implementation effort of architecture modules.The table below lists the responsible developers for each architecture.
In progress
v6
v6
v6
.td edits upstreamed
Most LLVM
td
files miss some information about instructions (memory read/writes, operands incorrectly assigned as in/out etc.). Since we rely on this we need to fix it. Those fixes should be upstreamed to LLVM.Alhpa(no longer maintained)Done
v6
release v3.0
)v6
v6
v6
v5
v6
v6
v6
v6
v6
Arch extensions
Adding CPU extensions which are not part of upsteram LLVM is easier now. Here are they tracked.
Effort level of not refactored/implemented archs
td
files faultytd
files faultyNote to RISC-V: RISC-V will not be generated via LLVM because the LLVM architecture definitions are not precise enough for our use case. Instead, a SAIL based generator will be used (https://github.com/capstone-engine/capstone/issues/2392).
Legend
Number of operand groups
: Operand groups which have a distinctprint
functions. Indicates effort to implement the LLVM <-> CS mapping code (fillcs_detail
and the like).Generates
:inc
files generate with most recent backends.Note
: Worthy to note.Implementation type
: Refactor current implementation or implement new arch module.Difficulty level
: Guessed difficulty of this arch (base on points above and complexity like number of instructions etc.). Though "Easy" still means you have to familiarize yourself how LLVM definitions and the updater work. My guess is it will take at least a week of work.Getting started
auto-sync
documentation to learn how to refactor or implement an architecture withauto-sync
TODO for refactored archs
List of missing things which should be done before
v6
to get a nice round package.Capstone
ASUpdater.py
instructions.assert
version and add the asserts to the LLVM files again.CAPSTONE_DIET
.name2id
docs. Parametermax
should be changed to table size and in the loop bemax - 1
CAPSONE_DIET
).PPC
instruction formats on the public interfaceLLVM revisions
pop
alias (d64f74921cc5215bbd5d0252e710dc6f8b08da9a)FeatureAll
checkpredicates
Auto-Sync
refactor
setting toauto-sync
updater.Backends
ARM
post_index
when base regsister is tied. Just to make sure to hit every case.Encoding infoPPC
Encoding infoAArch64
Encoding info