bsdphk / PyReveng3

Python based Code Reverse Engineering tool -- Take 3

Naive proposal for avoiding duplication of effort ;-) #1

Closed pfalcon closed 5 years ago

pfalcon commented 6 years ago

I know it never works like that, but one can only try, right?

So, I went running "git pull" through my dump of reverse engineering projects, and noticed that among many dead projects, yours is still going strong. As you guessed, I have the same problem, only more so: I now have 3 times more commits and a bunch more code for program analysis. There are differences too:

Intermediate Language

This is not exactly LLVM's IL. We use it as a "generic" assembly language for human consumption, not as precise directives to code generators. However, it is the intent that conversion to proper LLVM IL be a trivial exercise in dumb text-processing

So, my project, https://github.com/pfalcon/ScratchABlock, explicitly rejects LLVM IR as something not really human-readable. Nor is it trivial to produce LLVM IR, unless of course you mean to convert your registers to memory variables and load/store them. That's definitely doable (probably not the dumbest of all the dumb text-processing), and then LLVM will indeed take care of converting your code to SSA form. Otherwise that needs to be done by you, and SSA conversion isn't exactly trivial.
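For illustration, that trick applied to something like B = B + C would come out roughly so (a hand-written sketch, not the output of either tool), after which LLVM's mem2reg pass lifts the stack slots into SSA form:

%B = alloca i8              ; one stack slot per CPU register
%C = alloca i8
%0 = load i8, i8* %B
%1 = load i8, i8* %C
%2 = add i8 %0, %1
store i8 %2, i8* %B         ; B = B + C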

What can be more human readable than C? That's the idea behind ScratchABlock's IL. It's more or less described in https://github.com/pfalcon/ScratchABlock/blob/master/docs/PseudoC-spec.md . Examples of decompilation can be found in https://github.com/pfalcon/ScratchABlock/tree/master/tests/decomp .

https://github.com/pfalcon/ScratchABlock/tree/master/tests/decomp/sub_5fac in particular was contributed by a guy who wrote a PowerPC -> PseudoC converter. (The rest of the examples, a growing set, come from my Xtensa work.)
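For a rough flavor, here is a made-up fragment in the spirit of that spec (the PseudoC-spec.md document linked above is authoritative on the actual syntax):

loop:
$a3 = *(u32*)($a2 + 4)
$a3 = $a3 + 1
*(u32*)($a2 + 4) = $a3
if ($a3 != 0) goto loop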

Well, I'd be really interested to hear from someone trying it for the "HP nanoprocessor" and other CPUs I've never heard of. Feel free to give it a thought ;-).

Happy New Year!

bsdphk commented 6 years ago

100% agreement on the "hard to produce" and "not readable" of LLVM IL, but that's not really the fish I care about.

I will, however, also note that for arcane old computers, which are my real target, the assembly code can be far worse.

For instance, like this, from the 42-bit GIER computer (http://ddhf.dk/wiki/GIER):

a10:pp_
1_
7,grna12
pmpc13,tln7
ck−10,grc5
tln12,grc11
tln20,grc4
pmp19c13,tln7

The main reason I want LLVM is so that I don't need to write (for instance) a "function-recognizer" for every single architecture, but can write a single one which recognizes all sequences of code that match the abstract pattern we think of as a function (save return address, do stuff, jump to saved return address).
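(The pattern is simple enough to sketch in a few lines of Python over a hypothetical IL instruction list; none of the names below are PyReveng's actual API:)

def find_functions(insns):
    # Illustrative only: each 'ins' is assumed to have .op and .args.
    funcs, start = [], None
    for i, ins in enumerate(insns):
        if ins.op == "store" and "retaddr" in ins.args:
            start = i                 # saving the return address opens a candidate
        elif ins.op == "jump" and "retaddr" in ins.args and start is not None:
            funcs.append((start, i))  # jumping via the saved address closes it
            start = None
    return funcs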

The secondary reason is that as you go back in time, you find many "two-level" programs: one part of the program implements an interpreter in assembly, and the rest of the program is byte-code for this interpreter (CHIP8 is an example of this). I would like to be able to disassemble both levels at the same time (but I don't expect to automatically infer the interpreted language; you will have to type in a disassembler for that as well).

Finally, I want to be able to deal with really weird shared binaries, where you have multiple heterogeneous instruction streams mingled.

One example, although relatively trivial, is the CBM 1541 floppy drive, which has two CPUs running out of the same EPROM, but there are worse things out there: semi-autonomous math processors, I/O processors, vector display engines, etc.

As you noticed, I don't produce actual LLVM IL, but the intent is that it can be converted into "real" LLVM IL automatically.

The reason I don't go straight for LLVM IL is that it is a high priority that it should be easy to add a new (old!) architecture by reading through the technical manual for the CPU, and then to be able to get machine assistance to understand and curate the program. (I'm coming at this from a data-archaeology angle, as you can hear.)

(I'm quite proud of the "just type in the tables from the manual" disassembler; so far that is probably the most important part of all this.)
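(To illustrate the principle, not PyReveng's actual table syntax: a decoder driven by lines typed straight from an invented manual's opcode chart can be as small as this Python sketch, where letters in a bit pattern name an operand field:)

TABLE = [
    ("LDA", "01aaaaaa"),   # hypothetical 8-bit CPU, 6-bit address field
    ("STA", "10aaaaaa"),
    ("JMP", "11aaaaaa"),
    ("NOP", "00000000"),
]

def decode(byte):
    for mnem, pat in TABLE:
        fields = {}
        for bitpos, p in enumerate(pat):
            bit = (byte >> (7 - bitpos)) & 1
            if p in "01":
                if bit != int(p):
                    break              # literal bit mismatch; try next row
            else:
                fields[p] = fields.get(p, 0) << 1 | bit
        else:
            return mnem, fields        # all pattern bits matched
    return None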

Now that I've gotten some experience with producing the IL from instructions, there are some patterns I spot which can and will be automated.

For instance, rotates and shifts through carry bits are annoying, and it should be possible to just write "pyreveng.shift.left(1, i1 %C, i8 %AH, i8 %AL, i1 0)" and get that expanded to "real IL" automatically.
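(One plausible hand-written expansion for that example: treat C:AH:AL as one 17-bit value, shift it, and split it back up. Since the trailing in-bit argument is 0, a plain shl supplies it; a rotate through carry would OR the old %C in at the bottom instead:)

%0 = zext i8 %AH to i16
%1 = shl i16 %0, 8
%2 = zext i8 %AL to i16
%3 = or i16 %1, %2          ; concatenate AH:AL
%4 = zext i16 %3 to i17
%5 = shl i17 %4, 1          ; the shift; a 0 enters at the bottom
%AL.1 = trunc i17 %5 to i8  ; split the result back up
%6 = lshr i17 %5, 8
%AH.1 = trunc i17 %6 to i8
%7 = lshr i17 %5, 16
%C.1 = trunc i17 %7 to i1   ; the bit shifted out of AH lands in carry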

Another sore point is decimal arithmetic; I haven't really started poking at that yet.

I have not entirely made up my mind about register aliasing (%BC vs %B and %C) which crops up in every single flag register in every single processor, but I have some ideas. Again, I want it to be easy to express intent when you type in the instruction set, and then have code do the tedious work of splitting/collapsing registers.
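(The mechanical half of that is easy to sketch in Python; every name here is invented for illustration, it is neither project's real code:)

ALIASES = {"BC": ("B", "C")}    # pair register -> (high half, low half)

def write_reg(state, reg, val):
    # Keep aliased views of the register file in sync on every write.
    state[reg] = val
    if reg in ALIASES:          # writing BC updates B and C
        hi, lo = ALIASES[reg]
        state[hi], state[lo] = val >> 8, val & 0xFF
    else:                       # writing a half rebuilds any pair containing it
        for pair, (hi, lo) in ALIASES.items():
            if reg in (hi, lo):
                state[pair] = (state.get(hi, 0) << 8) | state.get(lo, 0)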

I did consider producing C code, but while it is more readable to humans, it is a nightmare to analyse with programs, so I decided to push one level in between and use LLVM IL, hoping that some or all of the analysis tools people write for it can be exploited, including, hopefully, something which emits C from the LLVM IL.

The problem with all attempts to "level up" the language is that you have to recognize the top-level structure, in particular the functions; otherwise you just end up with a nightmare main() full of gotos.

As long as the code you work on is the output of a compiler for a "structured programming language" (PASCAL, C, etc.), that is very simple, but once you go further back in time you run into things like "Wheelgol", "Jensen's Device" (look it up!) and hand-written, highly optimized assembler code.

So all in all, I'm pretty comfortable with my choice of LLVM IL for now, but if my hopes of leveraging LLVM IL tools do not pan out, I will probably ditch it again.

pfalcon commented 6 years ago

Thanks for the detailed response. Well, my motivation is simple: I "claim" that PseudoC has enough generality and expressiveness to represent any instruction set (and any sane set in a human-readable form), but of course that needs testing. I'm unlikely to look into 16-bit and below any time soon, so seeing that you're working on it, that could be nice feedback. On your side, it could let you play with decompilation as a low-hanging fruit. If that sounds useful in the future, please be my guest ;-).

pfalcon commented 6 years ago

I have not entirely made up my mind about register aliasing (%BC vs %B and %C) which crops up in every single flag register in every single processor, but I have some ideas.

Well, there are a couple of approaches, and a user just needs to select whichever better suits a particular use case. Testing and trying that would be interesting in ScratchABlock; all the framework is there, but so far I work only with RISC, which doesn't have this problem. So again, it would be interesting to see someone try it.

then have code do the tedious work of splitting/collapsing registers.

Yeah, that's exactly what ScratchABlock does.

pfalcon commented 6 years ago

Finally, I want to be able to deal with really weird shared binaries, where you have multiple heterogeneous instruction streams mingled.

Well, if there really are different CPUs, I'd say the approach should be to pre-split the binary and process each chunk separately.

But it's funny that you mention it: the problem is actually more acute when a single CPU supports more than one instruction set and can switch between them dynamically. Like ARM and Thumb. Or MIPS and microMIPS. And hell, I don't even want to think of the existence of a design with more than 2 ISAs. Anyway, that's exactly what I'm fighting with literally "today" (these days): adding support for such modes to my other tool, https://github.com/pfalcon/ScratchABit . Adding just ARM or just Thumb mode was easy, but being able to work with both in the same project requires some redesign...
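(In the abstract, the fix boils down to keying the disassembly queue on (address, ISA) pairs instead of bare addresses; a toy Python sketch, not ScratchABit's actual structure:)

def disassemble(entry_points, decoders):
    # 'entry_points' holds (addr, isa) pairs; 'decoders' maps an isa
    # name (e.g. "arm", "thumb") to a decode function for that mode.
    seen, queue = set(), list(entry_points)
    while queue:
        addr, isa = queue.pop()
        if (addr, isa) in seen:
            continue
        seen.add((addr, isa))
        insn = decoders[isa](addr)
        # Successors carry their own ISA, so a BX-style branch can switch modes.
        for target, next_isa in insn.successors():
            queue.append((target, next_isa))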

bsdphk commented 6 years ago

Splitting heterogeneous binaries up front only really works for "fat binaries" which contain the same source compiled for N architectures, and as interesting as "differential reverse engineering" might be, it's not on my TODO list any time soon :-)

The usual case is two different CPUs sharing a task: one CPU running assembler plus some kind of interpreted language, or a CPU plus some kind of co-processor (FP/vector/graphics/crypto/network).

And it's not just "16-bit and below": the GIER computer I mentioned above is 40, 20/20 or even 42 bits, depending on how you look at it :-)

I think the biggest difference is that the primary goal of PyReveng is to make it easy to add an instruction decoder for any architecture, no matter how weird.

(Ever looked at the RCA1802? It has sixteen 16-bit registers, and a separate 4-bit register designates one of them as the program counter. Wonderful for co-routines, horrible for disassemblers.)

The usual approach is either to focus on a few architectures and stuff any amount of complexity into the actual instruction decoder if it can improve subsequent analysis, or to only disassemble and do no analysis at all.

pfalcon commented 6 years ago

And it's not just "16-bit and below": the GIER computer I mentioned above is 40, 20/20 or even 42 bits, depending on how you look at it :-)

Well, nothing in PseudoC (the spec) is tied to the bitness of the CPU. ScratchABlock (a particular app for processing it) may be a different story; I'd definitely love to have support for completely arbitrary bitnesses, but before embarking on that I'd prefer to bite off 16 bits for starters, hence my comment.

I think the biggest difference is that the primary goal of PyReveng is to make it easy to add an instruction decoder for any architecture, no matter how weird.

It doesn't have to be a difference, because it isn't. It can be a way for our projects to complement each other, because ScratchABlock by design is not concerned with "instruction decoding"; that's outside its scope. It takes a pre-made PseudoC file and performs symbolic transformations on it. That's the whole nature of my proposal: if your tool generated PseudoC output, you could watch it transformed into something cute(r) by my tool. (And I would get more users to test my cute ideas ;-) )

Ever looked at the RCA1802?

No. And never will :-I. (Life's too short for RE, and I hack only on 32-bit CPUs.) Likewise, I'm not sure you've looked in enough detail into the peculiarities of LLVM IR and program transformation. I did, and like many other parties (https://github.com/pfalcon/ScratchABlock/blob/master/docs/ir-why-not.md) decided that sticking to LLVM IR leads to more issues than unsticking from it.

The usual approach is either to focus on a few architectures

Exactly. I'm focusing on decompilation, and have to stick with one architecture at a time (where time == a few years). That means I'm missing the chance to test the generality and real-worldness of my approach with other archs, CPUs, and use cases. You focus on a few obscure archs, and arguably would require a lot more effort to proceed further to full-fledged decompilation. One could see an opportunity for cooperation here ;-).

bsdphk commented 5 years ago

Sorry for not coming back to this sooner...

I'm all for collaboration and cooperation, and I'm certainly not married to LLVM IR in any way.

As you may have seen, I have only attempted the IR on a few CPUs so far, trying to get a feel both for the process of adding it to a disassembler and for the utility of the resulting output.

Once I return to that side of the project, the plan is to do IR for a really odd-ball architecture, because that is the real acid test for my purposes: if IR cannot handle that sanely, IR will be out of the project again.