airbus-seclab / cpu_rec

Recognize cpu instructions in an arbitrary binary file
Apache License 2.0

Corpus needs more 6502 samples #3

Closed: trou closed this issue 6 years ago

trou commented 6 years ago

It fails to recognize the following files as 6502 code:

LRGH commented 6 years ago

I agree, and it is documented: this architecture is named #6502#cc65 instead of 6502. In cpu_rec.py you can read:

   # 6502 binary compiled with https://github.com/cc65/cc65
   # This appears to be more compiler-dependent than CPU-dependent, the
   # statistics are very different from an AppleII ROM, for example.

and the paper published at SSTIC says: "the 6502 code produced by https://github.com/cc65/cc65 is characteristic of the compiler more than of the CPU".

If you can provide a sufficiently large amount of 6502 code that would be characteristic of this CPU, I can add it to the corpus.

The code from the Atredis challenge could be a good starting point: according to http://www.msreverseengineering.com/blog/2018/7/24/the-atredis-blackhat-2018-ctf-challenge it has some characteristic sequences of instructions, e.g. LDA #0 followed by RTS. But if I add it (e.g. by copying https://raw.githubusercontent.com/RolfRolles/Atredis2018/master/MemoryDump/data-4000-efff.bin to cpu_rec_corpus/#6502#Atredis.corpus), osi_bas.bin is not recognized as being 6502. This is because the copied file contains too much non-code data: the text at its start and large chunks of zeroes. Therefore you should extract from data-4000-efff.bin only the chunks containing 6502 code. But the resulting 6502 corpus is small, and probably not sufficient to characterize this CPU.
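To check whether a candidate file contains the LDA #0 / RTS sequence mentioned above, a minimal sketch could look like the following. It relies on the standard 6502 encodings (LDA immediate is opcode 0xA9, RTS is 0x60); the function name `find_pattern` is hypothetical and not part of cpu_rec.

```python
from typing import List

# LDA #$00 assembles to A9 00 and RTS to 60 on the 6502.
LDA_ZERO_RTS = bytes([0xA9, 0x00, 0x60])


def find_pattern(data: bytes, pattern: bytes = LDA_ZERO_RTS) -> List[int]:
    """Return every offset at which `pattern` occurs in `data` (hypothetical helper)."""
    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        offsets.append(pos)
        # Restart one byte further so that overlapping matches are not skipped.
        pos = data.find(pattern, pos + 1)
    return offsets
```

A file with many hits clustered in one region is a hint about where the code chunks are, but a three-byte pattern can also appear by chance in data, so this is only a rough indicator.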

My criterion for being happy and adding a new architecture to the corpus is that I can learn this architecture on some file, and recognize this architecture in another file from a completely different source. The issue with 6502 is that my biggest source (the Apple II ROMs) is not free and therefore cannot be included in the published corpus.
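The "learn on one file, recognize in another" criterion can be illustrated with a toy classifier. This is a simplified stand-in and not cpu_rec's actual model: it counts byte bigrams per training file and scores an unseen sample by cross-entropy, which ranks models the same way as Kullback-Leibler divergence for a fixed sample. All names and the smoothing constant are assumptions made for the illustration.

```python
import math
from collections import Counter
from typing import Dict


def bigram_counts(data: bytes) -> Counter:
    """Count consecutive byte pairs in `data`."""
    return Counter(zip(data, data[1:]))


def cross_entropy(sample: bytes, model: Counter, alpha: float = 0.01) -> float:
    """Average -log2 P_model(bigram) over the sample, with additive smoothing."""
    observed = bigram_counts(sample)
    n = sum(observed.values())
    if n == 0:
        raise ValueError("sample must contain at least two bytes")
    total = sum(model.values())
    vocabulary = 256 * 256  # number of possible byte bigrams
    score = 0.0
    for bigram, count in observed.items():
        p = (model.get(bigram, 0) + alpha) / (total + alpha * vocabulary)
        score -= count * math.log2(p)
    return score / n


def classify(sample: bytes, models: Dict[str, Counter]) -> str:
    """Return the name of the model under which the sample is least surprising."""
    return min(models, key=lambda name: cross_entropy(sample, models[name]))
```

Under this sketch, the criterion amounts to building a model from one source and checking that `classify` picks it for a file from an unrelated source.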

LRGH commented 6 years ago

It seems that https://raw.githubusercontent.com/RolfRolles/Atredis2018/master/MemoryDump/data-4000-efff.bin contains only slightly more than 1300 bytes of 6502 code (starting at position 0x4000 in this file, which is the memory address 0x8000). Nevertheless, I have added this data to the default corpus, under the name 6502, because it is sufficient to recognise osi_bas.bin and APPLE.ROM as being 6502. The result is:
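Since file position 0x4000 corresponds to memory address 0x8000, the dump appears to start at address 0x4000. Under that assumption, a minimal sketch for cutting out a code chunk and converting between offsets and addresses could be as follows; the helper names are hypothetical, and the exact chunk length is left to the caller because it is only given approximately above.

```python
DUMP_BASE_ADDRESS = 0x4000  # assumption: file offset 0 maps to memory address 0x4000


def offset_to_address(offset: int, base: int = DUMP_BASE_ADDRESS) -> int:
    """Map a file offset to the memory address it was dumped from."""
    return base + offset


def address_to_offset(address: int, base: int = DUMP_BASE_ADDRESS) -> int:
    """Map a memory address back to its offset in the dump file."""
    if address < base:
        raise ValueError("address is below the start of the dump")
    return address - base


def extract_chunk(data: bytes, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset`, refusing out-of-range requests."""
    if offset < 0 or length < 0 or offset + length > len(data):
        raise ValueError("requested chunk is outside the file")
    return data[offset:offset + length]
```

For example, `extract_chunk(dump, address_to_offset(0x8000), n)` would return the first `n` bytes of the code region, which could then be written to a separate corpus file.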

Target File:   corpus/6502/data-4000-efff.bin
MD5 Checksum:  827998bbc4a941b52b8e19b1f2724bd7

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x4000, entropy=0.043476)
16384         0x4000          6502 (size=0x400, entropy=0.691858)
17408         0x4400          None (size=0x6c00, entropy=0.018228)

Target File:   corpus/6502/osi_bas/osi_bas.bin
MD5 Checksum:  b331075b878624bfa65757677f01ea87

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             6502 (size=0x1c00, entropy=0.877842)
7168          0x1C00          None (size=0x2400, entropy=0.131216)

Target File:   corpus/6502/APPLE.ROM
MD5 Checksum:  58ddc617555e2fc242b20e7f86165ab2

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x800, entropy=0.801389)
2048          0x800           6502 (size=0x3600, entropy=0.902971)

More 6502 samples would be useful, but at least with what you have provided, there is some 6502 recognition available.
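To compare such results across files, the rows of the reports above can be tallied per label. The sketch below assumes only the row layout visible in these reports (decimal offset, hexadecimal offset, label, then size and entropy in parentheses); the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches rows such as: "2048          0x800           6502 (size=0x3600, entropy=0.902971)"
ROW = re.compile(
    r"^(\d+)\s+0x[0-9A-Fa-f]+\s+(\S+)\s+"
    r"\(size=0x([0-9A-Fa-f]+),\s*entropy=([0-9.]+)\)$"
)


def parse_rows(text: str) -> List[Tuple[int, str, int, float]]:
    """Return (offset, label, size, entropy) for every result row in `text`."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            offset, label, size, entropy = match.groups()
            rows.append((int(offset), label, int(size, 16), float(entropy)))
    return rows


def bytes_per_label(rows: List[Tuple[int, str, int, float]]) -> Dict[str, int]:
    """Sum the sizes of all regions carrying the same label."""
    totals: Dict[str, int] = {}
    for _, label, size, _ in rows:
        totals[label] = totals.get(label, 0) + size
    return totals
```

Applied to the APPLE.ROM report, this would attribute 0x3600 bytes to 6502 and 0x800 bytes to None.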

trou commented 6 years ago

https://www.von-bassewitz.de/cgi-bin/ftp-portal.pl?url=ftp://ftp.musoftware.de/pub/uz/cbm610/kernal610-orig.zip could be used as a sample. It's the original code running on a CBM610.

LRGH commented 6 years ago

Indeed, with the new data coming from the Atredis challenge, it is recognised as 6502.

Target File:   corpus/6502/kernal610-orig/kernal.bin
MD5 Checksum:  5d6f6428ff1c2a58225a04092621c7b6

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             6502 (size=0xa00, entropy=0.861012)
2560          0xA00           None (size=0x400, entropy=0.704522)
3584          0xE00           6502 (size=0xc00, entropy=0.845523)
6656          0x1A00          None (size=0x600, entropy=0.808449)

Despite its small size, the Atredis code works surprisingly well to recognise 6502. My tests did not find any non-6502 code that is recognised as 6502, but with such a small corpus there is a risk of false positives.