cgsecurity / testdisk

TestDisk & PhotoRec
https://www.cgsecurity.org/
GNU General Public License v2.0

Kaitai-powered Photorec #33

Open KOLANICH opened 7 years ago

KOLANICH commented 7 years ago

Kaitai Struct is a declarative language for describing binary file formats. The Kaitai Struct compiler generates parsers for formats described by ksy definitions.

1. Signatures can be harvested from ksy files: just find the fields with a "contents" property at a fixed position (the Kaitai Struct compiler precomputes offsets for every field).
2. ksy descriptions can be compiled into some interpreted language (Lua, JS, Python, or a byte-code target, which is yet to be developed), which photorec can then use to check the format of files.
   a) For now we can check that enum fields have valid values and that the sizes of nested structures are in agreement with each other.
   b) Checksum verification is yet to be developed.
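The signature harvesting idea in point 1 could be sketched roughly as follows. This is only an illustration, not how the Kaitai Struct compiler computes offsets: it assumes the ksy file has already been loaded into a Python dict (for example with a YAML parser), walks the top-level `seq` naively, and stops at the first field whose size it cannot determine. The `GIF_LIKE` spec and all function names are made up for the example.

```python
import re

# Built-in fixed-width Kaitai Struct types such as u1, u2le, s4be, f8.
_FIXED_TYPE = re.compile(r"^(?:[us]([1248])|f([48]))(?:le|be)?$")


def contents_to_bytes(contents):
    """Normalize a ksy "contents" value (string or list of ints/strings)."""
    if isinstance(contents, str):
        return contents.encode("utf-8")
    out = bytearray()
    for item in contents:
        if isinstance(item, int):
            out.append(item)
        else:
            out += str(item).encode("utf-8")
    return bytes(out)


def fixed_size(field):
    """Return the field size in bytes if it is statically known, else None."""
    if "contents" in field:
        return len(contents_to_bytes(field["contents"]))
    size = field.get("size")
    if isinstance(size, int):
        return size
    match = _FIXED_TYPE.match(str(field.get("type", "")))
    if match:
        return int(match.group(1) or match.group(2))
    return None


def harvest_signatures(spec):
    """Collect (offset, bytes) pairs for "contents" fields at fixed offsets."""
    signatures = []
    offset = 0
    for field in spec.get("seq", []):
        # Conditional or repeated fields make every later offset variable.
        if "if" in field or "repeat" in field:
            break
        if "contents" in field:
            signatures.append((offset, contents_to_bytes(field["contents"])))
        size = fixed_size(field)
        if size is None:
            break
        offset += size
    return signatures


# Hypothetical spec, written as the dict a YAML loader would return.
GIF_LIKE = {
    "meta": {"id": "gif_like"},
    "seq": [
        {"id": "magic", "contents": "GIF"},
        {"id": "version", "size": 3},
        {"id": "width", "type": "u2le"},
        {"id": "marker", "contents": [0x2C]},
        {"id": "rest", "size-eos": True},
        {"id": "trailer", "contents": [0x3B]},
    ],
}

print(harvest_signatures(GIF_LIKE))
```

The trailer is not reported because it follows a field of unknown size, so its position is not fixed relative to the start of the file.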

KOLANICH commented 6 years ago

@timofonic, you cannot use KS definitions at runtime for now (we don't have a bytecode target yet); they need to be compiled first. There are several options. You can dynamically compile them into JavaScript and use that. Or you can compile them into C++ or Rust dynamic libraries (though this does not work out of the box) and load them with dlopen. All you need is to parse a format: if it parses without an error, the file may be undamaged, and in that case you will likely get a correct size (this is not guaranteed; it depends on the spec, which may be crafted in a way that consumes the whole stream). We still don't have checksum and other verification; you may want to read the bug tracker, there are a lot of tickets about planned features there.
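To illustrate the "parse it, and if nothing fails, take the stream position as the size" idea, here is a minimal sketch. It does not use a real compiled Kaitai Struct parser; `parse_toy` is a hand-written stand-in for one, and the "TOY1" format (magic, record count, length-prefixed records) is invented for the example.

```python
import io
import struct


class ParseError(Exception):
    """Raised when the data does not match the (hypothetical) format."""


def read_exact(stream, n):
    """Read exactly n bytes or fail, as a generated parser would on EOF."""
    data = stream.read(n)
    if len(data) != n:
        raise ParseError("unexpected end of stream")
    return data


def parse_toy(stream):
    """Parse the invented TOY1 format and return the end position."""
    if read_exact(stream, 4) != b"TOY1":
        raise ParseError("bad magic")
    (count,) = struct.unpack("<I", read_exact(stream, 4))
    for _ in range(count):
        (length,) = struct.unpack("<H", read_exact(stream, 2))
        read_exact(stream, length)
    return stream.tell()


def recovered_size(blob):
    """Return the file size implied by a successful parse, or None."""
    try:
        return parse_toy(io.BytesIO(blob))
    except ParseError:
        return None


good = b"TOY1" + struct.pack("<I", 2) + struct.pack("<H", 3) + b"abc" + struct.pack("<H", 0)
print(recovered_size(good + b"trailing disk garbage"))
print(recovered_size(good[:10]))
```

Because this format carries explicit lengths, the parser stops before the trailing garbage. A spec that ends with a "read until end of stream" field would instead swallow everything, which is the caveat mentioned above.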

KOLANICH commented 6 years ago

Does it have some similarities with DFDL?

Thank you for the link. We definitely need to look at this.

At first the project reminded me of TrID and its definition list, but I'm probably wrong. It seems to only save the headers to recognize formats, and I'm not sure about the full file structure.

Some file formats are potentially endless sequences of frames (I mean that the header doesn't contain a count of frames). To recover them, full parsing is needed.
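For such frame-sequence formats, recovery amounts to walking frame by frame until something stops looking like a frame. A rough sketch, assuming an invented frame layout (a 2-byte sync marker followed by a big-endian u2 payload length); the `SYNC` value and `walk_frames` are hypothetical and do not correspond to any real format or to photorec code:

```python
import struct

SYNC = b"\xff\xf1"  # hypothetical frame marker, chosen for the example


def walk_frames(blob, start=0):
    """Return (frame_count, end_offset) of the run of valid frames at start."""
    pos = start
    frames = 0
    while pos + 4 <= len(blob) and blob[pos:pos + 2] == SYNC:
        (length,) = struct.unpack_from(">H", blob, pos + 2)
        end = pos + 4 + length
        if end > len(blob):
            break  # truncated frame: keep only the complete ones
        pos = end
        frames += 1
    return frames, pos


frame = SYNC + struct.pack(">H", 3) + b"abc"
print(walk_frames(frame * 3 + b"\x00\x00junk"))
```

The end offset of the last complete frame is the best available estimate of the file size, since no header field states it.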

Dumb header matching here is for custom formats (I had to write a Python script to postprocess the files recovered this way, using code derived from a Kaitai Struct definition; I guess this feature should be built into photorec). All the smarter recovery is implemented in C here, for example https://github.com/cgsecurity/testdisk/blob/master/src/file_gif.c

KOLANICH commented 6 years ago

Would it be useful even for recovering stuff such as InnoDB tables? What about Windows Event Log files?

Not immediately: you would have to create (or convert; I am working on a converter from Synalysis formats, and they have InnoDB) specifications for that first.

The formats library is https://github.com/kaitai-io/kaitai_struct_formats , everything having a signature there should be useful for photorec.