NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Simplify C header parser #1197

Open teaalltr opened 4 years ago

teaalltr commented 4 years ago

Parsing C headers into archives is a pain. A naive approach would be to parse the headers in a folder (recursing into each subfolder) in every possible order (i.e. changing the sort order of the headers) until parsing succeeds. Backtracking could also be used. Command line options could be handled the same way.

This could be implemented as an option (it may take some time to complete).
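
A minimal sketch of the brute-force ordering idea, in Python. `try_parse` is a hypothetical callback standing in for whatever actually invokes the parser; it is not a Ghidra API, and `max_attempts` is an assumed safety limit, since the number of orderings grows factorially with the number of headers.

```python
from itertools import permutations
from pathlib import Path
from typing import Callable, Optional, Sequence


def collect_headers(root: Path) -> list[Path]:
    """Recursively collect every .h file under root, in a stable order."""
    return sorted(root.rglob("*.h"))


def find_working_order(
    headers: Sequence[Path],
    try_parse: Callable[[Sequence[Path]], bool],
    max_attempts: int = 10_000,
) -> Optional[tuple[Path, ...]]:
    """Try header orderings until try_parse accepts one.

    try_parse is a hypothetical stand-in for a real parser invocation.
    Returns the first accepted ordering, or None if the attempt budget
    runs out before any ordering succeeds.
    """
    for attempt, order in enumerate(permutations(headers)):
        if attempt >= max_attempts:
            break
        if try_parse(order):
            return order
    return None
```

Even with a budget, this only stays practical for a handful of headers, which is why backtracking on the first failing header would likely be the better variant.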

astrelsky commented 4 years ago

This may not exactly be helpful, but at the moment I find it much more reliable and easier to pull the datatypes from the debug libraries and create datatype archives from the extracted DWARF information. Then again, this requires having access to the debug versions of the libraries, and it can only get the types that are used by the library itself.

emteere commented 4 years ago

If you choose just the directory that holds the set of header files to be imported, an algorithm attempts to select the root header files and adds them, in the correct order, to the set of files to parse.

Really it isn't much different than trying to find the correct order for include files to parse when writing code.
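
To illustrate what selecting root headers and ordering them can look like, here is a rough Python sketch. It is not Ghidra's actual algorithm: it assumes headers are given as a name-to-text mapping, only follows `#include "..."` directives that name another header in that mapping, and ignores conditional compilation entirely.

```python
import re
from graphlib import TopologicalSorter

# Matches local includes such as: #include "foo.h" (angle-bracket includes are ignored)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def include_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each header name to the headers in `sources` that it includes."""
    return {
        name: {inc for inc in INCLUDE_RE.findall(text) if inc in sources}
        for name, text in sources.items()
    }


def root_headers(graph: dict[str, set[str]]) -> list[str]:
    """Root headers are the ones that no other header in the set includes."""
    included = set().union(*graph.values())
    return sorted(name for name in graph if name not in included)


def dependency_order(graph: dict[str, set[str]]) -> list[str]:
    """Order headers so each one comes after everything it includes.

    Raises graphlib.CycleError if the includes are circular.
    """
    return list(TopologicalSorter(graph).static_order())
```

Parsing only the roots relies on the parser resolving their includes itself; the dependency order is the fallback when every file has to be listed explicitly.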

I agree with both comments: header files are a pain to deal with, and if you have good debug information, that can be a better source of data type information.

Parsing header files attempts to discover the values of defines from macros and add them as enums, which you lose when working from debug information. There are some changes planned to better recover from multiple definitions. Currently only the last definition for a define is kept.
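
The "last definition wins" behaviour can be pictured with a small Python sketch. This is a simplification under stated assumptions, not the parser's real logic: it only handles object-like `#define NAME <integer literal>` lines, skips anything it cannot read as a plain integer (suffixes such as `1U`, expressions, strings), and does no macro expansion.

```python
import re

# Object-like defines only: "#define NAME value". Function-like macros
# such as "#define SQ(x) ..." do not match because "(" follows the name.
DEFINE_RE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)[ \t]+(.+?)[ \t]*$",
    re.MULTILINE,
)


def collect_integer_defines(text: str) -> dict[str, int]:
    """Collect integer-valued defines; a later integer definition overwrites an earlier one."""
    values: dict[str, int] = {}
    for name, raw in DEFINE_RE.findall(text):
        try:
            # Base 0 accepts decimal and 0x-prefixed hex literals.
            values[name] = int(raw, 0)
        except ValueError:
            continue  # not a plain integer literal, so not an enum candidate here
    return values
```

The resulting name-to-value mapping is the kind of data that could be stored as enum members; the earlier value of a redefined name is simply lost.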

An additional change to pull unknown datatypes from any archive that is open at the time of the parse could help as well. If you were to compile code that included a header file, the correct pre-include files would be necessary as well.

astrelsky commented 4 years ago

It would be helpful to have special defines for things like abstract integer sizes. It is currently impossible to parse a struct with a bit field of type long wider than 32 bits, because even if the define for setting the long size is set, the parser still assumes long is 4 bytes and throws an error.

cmorin6 commented 4 years ago

A workaround that I found for parsing C header files is to use intermediate files.

Intermediate files can be generated using the -S -save-temps gcc options. These intermediate .i files are the result of the preprocessing stage: all macros are expanded and all header files are merged, in the right order, into a single file with some debug comments added.

Steps for the workaround

  1. Create a .c file containing only the #include statements that you would use if you were actually using the library in a C project.
  2. Run gcc as if you were building a standard C project with this library, plus the -S -save-temps options. This will generate a .i file.
  3. From there you have two options:
     3.1. Quick and dirty: use a script to remove the debug comment lines starting with "#" from the intermediate file. Rename this file to .h and import it into Ghidra as a regular header file. This will import everything, including standard library types, and all these types will be considered as coming from the same big header file.
     3.2. A cleaner alternative is to use the file path contained in each debug comment to reconstruct the header import tree. From there you can create a new header file structure that matches the original one, except that all macros are expanded, so Ghidra will import them much more easily. What is interesting with this approach is that, since you control the header import tree, you can alter it to include only header files from some directory (i.e. exclude standard library files and import only the user library's types). This also preserves the actual header file name for each imported type in the final archive. To get Ghidra to create archives with only new types while reusing the existing types from opened archives, I had to make the following changes: cparsing-reuse-open-archive.txt.
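
Options 3.1 and 3.2 can be sketched in Python roughly as follows. This is an illustration, not the script used above: it assumes the debug comments have the `# <line> "<file>" <flags>` layout that gcc emits in .i files, and the function names are made up for the example.

```python
import re
from collections import defaultdict

# A debug comment (linemarker) looks like: # 12 "/usr/include/stdio.h" 1 3 4
LINEMARKER_RE = re.compile(r'^#\s+(\d+)\s+"([^"]*)"')


def strip_linemarkers(preprocessed: str) -> str:
    """Option 3.1: drop every line that starts with '#'.

    Note that this also removes any #pragma lines left in the .i file.
    """
    return "".join(
        line
        for line in preprocessed.splitlines(keepends=True)
        if not line.startswith("#")
    )


def split_by_source_file(preprocessed: str) -> dict[str, list[str]]:
    """First step of option 3.2: group lines by the file named in the
    most recent linemarker. Blank lines and other '#' lines are skipped."""
    chunks: dict[str, list[str]] = defaultdict(list)
    current = None
    for line in preprocessed.splitlines():
        marker = LINEMARKER_RE.match(line)
        if marker:
            current = marker.group(2)
        elif line.startswith("#"):
            continue
        elif current is not None and line.strip():
            chunks[current].append(line)
    return dict(chunks)


def keep_under(chunks: dict[str, list[str]], prefix: str) -> dict[str, list[str]]:
    """Keep only files below one directory, e.g. to exclude system headers."""
    return {path: lines for path, lines in chunks.items() if path.startswith(prefix)}
```

Writing each entry of the filtered mapping back out under its original file name would give a header tree with macros already expanded; rebuilding the #include relationships between those files is left out of this sketch.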

Some issues I encountered

I still had to make some manual adjustments to my new header files for some edge cases.

(These issues aren't introduced by the workaround; they are the only remaining limitations of Ghidra's CPreParser after using the workaround.)

I used this workaround successfully to create archives for the fairly big GTK-2.0 library on Ubuntu 18.04.

kiler129 commented 3 years ago

I will say the current parser is not just insufficient, it's unusable. I tried to import even partial headers of Linux kernel types and gave up after 2 hours of fighting with errors like "error in line -20".