NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

ability to import doxygen xml output for datatype archive creation #1301

Open binlaveloos opened 4 years ago

binlaveloos commented 4 years ago

Is your feature request related to a problem? Please describe.

The current C source parser is very primitive and barely usable for creating datatype archives for complex object-oriented (and heavily templated) libraries such as crypto++ or libstdc++.

Describe the solution you'd like

The ability to import doxygen-generated XML files to create datatype archives. The doxygen source code parser is very mature and produces nicely structured XML output, even when the source code is not documented at all. As a bonus, many well-known open-source libraries have excellent doxygen-compatible comments in their sources for classes, methods, function arguments, and variables. Doxygen also includes full inheritance information, so it seems to me that the doxygen XML files are perfectly suited for datatype archive creation. We all know that writing a parser for complex object-oriented C++ sources is pretty hard, so why not leave the hard work to doxygen? It has been around for a long time, and lots of libraries already use it for their documentation.

Describe alternatives you've considered

I tried making datatype archives using the C parser and found it unusable for the things I tried it with. I also tried compiling the libraries with DWARF debug symbols. This works a lot better. The problem is the vast amount of work involved, and the fact that you have to set up a specific (virtual) compilation environment for each architecture/compiler combination. Symbol information generated from doxygen XML is, I think, much more platform/compiler independent than DWARF symbols. And for PDB symbols, you have the additional problem that it's a proprietary format.

astrelsky commented 4 years ago

This may be somewhat unrelated, but could you link me to more information about generating this information from doxygen, please? I've become intrigued.

binlaveloos commented 4 years ago

If you download the crypto++ library, for example, it uses doxygen and already ships with a doxygen configuration file called Doxyfile.

To enable the XML output format, change the line GENERATE_XML = NO to GENERATE_XML = YES.

Then simply run: doxygen Doxyfile and it creates an xml folder (inside the html folder) containing the XML files.

If the sources do not use doxygen for documentation, no problem: you can generate a default config (named Doxyfile_default) with: doxygen -g Doxyfile_default. For such sources, be sure to enable: EXTRACT_ALL = YES

Further options to consider enabling are:

EXTRACT_PRIVATE = YES
EXTRACT_PACKAGE = YES
EXTRACT_STATIC = YES
EXTRACT_LOCAL_METHODS = YES
EXTRACT_ANON_NSPACES = YES

These include private members of a class, static members of a file, local classes and methods, anonymous namespaces, etc.
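If you want to apply these settings to many libraries without hand-editing each Doxyfile, something like the following might work. It is only a sketch: the class name DoxyfileSettings and its apply method are made up for illustration, and it assumes the simple single-line KEY = VALUE form (no += appends, no line continuations).

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of doxygen or Ghidra.
public class DoxyfileSettings {

    /** Returns the Doxyfile text with each given KEY forced to its VALUE. */
    public static String apply(String doxyfile, Map<String, String> settings) {
        Set<String> seen = new LinkedHashSet<>();
        StringBuilder out = new StringBuilder();
        for (String line : doxyfile.split("\\R")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            // Only plain "KEY = VALUE" lines are considered; comments are left alone.
            if (eq > 0 && !trimmed.startsWith("#")) {
                String key = trimmed.substring(0, eq).trim();
                if (settings.containsKey(key)) {
                    line = key + " = " + settings.get(key);
                    seen.add(key);
                }
            }
            out.append(line).append('\n');
        }
        // Keys that were not present in the file are appended at the end.
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (!seen.contains(e.getKey())) {
                out.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("GENERATE_XML", "YES");
        settings.put("EXTRACT_ALL", "YES");
        settings.put("EXTRACT_PRIVATE", "YES");
        settings.put("EXTRACT_STATIC", "YES");

        // Hand-written stand-in for a real Doxyfile.
        String original = "# sample config\nGENERATE_XML           = NO\nEXTRACT_ALL            = NO\n";
        System.out.print(apply(original, settings));
    }
}
```

Write the result back to a file and run doxygen on it as described above.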

Quote from the manual: "The XML output consists of an index file named index.xml which lists all items extracted by doxygen with references to the other XML files for details. The structure of the index is described by a schema file index.xsd. All other XML files are described by the schema file named compound.xsd. If you prefer one big XML file you can combine the index and the other files using the XSLT file combine.xslt."
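To give an idea of how little code is needed to walk the index, here is a rough sketch using only the JDK's DOM parser. The class name DoxygenIndexReader is made up, and the sample XML in main is hand-written and heavily trimmed (including the refid values), so treat it as an approximation of real doxygen output. It assumes each compound entry's details live in a file named after its refid.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical reader for doxygen's index.xml; not an existing Ghidra or doxygen API.
public class DoxygenIndexReader {

    /** One <compound> entry from index.xml. */
    public static final class Compound {
        public final String kind;
        public final String name;
        public final String refid;

        Compound(String kind, String name, String refid) {
            this.kind = kind;
            this.name = name;
            this.refid = refid;
        }

        /** The per-compound XML file that holds the details. */
        public String detailFile() {
            return refid + ".xml";
        }
    }

    /** Text of the first direct child element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && tag.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    public static List<Compound> read(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        List<Compound> result = new ArrayList<>();
        Element root = doc.getDocumentElement();
        // Only direct <compound> children are collected; nested <member> entries are skipped.
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && "compound".equals(n.getNodeName())) {
                Element e = (Element) n;
                result.add(new Compound(e.getAttribute("kind"), childText(e, "name"), e.getAttribute("refid")));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<doxygenindex>"
          + "<compound refid=\"structpoint\" kind=\"struct\"><name>point</name>"
          + "<member refid=\"structpoint_1a0\" kind=\"variable\"><name>x</name></member>"
          + "</compound>"
          + "<compound refid=\"point_8h\" kind=\"file\"><name>point.h</name></compound>"
          + "</doxygenindex>";
        for (Compound c : read(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(c.kind + " " + c.name + " -> " + c.detailFile());
        }
    }
}
```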

doxygen creates an XXXcpp.xml file for every source file XXX.cpp
doxygen creates an XXXh.xml file for every header file XXX.h
doxygen creates a class_XXX.xml file per class
doxygen creates a namespace_XXX.xml file per namespace
doxygen creates a struct_XXX.xml file per struct

All items are referenced by a "refid" attribute (from index.xml), and most things are contained in a compounddef tag.
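As a rough illustration of what an importer would pull out of one of these per-struct files, the sketch below lists the non-static member variables of a compounddef as (type, name) pairs. The class name DoxygenStructFields is made up and the sample XML is hand-written and trimmed, not real doxygen output. Note that the XML carries declared types only, with no sizes or offsets, so a real importer would still have to resolve each type and compute the layout itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical extractor for doxygen compound XML; not an existing Ghidra or doxygen API.
public class DoxygenStructFields {

    /** Text of the first direct child element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && tag.equals(n.getNodeName())) {
                // getTextContent flattens nested <ref> links inside <type>.
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    /** Returns {type, name} for every non-static member variable, in document order. */
    public static List<String[]> fields(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        List<String[]> out = new ArrayList<>();
        NodeList members = doc.getElementsByTagName("memberdef");
        for (int i = 0; i < members.getLength(); i++) {
            Element m = (Element) members.item(i);
            boolean isVariable = "variable".equals(m.getAttribute("kind"));
            boolean isStatic = "yes".equals(m.getAttribute("static"));
            if (isVariable && !isStatic) {
                out.add(new String[] { childText(m, "type"), childText(m, "name") });
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String sample =
            "<doxygen><compounddef id=\"structline\" kind=\"struct\"><compoundname>line</compoundname>"
          + "<sectiondef kind=\"public-attrib\">"
          + "<memberdef kind=\"variable\" static=\"no\"><type><ref refid=\"structpoint\" kindref=\"compound\">point</ref></type><name>start</name></memberdef>"
          + "<memberdef kind=\"variable\" static=\"no\"><type>int</type><name>width</name></memberdef>"
          + "<memberdef kind=\"variable\" static=\"yes\"><type>int</type><name>count</name></memberdef>"
          + "</sectiondef>"
          + "<sectiondef kind=\"public-func\">"
          + "<memberdef kind=\"function\" static=\"no\"><type>void</type><name>draw</name></memberdef>"
          + "</sectiondef>"
          + "</compounddef></doxygen>";
        for (String[] f : fields(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(f[0] + " " + f[1]);
        }
    }
}
```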

pabx06 commented 4 years ago

clang has a nice API (libtooling). I think a nice feature would be a clang plugin that produces a datatype archive.

aerosoul94 commented 4 years ago

This was my solution: castxml (uses clang) https://github.com/CastXML/CastXML and https://github.com/aerosoul94/GhidraScripts/blob/master/GhidraCastXML.py

pabx06 commented 4 years ago

@aerosoul94 good to know. I was doing it by hand: cross-compile the lib for the target with debug symbols, load it in Ghidra, create an empty file data type archive, copy-paste into it from the project datatypes, save, close the archive, then load that into the target. Problems: I had to remove some #ifdef debug fields from structures, which was very annoying, and I had to make sure changes were committed back to the archive datatypes instead of the project datatypes. Very annoying... but it works nicely.

pabx06 commented 4 years ago

Creating datatype archives is like shooting fish in a barrel: load the libs into a subdir of the project and run the script below in headless mode. I made 200 archives in a few minutes.

//Exports all data types of the current program to a file data type archive (/tmp/<name>.gdt).
//@author 
//@category _NEW_
//@keybinding 
//@menupath 
//@toolbar 

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import ghidra.app.plugin.core.datamgr.archive.SourceArchive;
import ghidra.app.script.GhidraScript;
import ghidra.util.UniversalID;
import ghidra.program.model.data.*;

public class DataTypeExport extends GhidraScript {

    private static final String ARCHIVE_CREATION_DESTINATION_DIR = "/tmp/%s.gdt";

    @Override
    public void run() throws Exception {
        // When started from the GUI, print the headless command line that batch-processes a project folder.
        if ( !isRunningHeadless() ) {
            println("$HOME/ghidra_9.1.2_PUBLIC/support/analyzeHeadless $PROJECT_DIR $PROJECT_NAME/subdir/ -scriptPath $HOME/ghidra_scripts -postScript DataTypeExport.java -process -recursive");
        }

        DataTypeManager dtm = getCurrentProgram().getDataTypeManager();
        UniversalID uid = dtm.getUniversalID();
        SourceArchive sa = dtm.getSourceArchive(uid);
        // Guard against a null source archive before asking for its type.
        println(String.format("Name=%s UID=%s SA=%s Type=%s", dtm.getName(), uid, sa, sa != null ? sa.getArchiveType() : null));

        // One archive per program, named after the program's data type manager.
        File outputFilename = new File(String.format(ARCHIVE_CREATION_DESTINATION_DIR, dtm.getName()));
        FileDataTypeManager fdtm = FileDataTypeManager.createFileArchive(outputFilename);
        try {
            List<DataType> dtList = new ArrayList<DataType>(dtm.getDataTypeCount(true));
            dtm.getAllDataTypes(dtList);

            // Copy every data type into the new archive inside a single transaction.
            int transactionID = fdtm.startTransaction("Initial Creation");
            for (DataType aDt : dtList) {
                fdtm.addDataType(aDt, DataTypeConflictHandler.REPLACE_HANDLER);
            }
            fdtm.endTransaction(transactionID, true);
            fdtm.save();
        }
        finally {
            // Always release the archive file, even if copying fails.
            fdtm.close();
        }
    }

}