NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

SLEIGH grammar files - antlr or yacc? #2518

Open nightlark opened 3 years ago

nightlark commented 3 years ago

I noticed that there are two types of grammar files in the repository, ones using ANTLRv3 that seem to be used by the Java part of Ghidra, and yacc grammar files that are part of the cpp decompiler. I'm trying to figure out what the difference between them is.

  1. Are they just different implementations of the same thing? Which one is used for parsing the slaspec/cspec/pspec/etc. files for processors (e.g. AARCH64)?
  2. If they are just different implementations of the same parser, which is more up to date? Which one would be recommended for building additional tools that work with the sleigh files?
  3. If they are for parsing different things, what is the difference?
  4. How do the various parsers fit together -- is the SLEIGH/slaspec converted using the Java parser into some intermediate sleigh representation that the C/yacc parser takes in along with the xml files as part of the decompiler?

I basically don't know anything about this part of the code, and the documentation on them (particularly the yacc files in the decompiler) is kinda lacking. What I could really use is someone to ELI5 how the various parsers/grammars in Ghidra fit together.

starfleetcadet75 commented 3 years ago

So my understanding, having played with the ANTLR files a bit myself, is:

  1. The ANTLR and yacc grammar files are both used for parsing SLEIGH specs (.slaspec). They should implement more or less the same grammar. The reason for this is that the decompiler is essentially a standalone program that you can run, so it needs to be able to load a SLEIGH spec independently from the rest of Ghidra. cspec/pspec/ldefs/opinion files are not parsed by the grammar files; they are basically just separate XML files that contain various metadata about the processors and tell Ghidra which *.slaspec files to associate with which processor defs. The relaxng specs can be found here.
  2. I've experimented a bit with the ANTLR files and would personally recommend using them over yacc.
  3. They both parse *.slaspec files.
  4. The SLEIGH compiler compiles the .slaspec and generates .sla files (XML) as output. These are what is primarily used. They can be regenerated at runtime or built ahead of time. There's a way to directly invoke the SLEIGH compiler from the command line.
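
To illustrate point 1, here is a minimal sketch of reading an .ldefs-style XML file to see which .sla and .pspec file each language ID points at. The XML snippet is a made-up sample, and the attribute names (`id`, `slafile`, `processorspec`) are assumptions that should be checked against the real files and the relaxng specs. The function name `language_to_sla` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real .ldefs files may carry more attributes and children.
SAMPLE_LDEFS = """\
<language_definitions>
  <language processor="AARCH64" endian="little" size="64" variant="v8A"
            slafile="AARCH64.sla" processorspec="AARCH64.pspec"
            id="AARCH64:LE:64:v8A">
    <description>Sample entry</description>
  </language>
</language_definitions>
"""


def language_to_sla(ldefs_text):
    """Map each language id to the (.sla, .pspec) file names it references."""
    root = ET.fromstring(ldefs_text)
    mapping = {}
    for lang in root.iter("language"):
        # Attribute names are assumed, not taken from an official schema.
        mapping[lang.get("id")] = (lang.get("slafile"), lang.get("processorspec"))
    return mapping


if __name__ == "__main__":
    for lang_id, (sla, pspec) in language_to_sla(SAMPLE_LDEFS).items():
        print(lang_id, sla, pspec)
```

No grammar is involved here: a plain XML parser is enough for these metadata files, which is why they sit outside the ANTLR/yacc parsers.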

caheckman commented 3 years ago

Both SLEIGH compilers are actively maintained and perform the same function: converting .slaspec files to the .sla format. They may give slightly different error messages but should produce identical .sla files from the same .slaspec input. See this test.
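
As a rough illustration of that equivalence check (not the linked test itself), the sketch below compares two .sla outputs byte for byte via SHA-256. It assumes you have already produced one file with each compiler; the helper names `file_digest` and `same_output` are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_output(sla_a, sla_b):
    """True if the two .sla files have identical contents."""
    return file_digest(sla_a) == file_digest(sla_b)
```

A call such as `same_output("from_antlr/x.sla", "from_yacc/x.sla")` (paths are placeholders) returns `True` only when the files match exactly.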

The yacc version is built on the decompiler C++ code, and the decompiler can be built to read in .sla files using this infrastructure. In the main build, however, the decompiler doesn't include this infrastructure but gets all its p-code from the SLEIGH engine in the Java-based part of Ghidra. This engine doesn't really care which SLEIGH compiler produced the .sla file it's using; however, it considers .sla files to be ephemeral and builds them automatically using the Java/ANTLR compiler if they're not already present.
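
The "ephemeral .sla" behaviour can be sketched roughly as below. This is not Ghidra's actual code: `ensure_sla` and `compile_fn` are hypothetical names, and the check is limited to "is the file present", as described above. Whether the real engine also looks at timestamps or versions is not assumed here.

```python
from pathlib import Path


def ensure_sla(slaspec_path, compile_fn):
    """Return the .sla path next to a .slaspec, compiling it only if absent.

    compile_fn(slaspec, sla) stands in for whichever SLEIGH compiler is
    used; it is a placeholder, not a real Ghidra API.
    """
    slaspec = Path(slaspec_path)
    sla = slaspec.with_suffix(".sla")
    if not sla.exists():
        # Missing output is treated as "needs building", nothing more.
        compile_fn(slaspec, sla)
    return sla
```

Because the consumer only sees the resulting .sla file, either compiler could be plugged in as `compile_fn` without the caller noticing.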

For tools that work with SLEIGH source, you can use whatever compiler code base is more convenient; yacc for native code or ANTLR for Java-based. Any proposed change to SLEIGH itself would need to be implemented in both code bases.