dtolnay / clang-ast

Deserialization logic for efficiently processing Clang's `-ast-dump=json` format
Apache License 2.0

-ast-dump=json


This library provides deserialization logic for efficiently processing Clang's -ast-dump=json format from Rust.

[dependencies]
clang-ast = "0.1"


Format overview

An AST dump is generated by a compiler command like:

$ clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc

The high-level structure is a tree of nodes, each of which has an "id" and a "kind", zero or more further fields depending on what the node kind is, and finally an optional "inner" array of child nodes.

As an example, for an input file containing just the declaration class S;, the AST would be as follows:

{
  "id": "0x1fcea38",                 //<-- root node
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "id": "0xadf3a8",              //<-- first child node
      "kind": "CXXRecordDecl",
      "loc": {
        "offset": 6,
        "file": "source.cc",
        "line": 1,
        "col": 7,
        "tokLen": 1
      },
      "range": {
        "begin": {
          "offset": 0,
          "col": 1,
          "tokLen": 5
        },
        "end": {
          "offset": 6,
          "col": 7,
          "tokLen": 1
        }
      },
      "name": "S",
      "tagUsed": "class"
    }
  ]
}
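
The tree shape described above can be modeled in a few lines. The following is a std-only sketch, not the clang-ast API: `RawNode`, `count_nodes`, `kinds_preorder`, and `example_tree` are hypothetical names used only for illustration, and the kind-specific fields ("loc", "range", "name", ...) are left out.

```rust
/// Hypothetical stand-in for one node of the dump: an id, a kind,
/// and a list of child nodes (empty when "inner" is absent).
pub struct RawNode {
    pub id: String,
    pub kind: String,
    pub inner: Vec<RawNode>,
}

/// Counts this node plus all of its descendants.
pub fn count_nodes(node: &RawNode) -> usize {
    1 + node.inner.iter().map(count_nodes).sum::<usize>()
}

/// Collects the "kind" of every node in depth-first, parent-first order.
pub fn kinds_preorder<'a>(node: &'a RawNode, out: &mut Vec<&'a str>) {
    out.push(node.kind.as_str());
    for child in &node.inner {
        kinds_preorder(child, out);
    }
}

/// Builds the two-node tree from the `class S;` example by hand.
pub fn example_tree() -> RawNode {
    RawNode {
        id: "0x1fcea38".to_owned(),
        kind: "TranslationUnitDecl".to_owned(),
        inner: vec![RawNode {
            id: "0xadf3a8".to_owned(),
            kind: "CXXRecordDecl".to_owned(),
            inner: Vec::new(),
        }],
    }
}
```

Calling `kinds_preorder` on `example_tree()` visits the root first and then its single child.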


Library design

By design, the clang-ast crate does not provide a single great big data structure that exhaustively covers every possible field of every possible Clang node type. There are three major reasons:


Data structures

The core data structure of the clang-ast crate is Node<T>.

pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

The caller must provide their own kind type T, which is an enum or struct as described below. T determines exactly what information the clang-ast crate will deserialize out of the AST dump.

By convention you should name your T type Clang.


T = enum

Most often, you'll want Clang to be an enum. In this case your enum must have one variant per node kind that you care about. The name of each variant matches the "kind" entry seen in the AST.

Additionally there must be a fallback variant, which must be named either Unknown or Other, into which clang-ast will put all tree nodes not matching one of the expected kinds.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

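fn main() {
    // Read an AST dump previously written by Clang.
    let json = std::fs::read_to_string("ast.json").unwrap();
    // Deserialize only the node kinds and fields named in `Clang`.
    let node: Node = serde_json::from_str(&json).unwrap();
}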

The above is a simple example with variants for processing "kind": "NamespaceDecl", "kind": "EnumDecl", and "kind": "EnumConstantDecl" nodes. This is sufficient to extract the set of variants of every enum in the translation unit, and the enums' namespace (possibly anonymous) and enum name (possibly anonymous).
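
As a sketch of that extraction, the function below walks a tree and records each enum constant together with its enclosing namespaces and enum. To stay free of external dependencies it uses local stand-ins: `Node<T>` mirrors the struct shown earlier with a plain `u64` in place of `clang_ast::Id`, `Clang` omits the `Deserialize` derive, and `collect_enum_constants` is a hypothetical helper, not part of the crate.

```rust
/// Local stand-in for clang_ast::Node<T> (assumption: u64 replaces clang_ast::Id).
pub struct Node<T> {
    pub id: u64,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

/// Same variants as the enum above, without the Deserialize derive.
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

/// Appends one "namespace::Enum::Constant" string per enum constant to `out`.
/// `path` holds the names of the enclosing namespaces and enum.
pub fn collect_enum_constants(
    node: &Node<Clang>,
    path: &mut Vec<String>,
    out: &mut Vec<String>,
) {
    // Namespaces and enums contribute one path segment each;
    // anonymous ones get a placeholder name.
    let segment = match &node.kind {
        Clang::NamespaceDecl { name } | Clang::EnumDecl { name } => {
            Some(name.clone().unwrap_or_else(|| "(anonymous)".to_owned()))
        }
        Clang::EnumConstantDecl { name } => {
            let mut full = path.clone();
            full.push(name.clone());
            out.push(full.join("::"));
            None
        }
        Clang::Other => None,
    };

    let pushed = segment.is_some();
    if let Some(segment) = segment {
        path.push(segment);
    }
    for child in &node.inner {
        collect_enum_constants(child, path, out);
    }
    // Leave the scope again once all children have been visited.
    if pushed {
        path.pop();
    }
}
```

With the real crate, the same traversal should apply to a `Node` obtained from `serde_json::from_str`, since only the `kind` and `inner` fields are consulted.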

Newtype variants are fine too, particularly if you'll be deserializing more than one field for some nodes.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    EnumConstantDecl(EnumConstantDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumConstantDecl {
    pub name: String,
}


T = struct

Rarely, it can make sense to instantiate Node with Clang being a struct type, instead of an enum. This allows for deserializing a uniform group of data out of every node in the syntax tree.

The following example struct collects the "loc" and "range" of every node if present; these fields provide the file name / line / column position of nodes. Not every node kind contains this information, so we use Option to collect it for just the nodes that have it.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    pub loc: Option<clang_ast::SourceLocation>,
    pub range: Option<clang_ast::SourceRange>,
}

If you really need to, it's also possible to store every other piece of key/value information about every node via a weakly typed Map<String, Value> and the Serde flatten attribute.

use serde::Deserialize;
use serde_json::{Map, Value};

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String,  // or clang_ast::Kind
    #[serde(flatten)]
    pub data: Map<String, Value>,
}


Hybrid approach

To deserialize kind-specific information about a fixed set of node kinds you care about, as well as some uniform information about every other kind of node, you can use a hybrid of the two approaches by giving your Other / Unknown fallback variant some fields.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    Other {
        kind: clang_ast::Kind,
    },
}
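
One use for the kind stored in the fallback variant is to find out which node kinds are currently going unhandled. The sketch below tallies them. It is std-only and built on assumptions: `Node<T>` is a local stand-in for `clang_ast::Node<T>`, a `String` stands in for `clang_ast::Kind`, struct variants replace the newtype variants for brevity, and `tally_other_kinds` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Local stand-in for clang_ast::Node<T> (assumption: u64 replaces clang_ast::Id).
pub struct Node<T> {
    pub id: u64,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

/// Hybrid kind type; `String` stands in for clang_ast::Kind here.
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    Other { kind: String },
}

/// Counts how often each unhandled kind occurs in the tree.
pub fn tally_other_kinds(node: &Node<Clang>, counts: &mut BTreeMap<String, usize>) {
    if let Clang::Other { kind } = &node.kind {
        *counts.entry(kind.clone()).or_insert(0) += 1;
    }
    for child in &node.inner {
        tally_other_kinds(child, counts);
    }
}
```

A `BTreeMap` keeps the report sorted by kind name; a `HashMap` would work equally well if ordering does not matter.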


Source locations

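Many node kinds expose the source location of the corresponding source code tokens: the file path, the line and column, the offset and token length, and, where applicable, the file from which a header was included.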

You'll find this information in fields called "loc" and/or "range" in the JSON representation.

{
  "id": "0x1251428",
  "kind": "NamespaceDecl",
  "loc": {                           //<--
    "offset": 7004,
    "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
    "line": 258,
    "col": 11,
    "tokLen": 3,
    "includedFrom": {
      "file": "/usr/include/c++/10/utility"
    }
  },
  "range": {                         //<--
    "begin": {
      "offset": 6994,
      "col": 1,
      "tokLen": 9
    },
    "end": {
      "offset": 7155,
      "line": 266,
      "col": 1,
      "tokLen": 1
    }
  },
  ...
}

The naive deserialization of these structures is challenging to work with because Clang uses field omission to mean "same as previous". So if a "loc" is printed without a "file" inside, it means the loc is in the same file as the immediately previous loc in serialization order.
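
To make the "same as previous" rule concrete, here is a std-only sketch that fills in omitted fields by remembering the last values seen. It illustrates the rule only and is not the crate's implementation: `RawLoc`, `ResolvedLoc`, and `LocResolver` are hypothetical types. The "line" field is carried forward in the same way as "file", which matches the example above, where "begin" has no "line".

```rust
use std::sync::Arc;

/// A location as printed in the dump; omitted fields are `None`.
/// Hypothetical type for illustration only.
pub struct RawLoc {
    pub file: Option<String>,
    pub line: Option<u32>,
    pub col: u32,
}

/// A location with the omitted fields filled in.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedLoc {
    pub file: Arc<str>,
    pub line: u32,
    pub col: u32,
}

/// Remembers the most recently seen file and line.
#[derive(Default)]
pub struct LocResolver {
    last_file: Option<Arc<str>>,
    last_line: Option<u32>,
}

impl LocResolver {
    /// Resolves `raw` against the previous location in serialization order.
    /// Returns `None` if a field is omitted before any value has been seen.
    pub fn resolve(&mut self, raw: &RawLoc) -> Option<ResolvedLoc> {
        if let Some(file) = &raw.file {
            self.last_file = Some(Arc::from(file.as_str()));
        }
        if let Some(line) = raw.line {
            self.last_line = Some(line);
        }
        Some(ResolvedLoc {
            // Cloning an Arc only bumps a reference count, so consecutive
            // locations in one file share a single path allocation.
            file: self.last_file.clone()?,
            line: self.last_line?,
            col: raw.col,
        })
    }
}
```

Locations must be fed to `resolve` in the order they appear in the dump, because each result depends on the one before it.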

The clang-ast crate provides types for deserializing this source location information painlessly, producing Arc<str> as the type of filepaths which may be shared across multiple source locations.

use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
    pub loc: clang_ast::SourceLocation,    //<--
    pub range: clang_ast::SourceRange,     //<--
}


Node identifiers

Every syntax tree node has an "id". In JSON it's the memory address of Clang's internal memory allocation for that node, serialized to a hex string.

The AST dump uses ids as backreferences wherever the syntax tree is really a directed acyclic graph, that is, where one node refers to another node that lives elsewhere in the tree. For example the following MemberExpr node is part of the invocation of an operator bool conversion, and thus its syntax tree refers to the resolved operator bool conversion function declaration:

{
  "id": "0x9918b88",
  "kind": "MemberExpr",
  "valueCategory": "rvalue",
  "referencedMemberDecl": "0x12d8330",     //<--
  ...
}

The node it references, with memory address 0x12d8330, is found somewhere earlier in the syntax tree:

{
  "id": "0x12d8330",                       //<--
  "kind": "CXXConversionDecl",
  "name": "operator bool",
  "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
  "type": {
    "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
  },
  "constexpr": true,
  ...
}

Due to the ubiquitous use of ids for backreferencing, it is valuable to deserialize them not as strings but as 64-bit integers. The clang-ast crate provides an Id type for this purpose, which is cheap to copy and can be hashed and compared more cheaply than a string. You may find yourself with lots of hashtables keyed on Id.
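
The following std-only sketch shows the idea behind integer ids: parse the hex string once, then use the integer as a hash key to resolve backreferences. It does not show how `clang_ast::Id` is implemented; `parse_id` and `index_kinds` are hypothetical helpers, and with the real crate the `Id` type takes care of the parsing during deserialization.

```rust
use std::collections::HashMap;

/// Parses an id such as "0x12d8330" into a 64-bit integer.
/// Returns `None` if the "0x" prefix is missing or the rest
/// does not parse as a base-16 number.
pub fn parse_id(text: &str) -> Option<u64> {
    let digits = text.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

/// Builds a lookup table from integer id to node kind, skipping
/// entries whose id does not parse.
pub fn index_kinds<'a>(nodes: &[(&str, &'a str)]) -> HashMap<u64, &'a str> {
    nodes
        .iter()
        .filter_map(|&(id, kind)| Some((parse_id(id)?, kind)))
        .collect()
}
```

Given such an index, resolving the "referencedMemberDecl" of the MemberExpr above is a single map lookup on the parsed value of "0x12d8330".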


License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.


Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.