Fraunhofer-AISEC / cpg

A library to extract Code Property Graphs from C/C++, Java, Go, Python, Ruby and every other language through LLVM-IR.
https://fraunhofer-aisec.github.io/cpg/
Apache License 2.0

CXX preprocessor #688

Open peckto opened 2 years ago

peckto commented 2 years ago

For the tree-sitter language frontend (#604 #608) we need to take care of the preprocessor ourselves.

To update the location properties accordingly, we want to do the preprocessing in the kotlintree. The preprocessor operates on the tree-sitter parse tree, resolves macros and updates the location property. The parse tree will then be handed over to the cpg. In the future, we might also support loading already preprocessed code as generated with gcc -E (see #719).

The following example should outline the basic features and challenges for the preprocessor:

main.c

#include "config.h"

FUNC(void, RTE_SWC_CODE) foo() {
#if MODE == AUTO
    printf("MODE = AUTO\n");
#else
    printf("MODE = OTHER\n");
#endif
}

int main() {
#ifdef DEBUG
    printf("DEBUG\n");
#endif
    foo();

    MACRO(VERSION, "foo")

    return 0;
}

config.h

#define AUTO 1
#define OTHER 0
#define MODE OTHER
//#define MODE AUTO
#define VERSION 1

// AUTOSAR specific macro
#define FUNC(type, memclass) type

#define MACRO(num, str) {\
            printf("%d", num);\
            printf(" is");\
            printf(" %s number", str);\
            printf("\n");\
           }
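
As a rough illustration of what "resolving macros" means for the conditional in foo(), here is a minimal Python sketch. It is not the planned tree-sitter based implementation: it works on plain text lines, only knows the object-like macros from config.h, and supports a single unnested #if/#ifdef/#else/#endif with == or != comparisons. The names expand, eval_condition and select_branch are made up for this example. Inactive lines and directives are blanked instead of dropped, so the line numbers of the surviving code stay the same.

```python
import re

# Object-like macros from config.h above; function-like macros are out of scope.
DEFINES = {"AUTO": "1", "OTHER": "0", "MODE": "OTHER", "VERSION": "1"}

IDENT = re.compile(r"\b[A-Za-z_]\w*")


def expand(text, defines, active=frozenset()):
    """Recursively replace identifiers that name object-like macros."""
    def repl(match):
        name = match.group(0)
        if name in defines and name not in active:
            # 'active' keeps a macro from expanding inside its own expansion.
            return expand(defines[name], defines, active | {name})
        return name
    return IDENT.sub(repl, text)


def eval_condition(expr, defines):
    """Evaluate '<int>', 'A == B' or 'A != B' after macro expansion."""
    # Identifiers that are still unknown after expansion count as 0, as in C.
    expanded = IDENT.sub("0", expand(expr, defines))
    m = re.fullmatch(r"\s*(\d+)\s*(==|!=)\s*(\d+)\s*", expanded)
    if m:
        left, op, right = int(m.group(1)), m.group(2), int(m.group(3))
        return (left == right) if op == "==" else (left != right)
    return int(expanded) != 0  # raises ValueError for unsupported expressions


def select_branch(lines, defines):
    """Blank out directives and inactive lines (no nesting, no #elif)."""
    out, keep = [], True
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#if "):
            keep = eval_condition(stripped[4:], defines)
            out.append("")
        elif stripped.startswith("#ifdef "):
            keep = stripped[7:].strip() in defines
            out.append("")
        elif stripped == "#else":
            keep = not keep
            out.append("")
        elif stripped == "#endif":
            keep = True
            out.append("")
        else:
            out.append(line if keep else "")
    return out


FOO_BODY = [
    "#if MODE == AUTO",
    '    printf("MODE = AUTO\\n");',
    "#else",
    '    printf("MODE = OTHER\\n");',
    "#endif",
]
RESULT = select_branch(FOO_BODY, DEFINES)
```

With the definitions above, MODE expands to OTHER and then to 0, so the condition becomes 0 == 1 and only the #else branch survives in RESULT.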

Run gcc preprocessor:

$ gcc -E main.c
# 1 "main.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 341 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "main.c" 2
# 1 "./config.h" 1
# 2 "main.c" 2

void foo() {

 printf("MODE = OTHER\n");

}

int main() {

 foo();

    { printf("%d", 1); printf(" is"); printf(" %s number", "foo"); printf("\n"); }

 return 0;
}
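
If already preprocessed code is loaded (see #719), the original locations could be recovered from the `# <line> "<file>" <flags>` lines in this output. The following Python sketch shows the idea under a few assumptions: map_locations is a made-up helper, the flags are ignored, escape sequences in file names are not handled, and SAMPLE is a hand-written excerpt in the style of the output above with the blank lines that stand in for the removed directives.

```python
import re

# A linemarker: '#', a line number, a quoted file name and optional numeric flags.
LINEMARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"(?:\s+\d+)*\s*$')


def map_locations(preprocessed):
    """Yield (file, line, text) for every non-empty, non-marker output line."""
    current_file, current_line = None, 0
    for raw in preprocessed.splitlines():
        m = LINEMARKER.match(raw)
        if m:
            # The marker states where the NEXT output line originated.
            current_line = int(m.group(1))
            current_file = m.group(2)
            continue
        if raw.strip():
            yield current_file, current_line, raw
        # Every output line, blank or not, advances the original line number.
        current_line += 1


SAMPLE = r'''# 1 "main.c"
# 1 "./config.h" 1
# 2 "main.c" 2

void foo() {



 printf("MODE = OTHER\n");

}
'''

LOCATIONS = list(map_locations(SAMPLE))
```

For SAMPLE, the three code lines map to main.c lines 3, 7 and 9, which agrees with the line numbers clang reports for foo in the AST dump below. Column information inside an expanded macro is lost with this approach, since the markers only carry line numbers.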

Create clang AST:

$ clang -Xclang -ast-dump -fsyntax-only main.c
TranslationUnitDecl 0x557a12e37a38 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x557a12e382f0 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
| `-BuiltinType 0x557a12e37fd0 '__int128'
|-TypedefDecl 0x557a12e38360 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
| `-BuiltinType 0x557a12e37ff0 'unsigned __int128'
|-TypedefDecl 0x557a12e38668 <<invalid sloc>> <invalid sloc> implicit __NSConstantString 'struct __NSConstantString_tag'
| `-RecordType 0x557a12e38440 'struct __NSConstantString_tag'
|   `-Record 0x557a12e383b8 '__NSConstantString_tag'
|-TypedefDecl 0x557a12e38700 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
| `-PointerType 0x557a12e386c0 'char *'
|   `-BuiltinType 0x557a12e37ad0 'char'
|-TypedefDecl 0x557a12e78640 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'struct __va_list_tag [1]'
| `-ConstantArrayType 0x557a12e389a0 'struct __va_list_tag [1]' 1
|   `-RecordType 0x557a12e387e0 'struct __va_list_tag'
|     `-Record 0x557a12e38758 '__va_list_tag'
|-FunctionDecl 0x557a12e786e8 <main.c:3:6, line:9:1> line:3:26 used foo 'void ()'
| `-CompoundStmt 0x557a12e78e98 <col:32, line:9:1>
|   `-CallExpr 0x557a12e78e40 <line:7:2, col:25> 'int'
|     |-ImplicitCastExpr 0x557a12e78e28 <col:2> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
|     | `-DeclRefExpr 0x557a12e78d60 <col:2> 'int (const char *, ...)' Function 0x557a12e78bc0 'printf' 'int (const char *, ...)'
|     `-ImplicitCastExpr 0x557a12e78e80 <col:9> 'const char *' <NoOp>
|       `-ImplicitCastExpr 0x557a12e78e68 <col:9> 'char *' <ArrayToPointerDecay>
|         `-StringLiteral 0x557a12e78db8 <col:9> 'char [14]' lvalue "MODE = OTHER\n"
|-FunctionDecl 0x557a12e78bc0 <col:2> col:2 implicit used printf 'int (const char *, ...)' extern
| |-ParmVarDecl 0x557a12e78cb8 <<invalid sloc>> <invalid sloc> 'const char *'
| |-BuiltinAttr 0x557a12e78c60 <<invalid sloc>> Implicit 794
| `-FormatAttr 0x557a12e78d28 <col:2> Implicit printf 1 2
`-FunctionDecl 0x557a12e78f00 <line:11:1, line:21:1> line:11:5 main 'int ()'
  `-CompoundStmt 0x557a12e79500 <col:12, line:21:1>
    |-CallExpr 0x557a12e79000 <line:15:2, col:6> 'void'
    | `-ImplicitCastExpr 0x557a12e78fe8 <col:2> 'void (*)()' <FunctionToPointerDecay>
    |   `-DeclRefExpr 0x557a12e78fa0 <col:2> 'void ()' Function 0x557a12e786e8 'foo' 'void ()'
    |-CompoundStmt 0x557a12e794a0 <./config.h:10:25, line:15:12>
    | |-CallExpr 0x557a12e790e8 <line:11:13, col:29> 'int'
    | | |-ImplicitCastExpr 0x557a12e790d0 <col:13> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    | | | `-DeclRefExpr 0x557a12e79020 <col:13> 'int (const char *, ...)' Function 0x557a12e78bc0 'printf' 'int (const char *, ...)'
    | | |-ImplicitCastExpr 0x557a12e79130 <col:20> 'const char *' <NoOp>
    | | | `-ImplicitCastExpr 0x557a12e79118 <col:20> 'char *' <ArrayToPointerDecay>
    | | |   `-StringLiteral 0x557a12e79078 <col:20> 'char [3]' lvalue "%d"
    | | `-IntegerLiteral 0x557a12e79098 <line:5:17> 'int' 1
    | |-CallExpr 0x557a12e791f8 <line:12:13, col:25> 'int'
    | | |-ImplicitCastExpr 0x557a12e791e0 <col:13> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    | | | `-DeclRefExpr 0x557a12e79148 <col:13> 'int (const char *, ...)' Function 0x557a12e78bc0 'printf' 'int (const char *, ...)'
    | | `-ImplicitCastExpr 0x557a12e79238 <col:20> 'const char *' <NoOp>
    | |   `-ImplicitCastExpr 0x557a12e79220 <col:20> 'char *' <ArrayToPointerDecay>
    | |     `-StringLiteral 0x557a12e791a8 <col:20> 'char [4]' lvalue " is"
    | |-CallExpr 0x557a12e79320 <line:13:13, col:37> 'int'
    | | |-ImplicitCastExpr 0x557a12e79308 <col:13> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    | | | `-DeclRefExpr 0x557a12e79250 <col:13> 'int (const char *, ...)' Function 0x557a12e78bc0 'printf' 'int (const char *, ...)'
    | | |-ImplicitCastExpr 0x557a12e79368 <col:20> 'const char *' <NoOp>
    | | | `-ImplicitCastExpr 0x557a12e79350 <col:20> 'char *' <ArrayToPointerDecay>
    | | |   `-StringLiteral 0x557a12e792a8 <col:20> 'char [11]' lvalue " %s number"
    | | `-ImplicitCastExpr 0x557a12e79380 <main.c:17:20> 'char *' <ArrayToPointerDecay>
    | |   `-StringLiteral 0x557a12e792d0 <col:20> 'char [4]' lvalue "foo"
    | `-CallExpr 0x557a12e79448 <./config.h:14:13, col:24> 'int'
    |   |-ImplicitCastExpr 0x557a12e79430 <col:13> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    |   | `-DeclRefExpr 0x557a12e79398 <col:13> 'int (const char *, ...)' Function 0x557a12e78bc0 'printf' 'int (const char *, ...)'
    |   `-ImplicitCastExpr 0x557a12e79488 <col:20> 'const char *' <NoOp>
    |     `-ImplicitCastExpr 0x557a12e79470 <col:20> 'char *' <ArrayToPointerDecay>
    |       `-StringLiteral 0x557a12e793f8 <col:20> 'char [2]' lvalue "\n"
    `-ReturnStmt 0x557a12e794f0 <main.c:20:2, col:9>
      `-IntegerLiteral 0x557a12e794d0 <col:9> 'int' 0

oxisto commented 2 years ago

Thanks for the write-up. I would avoid the term annotation. Basically, it seems that # 4 "main.c" is a shorthand version of the #line preprocessor directive: https://docs.microsoft.com/en-us/cpp/preprocessor/hash-line-directive-c-cpp?view=msvc-170

peckto commented 2 years ago

> Thanks for the write-up. I would avoid the term annotation. Basically, it seems that # 4 "main.c" is a shorthand version of the #line preprocessor directive: https://docs.microsoft.com/en-us/cpp/preprocessor/hash-line-directive-c-cpp?view=msvc-170

Thanks for the hint! That was not clear to me. If it's an official preprocessor directive, it's even better. I updated the issue description accordingly.

peckto commented 2 years ago

I had some more thoughts on tree-sitter, and I think I see things more clearly now. If I understand correctly, the tree-sitter parse tree can only be updated by reloading the source file. My misconception was that we could update the parse tree directly, which would allow us to update the locations the way we want when resolving macros. But if we need to update the source file anyhow, what is the difference from using an external preprocessor like gcc -E? If we need the step via the source file, I don't see how our own preprocessor implementation could do any better than gcc -E. Or am I missing something?

peckto commented 2 years ago

Currently, the cpg handles macro resolution in the following way: (image)

Code and file location point to the unresolved macro, but the node type and its specific properties (e.g. name) point to the resolved macro. Do we want to stay with this notion? Can we do better? If I remember correctly, there is the special situation where macros are used to expose the public API of a piece of software (as with OpenSSL). In this situation, the user might expect to see the macro instead of the resolved, internal function name. Is there a strong requirement from the Codyze side?
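
To make the question concrete, here is a small Python sketch of the current notion plus one conceivable extension. None of it is the actual cpg data model: Location, CallNode, the macro_name field and display_name are hypothetical, and so is the macro `#define API_INIT(x) internal_init_v2(x)` used as example data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    file: str
    line: int
    column: int


@dataclass(frozen=True)
class CallNode:
    name: str                          # callee AFTER macro resolution
    code: str                          # source text as written, macro unresolved
    location: Location                 # position of that source text
    macro_name: Optional[str] = None   # hypothetical: macro the node came from


def display_name(node, prefer_macro):
    """Pick the name to show: the macro if requested and known, else the callee."""
    if prefer_macro and node.macro_name is not None:
        return node.macro_name
    return node.name


# Hypothetical call site 'API_INIT(ctx);' in main.c, line 5, column 5.
NODE = CallNode(
    name="internal_init_v2",
    code="API_INIT(ctx)",
    location=Location("main.c", 5, 5),
    macro_name="API_INIT",
)
```

Keeping the macro name next to the resolved name would let a consumer such as Codyze decide which view it wants, at the cost of one more property per node.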

oxisto commented 2 years ago

> Currently, the cpg handles macro resolution in the following way: (image)
>
> Code and file location point to the unresolved macro, but the node type and its specific properties (e.g. name) point to the resolved macro. Do we want to stay with this notion? Can we do better? If I remember correctly, there is the special situation where macros are used to expose the public API of a piece of software (as with OpenSSL). In this situation, the user might expect to see the macro instead of the resolved, internal function name. Is there a strong requirement from the Codyze side?

cc @fwendland for Codyze