Fraunhofer-AISEC / cpg

A library to extract Code Property Graphs from C/C++, Java, Go, Python, Ruby and every other language through LLVM-IR.
https://fraunhofer-aisec.github.io/cpg/
Apache License 2.0

Discussion on types and type name normalisation #1533

Open oxisto opened 1 month ago

oxisto commented 1 month ago

Implementation Status

### Tasks
- [ ] https://github.com/Fraunhofer-AISEC/cpg/issues/1539
- [ ] https://github.com/Fraunhofer-AISEC/cpg/issues/1540
- [ ] Rework type resolution
- [ ] https://github.com/Fraunhofer-AISEC/cpg/issues/1535

Motivation

Note: In the following I will mainly use C++ code as motivating examples, but the issue affects other languages as well. C++ simply has the richest set of FQN, namespace and scoping features of all the languages we support.

The current way we handle types has a major drawback: to save on the creation of Type nodes, we have a rather intricate system in the objectType function that checks whether a type with the given name already exists. Consider the following small C++ snippet:

class MyClass {
  MyClass(MyClass* other) {
    this->field = other->field;
  }
  int field;
};

Instead of two different MyClass type nodes we end up with only one, which saves space. This also (sort of) works if we are in a simple namespace:

namespace awesome {
class MyClass {
  MyClass(MyClass* other) {
    this->field = other->field;
  }
  int field;
};
}

In this case, we already have trouble telling the types apart: during frontend translation we need to ask the scope manager for the current namespace and prepend it to the type name. We want to avoid this, and it only works to a limited extent anyway. In this case we end up with 2 awesome::MyClass nodes.
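To make the trade-off concrete, here is a minimal Python sketch of a name-keyed type cache. It is not the actual objectType implementation; `TypeCache`, `object_type` and the `namespace` parameter are hypothetical names used only for illustration. It shows that the cache deduplicates correctly only if the frontend supplies the enclosing namespace on every call.

```python
class Type:
    def __init__(self, name):
        self.name = name


class TypeCache:
    """Hypothetical stand-in for a name-keyed cache of type nodes."""

    def __init__(self):
        self._types = {}

    def object_type(self, name, namespace=None):
        # The caller (the frontend) has to know the enclosing namespace.
        # Without it, the same class ends up under two different keys.
        fqn = f"{namespace}::{name}" if namespace else name
        if fqn not in self._types:
            self._types[fqn] = Type(fqn)
        return self._types[fqn]


cache = TypeCache()
a = cache.object_type("MyClass", namespace="awesome")
b = cache.object_type("MyClass", namespace="awesome")
c = cache.object_type("MyClass")  # namespace unknown at this point
```

Here `a` and `b` share one node, while `c` becomes a separate node named just `MyClass`, which is the kind of mismatch described above.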

The problems continue, for example, if we have a non-qualified static call inside the namespace:

class OutsideClass {
public:
  static void doStatic() {}
};
namespace awesome {
class OtherClass {
public:
  static void doStatic() {}
};
class MyClass {
  MyClass() {}
  MyClass(MyClass* other) {
    this->field = other->field;

    // type resolution works here
    OutsideClass::doStatic();
    // type resolution currently fails here
    OtherClass::doStatic();
  }
  int field;
};
}

Currently, in this case, type resolution for OtherClass fails because we did not prefix the static call with the current namespace.

Further problems arise if we want to properly support things like using in C++, which "imports" either a namespace or a symbol, so that we potentially need to search several namespaces in order to find a match:

#include <string>
using namespace std;
int main() {
   // we only learn that "string" belongs to the `std` namespace once we know all the types
   string s;
}

This is comparable to Python, where we can import everything from a module into the current scope:

from sys import *
print(argv)

Problems to tackle

Problem 1: Type FQN

A possible solution would be to do only limited type parsing during frontend translation and perform a "type resolution" in a later pass (e.g. the TypeResolver or a separate one). In the examples above we would only parse MyClass as the type during the frontend and later resolve its name to awesome::MyClass. This would also help with partially qualified names (possible in C++, see https://github.com/Fraunhofer-AISEC/cpg/issues/1126 for details).
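A rough Python sketch of what such a late resolution step could do, under the assumption that the set of declared FQNs is known once all frontends have run. `candidate_fqns` and `resolve_type` are hypothetical helpers, not existing CPG functions, and real C++ lookup has more rules than this.

```python
def candidate_fqns(name, current_namespace):
    """Yield fully qualified candidates, innermost namespace first."""
    parts = current_namespace.split("::") if current_namespace else []
    while parts:
        yield "::".join(parts + [name])
        parts.pop()
    yield name  # finally, try the global namespace


def resolve_type(name, current_namespace, declared):
    """Return the first candidate that names a declared type, else None."""
    for fqn in candidate_fqns(name, current_namespace):
        if fqn in declared:
            return fqn
    return None


# Assumed to be collected from all record declarations before resolution.
declared = {"OutsideClass", "awesome::OtherClass", "awesome::MyClass"}

inside = resolve_type("OtherClass", "awesome", declared)
outside = resolve_type("OutsideClass", "awesome", declared)
```

Because the candidates are built by prefixing the written name, a partially qualified name such as `b::T` used inside namespace `a` is tried as `a::b::T` first and as `b::T` second.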

In order to do so, I propose a new class TypeReference. This class is a sort of hybrid between a Reference and a Type. It holds a refersTo to a declaration; or rather a specific subset of declarations that can declare types, but it derives from Type in order to support comparison to other types. We could use the same logic in resolving those types as the regular symbol resolver, because in the end even references to types are just symbols. This could make it necessary to make the type property an AST edge.

To differentiate which declarations can declare types, a DeclaresType interface might be a good idea; refersTo could then be limited to that interface.
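The following Python sketch illustrates the idea; the real classes would live in the Kotlin code base. Everything except the names TypeReference, DeclaresType and refersTo is an assumption: `declared_type`, `resolve` and the name-based equality are made up for this example, and the reference is assumed to start out unresolved.

```python
class Type:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Name-based comparison, assumed here for simplicity.
        return isinstance(other, Type) and self.name == other.name


class DeclaresType:
    """Marker for declarations that introduce a type."""
    declared_type: Type


class RecordDeclaration(DeclaresType):
    def __init__(self, fqn):
        self.declared_type = Type(fqn)


class TypeReference(Type):
    """Derives from Type, but also points to a declaration like a Reference."""

    def __init__(self, name):
        super().__init__(name)
        self.refers_to = None  # filled in by a later resolution pass

    def resolve(self, declaration):
        if not isinstance(declaration, DeclaresType):
            raise TypeError("refers_to must point to a DeclaresType")
        self.refers_to = declaration
        # Adopt the declared name so comparison with other types works.
        self.name = declaration.declared_type.name


ref = TypeReference("MyClass")               # as parsed by the frontend
decl = RecordDeclaration("awesome::MyClass")
equal_before = ref == decl.declared_type     # names still differ
ref.resolve(decl)
equal_after = ref == decl.declared_type
```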

The TypeReference would probably have two states:

Problem 2: Symbol Importing

Currently, a Scope has a list of valueDeclarations and structureDeclarations. I propose to replace these with a map of symbols that uses the (local) name as key and holds the list of declarations available under that name in this scope.

/**
 * A symbol is a simple, local name. It is valid within the scope that declares it and all of its
 * child scopes. However, a child scope can usually "shadow" a symbol of a higher scope.
 */
typealias Symbol = String

var symbols = mutableMapOf<Symbol, List<Declaration>>()

This would make it easier to look up the declarations for a specific name within a scope. A simple lookup algorithm for a name could look like this:
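One possible lookup, sketched in Python under the assumption that each scope knows its parent and that the innermost scope containing the name wins (shadowing). The `parent` field and the `add_symbol` and `lookup` methods are hypothetical; declarations are plain strings here to keep the example short.

```python
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}  # local name -> list of declarations

    def add_symbol(self, name, declaration):
        self.symbols.setdefault(name, []).append(declaration)

    def lookup(self, name):
        """Return the declarations of the innermost scope that knows `name`."""
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]  # inner scopes shadow outer ones
            scope = scope.parent
        return []  # not declared in this scope or any of its parents


global_scope = Scope()
global_scope.add_symbol("field", "global field")
global_scope.add_symbol("argv", "argv declaration")

function_scope = Scope(parent=global_scope)
function_scope.add_symbol("field", "local field")
```

Returning a list rather than a single declaration leaves room for overloads, which would have to be narrowed down by the caller.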

We can have a new node type called ImportDeclaration that records the import during frontend translation; the imports are then resolved in a pass. We also need to resolve the imports of a symbol that contains an ImportDeclaration. To do that, a dedicated ImportResolver pass resolves the imported symbols of all ImportDeclaration nodes. This pass needs to be executed very early.
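A small Python sketch of what such a resolver could do, assuming that importing simply copies declarations into the symbol map of the importing scope. The fields `source` and `symbol`, the `imports` list and the function `resolve_imports` are hypothetical; only ImportDeclaration and the symbol map come from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    symbols: dict = field(default_factory=dict)  # name -> declarations
    imports: list = field(default_factory=list)  # ImportDeclaration nodes


@dataclass
class ImportDeclaration:
    source: str        # name of the imported namespace or module
    symbol: str = "*"  # a single symbol, or "*" for everything


def resolve_imports(scope, namespaces):
    """Copy imported declarations into the importing scope's symbol map."""
    for imp in scope.imports:
        source = namespaces[imp.source]
        names = list(source.symbols) if imp.symbol == "*" else [imp.symbol]
        for name in names:
            for decl in source.symbols.get(name, []):
                scope.symbols.setdefault(name, []).append(decl)


std = Scope(symbols={"string": ["std::string"], "vector": ["std::vector"]})

# Corresponds to `using namespace std;`
wildcard = Scope(imports=[ImportDeclaration("std")])
resolve_imports(wildcard, {"std": std})

# Corresponds to `using std::string;`
single = Scope(imports=[ImportDeclaration("std", "string")])
resolve_imports(single, {"std": std})
```

After this step, the importing scope can answer lookups for `string` without the symbol resolver having to know about imports at all, which is one reason to run the pass early.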

Problem 3: Aliases

Not sure yet. If we have the symbol map from the previous approach, we could maybe have an AliasDeclaration node that lives in the symbol map but points to the original declaration.
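Purely as a thought experiment, this could look roughly like the Python sketch below. The `target` field and the `unwrap` helper are hypothetical, and whether aliases should be followed eagerly or on demand is left open.

```python
class Declaration:
    def __init__(self, name):
        self.name = name


class AliasDeclaration(Declaration):
    """Lives in the symbol map under its own name, points to the original."""

    def __init__(self, name, target):
        super().__init__(name)
        self.target = target


def unwrap(declaration):
    """Follow alias links until a non-alias declaration is reached."""
    seen = set()
    while isinstance(declaration, AliasDeclaration):
        if id(declaration) in seen:
            raise ValueError("alias cycle")
        seen.add(id(declaration))
        declaration = declaration.target
    return declaration


std_string = Declaration("std::string")
first_alias = AliasDeclaration("str_t", std_string)
second_alias = AliasDeclaration("text_t", first_alias)  # alias of an alias

symbols = {
    "string": [std_string],
    "str_t": [first_alias],
    "text_t": [second_alias],
}
resolved = unwrap(symbols["text_t"][0])
```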