clice-project / clice

MIT License

Support header context #4

Open 16bit-ykiko opened 1 week ago

16bit-ykiko commented 1 week ago

Issues in clangd:

All that #include does in C/C++ is copy the contents of the included file to the location of the directive. Only .c/.cpp files (translation units, i.e. TUs) participate in the final compilation and appear in compile_commands.json with a corresponding command.

As we all know, clangd is based on clang: we need to run the clang frontend on a given source file to get an AST or code completion results, and then we can respond to LSP requests. For .cpp files this is trivial. We just compile the file as the clang driver normally would. The only difference is that we stop after generating the AST and do not go on to generate LLVM IR.

But what about header files? How does clangd deal with them? clangd simply treats a header file as a translation unit and generates an AST for it. Compilation commands are guessed from a source file, e.g. based on a file name match. This simple approach works, but it is far from complete!

Since a header file is only part of a source file, its AST likely depends on the preceding text, so the same header may have different ASTs in different translation units. For example:

// a.h
#ifdef TEST
struct X { ... };
#else
struct Y { ... };
#endif

// b.cpp
#define TEST
#include "a.h"

// c.cpp
#include "a.h"

It's obvious that a.h has a different AST in b.cpp and in c.cpp. Currently, clangd can only get the AST seen in c.cpp, i.e. it treats a.h as a standalone TU. A more extreme example is a non-self-contained header file (a file that cannot be compiled on its own). In the example above, although only one AST is used, at least it works. Consider the following example:

// a.h
struct Y { 
    X x;
};

// b.cpp
struct X {};
#include "a.h"

clangd will emit a compilation error for a.h, because it cannot find the definition of X, which is defined in its header context, b.cpp.

This can be really frustrating. We should support checking, looking up and switching the context of a header file!
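To make the first example concrete, here is a minimal sketch of why the same header yields different declarations in different contexts. It is not clang's preprocessor: `activeLines` is a hypothetical helper that only understands `#define NAME`, `#ifdef`, `#else` and `#endif`, and it takes the macros defined by the including file as an explicit set.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the non-directive lines of `header` that
// survive conditional inclusion when the including file has already
// defined the macros in `defined`.
std::vector<std::string> activeLines(const std::string& header,
                                     std::set<std::string> defined) {
    std::vector<std::string> result;
    std::vector<bool> active{true};  // one entry per open #ifdef, plus the file itself
    std::istringstream in(header);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string directive, name;
        words >> directive >> name;
        if (directive == "#ifdef") {
            active.push_back(active.back() && defined.count(name) > 0);
        } else if (directive == "#else") {
            assert(active.size() > 1 && "#else without #ifdef");
            bool taken = active.back();
            active.pop_back();
            // The #else branch is live only if the enclosing region is live
            // and the #ifdef branch was not taken.
            active.push_back(active.back() && !taken);
        } else if (directive == "#endif") {
            assert(active.size() > 1 && "#endif without #ifdef");
            active.pop_back();
        } else if (directive == "#define") {
            if (active.back()) defined.insert(name);
        } else if (active.back() && !line.empty()) {
            result.push_back(line);
        }
    }
    return result;
}
```

With the a.h from the first example, passing `{"TEST"}` (the b.cpp context) keeps only the `struct X` line, while passing an empty set (the c.cpp context) keeps only the `struct Y` line.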

16bit-ykiko commented 1 week ago

Overall, we want to achieve the following effect: assuming both b and c include a and generate different ASTs for it, when jumping from file b to file a, b is used as the context; when jumping from file c to file a, c is used as the context.

Possible challenges:

Could we still build a preamble for a header with a given context?

The answer is yes; the only thing we need to do is compute the preamble bounds ourselves. For example, assume we have the following file:

#include <string>
#include <vector>

// ... a lot of code

#include <user.h>  <=

// ... a lot of code

The target header file is inside user.h. So we can build all the code before the user.h include as the preamble and cut off all code after it to improve performance. In the end, we can use the AST to render LSP responses for the header file. For code completion, things are similar: build the same preamble and set FrontendOpts::CodeCompletionAt, like this:

auto& completion = instance->getFrontendOpts().CodeCompletionAt;
completion.FileName = filepath;
completion.Line = line;
completion.Column = column;

The filepath is the path of the header file (which clangd treats as the main file). Then it works perfectly. Code completion cuts off the rest of the source file automatically, so we don't need to do anything else.
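A rough sketch of computing the preamble bound ourselves, under strong simplifying assumptions: `preambleBoundBefore` is a hypothetical name, and it finds the include by plain string matching on each line. A real implementation would presumably rely on clang's lexer so that comments, macros and conditional blocks are handled correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the line that includes
// `header` in `source`, i.e. everything before this offset could be built
// as the preamble. Returns std::string::npos if no such include is found.
std::size_t preambleBoundBefore(const std::string& source,
                                const std::string& header) {
    assert(!header.empty() && "header name must not be empty");
    const std::string quoted = '"' + header + '"';
    const std::string angled = '<' + header + '>';
    std::size_t lineStart = 0;
    while (lineStart < source.size()) {
        std::size_t lineEnd = source.find('\n', lineStart);
        if (lineEnd == std::string::npos) lineEnd = source.size();
        const std::string line = source.substr(lineStart, lineEnd - lineStart);
        const std::size_t first = line.find_first_not_of(" \t");
        const bool isInclude = first != std::string::npos &&
                               line.compare(first, 8, "#include") == 0;
        if (isInclude && (line.find(quoted) != std::string::npos ||
                          line.find(angled) != std::string::npos)) {
            return lineStart;  // the preamble ends right before this directive
        }
        lineStart = lineEnd + 1;
    }
    return std::string::npos;
}
```

For the file shown above, the returned offset points at the start of the `#include <user.h>` line, so `source.substr(0, offset)` is the text that would go into the preamble.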

How does the server maintain the state of multiple contexts?

For every header file, unlike a TU, the server will use a StringMap to track all of its contexts, i.e. every context has its own AST, and to record the currently active context. If the client sends a go-to request such as textDocument/declaration and the destination location is in a header file, the server will try to switch the current context of that header to the file from which the request was issued. In this way, we can achieve our goal.

Besides, the server provides the extension requests headerContext/current, headerContext/all and headerContext/switch to allow users to proactively query and switch header contexts.
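The bookkeeping described above could look roughly like the sketch below. Everything here is an assumption for illustration: `HeaderContextTable` and its member names are hypothetical, `std::map` stands in for the StringMap, and a context is represented only by the path of the including file, whereas a real server would also keep an AST per context.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-header state: the known contexts and the active one.
struct HeaderState {
    std::vector<std::string> contexts;  // paths of including files
    std::size_t active = 0;             // index into `contexts`
};

class HeaderContextTable {
public:
    // Registers `includer` as a context of `header` (no duplicates).
    void addContext(const std::string& header, const std::string& includer) {
        auto& contexts = states_[header].contexts;
        for (const auto& known : contexts)
            if (known == includer) return;
        contexts.push_back(includer);
    }

    // Would back a headerContext/all request.
    std::vector<std::string> all(const std::string& header) const {
        auto it = states_.find(header);
        return it == states_.end() ? std::vector<std::string>{}
                                   : it->second.contexts;
    }

    // Would back a headerContext/current request; empty if unknown.
    std::string current(const std::string& header) const {
        auto it = states_.find(header);
        if (it == states_.end() || it->second.contexts.empty()) return "";
        assert(it->second.active < it->second.contexts.size());
        return it->second.contexts[it->second.active];
    }

    // Would back headerContext/switch and go-to requests coming from
    // `includer`. Returns false if `includer` is not a known context.
    bool switchTo(const std::string& header, const std::string& includer) {
        auto it = states_.find(header);
        if (it == states_.end()) return false;
        auto& contexts = it->second.contexts;
        for (std::size_t i = 0; i < contexts.size(); ++i) {
            if (contexts[i] == includer) {
                it->second.active = i;
                return true;
            }
        }
        return false;
    }

private:
    std::map<std::string, HeaderState> states_;
};
```

When a go-to request issued from c.cpp lands in a.h, the server would call `switchTo("a.h", "c.cpp")` before answering further requests for a.h.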

How can we find all possible header contexts for one file?

  1. If the project is pre-indexed, the server can get the dependency graph of the project and derive the contexts from it. Otherwise, it will just search the active TUs.
  2. Users can configure extra header contexts for specific files.
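Assuming the index gives us a plain include graph, the first option amounts to walking that graph backwards from the header. The sketch below uses hypothetical names (`IncludeGraph`, `findHeaderContexts`) and assumes the set of translation units is known separately, e.g. from compile_commands.json.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: file -> files it includes directly.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;

// Returns every translation unit that includes `header`, directly or
// through other headers.
std::set<std::string> findHeaderContexts(
    const IncludeGraph& graph,
    const std::set<std::string>& translationUnits,
    const std::string& header) {
    assert(!header.empty() && "header path must not be empty");

    // Invert the edges: file -> files that include it directly.
    std::map<std::string, std::vector<std::string>> includedBy;
    for (const auto& entry : graph)
        for (const auto& included : entry.second)
            includedBy[included].push_back(entry.first);

    // Walk upwards from the header; `visited` guards against include cycles.
    std::set<std::string> contexts, visited{header};
    std::vector<std::string> worklist{header};
    while (!worklist.empty()) {
        const std::string file = worklist.back();
        worklist.pop_back();
        if (translationUnits.count(file)) contexts.insert(file);
        auto it = includedBy.find(file);
        if (it == includedBy.end()) continue;
        for (const auto& includer : it->second)
            if (visited.insert(includer).second) worklist.push_back(includer);
    }
    return contexts;
}
```

Without a pre-built index, the same walk could be restricted to the include graphs of the currently active TUs, and user-configured contexts could simply be merged into the result.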

Further improvement

  • There is a technique called chained precompiled headers, which means we can use one PCH to build another and so reduce the granularity of incremental compilation. I think this could be used when building the preamble of a header with a context.
  • It's common for one header to be included by many source files. For example, llvm/ADT/SmallVector.h is included by nearly every source file in LLVM. It is a self-contained file, so we had better distinguish such files from the others to reduce memory usage.

16bit-ykiko commented 1 day ago

It's also possible for a header file to be included multiple times in one source file. For a header file with a guard macro or #pragma once, the second inclusion generates nothing. Others, e.g. TokenKinds.def in LLVM, are designed to be included multiple times. So it's necessary to support switching the header context within the same file. Luckily, clang tracks the include location of each token, which means that we can distinguish them easily.

For example:

// test.h
struct X;

// test2.h
#include "test.h"

// test.cpp
#include "test.h"
#include "test2.h"

We will get two CXXRecordDecl in the final AST, and both dump as test.h:1:1. But we can use SourceManager::getIncludeLoc to track the include stack. For the first decl, the result is test.cpp:1:10. For the second, the result is test2.h:1:10; calling it again gives test.cpp:2:10.
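The walk up the include stack can be modelled without clang. In the sketch below, `Inclusion` and `includeStack` are hypothetical stand-ins: each table entry plays the role of one entered file, and following `parent` corresponds to calling SourceManager::getIncludeLoc again on the including file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one entered file (one per inclusion, so a header
// included twice appears twice).
struct Inclusion {
    std::string file;        // file entered by this inclusion
    std::string includeLoc;  // location of the #include directive, "" for the main file
    int parent;              // index of the including entry, -1 for the main file
};

// Returns the include locations from the innermost directive outwards.
std::vector<std::string> includeStack(const std::vector<Inclusion>& table,
                                      int id) {
    assert(id >= 0 && static_cast<std::size_t>(id) < table.size());
    std::vector<std::string> stack;
    while (table[id].parent != -1) {
        stack.push_back(table[id].includeLoc);
        id = table[id].parent;
    }
    return stack;
}
```

For the test.cpp example, the two inclusions of test.h get the stacks `test.cpp:1:10` and `test2.h:1:10, test.cpp:2:10`, which is enough to tell the two `struct X` declarations apart.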

By the way, there is also a function called HeaderSearch::isFileMultipleIncludeGuarded, which can be used to determine whether a header file has a guard macro or #pragma once.