Support non-self-contained files

kadircet commented 5 years ago

Currently clangd doesn't work with non-self-contained files, even if people have a way to build their projects with those files, for example by merging some/all of non-self-contained files into a single TU and building that file.

bstaletic commented 5 years ago

For posterity, ycmd/libclang allows this by allowing users to return something like { 'flags': list_of_flags, 'override_filename': some_file_name } from the extra confs.

That makes the TU main file be the value of override_filename, but completion is still requested for the original filename. As for the libclang API, this means:

clang_parseTranslationUnit2FullArgv gets passed the value of override_filename
clang_codeCompleteAt gets passed the original filename.

Barricade commented 4 years ago

Really waiting for this feature to release!

bstaletic commented 4 years ago

For the record, YCM has just heard from another users who needed override_filename. This feature can also be supported for users who use compile_commands.json, if the comp db specification can be extended.

[
  {
    "directory": "/home/bstaletic/test/",
    "command": [ "/usr/bin/c++", "-c", "bar.cpp" ],
    "file": "/home/bstaletic/test/bar.cpp",
    "override_filename": "/home/bstaletic/test/foo.cpp"
  }
]

In the above example:

directory is the same as it always was.
command is the same - contains the actual command line.
file would contain the file containing the cursor.
override_filename would contain the TU to be parsed

This is just an idea that, frankly, hasn't been given much thought and it doesn't address the problem of wiring this throughout the clangd code.

Considering that the latest cmake has implemented support for unity builds and that the comp db generation is broken when used with unity builds, override_filename should probably be considered again.

n00bmind commented 4 years ago

+1000 to this. I think unity builds are gaining more and more traction, and this would massively help a lot of people.

intractabilis commented 4 years ago

Documentation to llvm 8 said that to achieve what people ask in this bug, one needs to use clangd-indexer. These lines disappeared from llvm 10 documentation, it just says that clangd-indexer is for developers.

Why this feature is so complicated?

HighCommander4 commented 4 years ago

Documentation to llvm 8 said that to achieve what people ask in this bug, one needs to use clangd-indexer. These lines disappeared from llvm 10 documentation, it just says that clangd-indexer is for developers.

AFAIK, clangd-indexer has never supported the use cases described in this bug. The "project-wide view" mentioned in the linked doc just refers to e.g. being able to perform Find References and get results in other translation units.

Mention of clangd-indexer has been removed because that functionality is now built into clangd via the background indexing feature.

sam-mccall commented 4 years ago

Yes, what Nathan said.

Why this feature is so complicated?

A few main reasons:

clangd relies heavily on the preamble optimization: splitting the file in two parts which can be built separately. This is essential for performance (10-1000x speedup on common operations depending on the codebase) and greatly complicates the compilation model. Cross-file preambles will complicate it further.
preamble building is a part of clang, it's complicated and entangled with other features, none of us understand it well enough that making changes is straightforward.
clangd relies in many places on clang's notion of the main file, untangling the concepts of "interesting file" and "entrypoint" is a lot of work, and we don't have great ideas for isolating the complexity
(dons flame-retardant suit) Self-contained files are good software engineering practice (and don't prevent unity builds AFAIK). Most clangd developers are paid by employers who enforce this practice - in part to make tools easier to write. We do spend lots of time on things to make clangd more well-rounded in general even outside what $employer cares about directly, but it affects prioritization for sure, especially for features like this that are probably many months of work.

intractabilis commented 4 years ago

Self-contained files are good software engineering practice

Maybe, but I can't rewrite all C++ libraries. Jumping to definitions is most useful when exploring unfamiliar external code. I have no influence to force all the authors to implement good practices.

intractabilis commented 4 years ago

Wait, Nathan? Eclipse CDT supporter, who suggested to use clangd instead of CDT's cpp-indexer? You actually work on clangd? :)

bstaletic commented 4 years ago

(and don't prevent unity builds AFAIK)

They kinda do. If your entire codebase is "shoved" into a single TU, your compile_commands.json contains only that single entry which may or may not work for other directories in that codebase.

I've heard that people work around this problem by running cmake twice (now that cmake actually has built-in support for unity builds). Once with unity off, to get a working compile_commands.json and once with unity on, to compile the project. This is of no use to those who don't use cmake.

HighCommander4 commented 4 years ago

Wait, Nathan? Eclipse CDT supporter, who suggested to use clangd instead of CDT's cpp-indexer? You actually work on clangd? :)

I contribute to it, yeah :) I switched from Eclipse CDT about a year ago.

HighCommander4 commented 4 years ago

Self-contained files are good software engineering practice (and don't prevent unity builds AFAIK).

Clangd does not currently work for unity builds (see e.g. https://github.com/clangd/clangd/issues/147#issuecomment-529223278). Perhaps you mean support for them could be added more easily than by solving the challenging issues you described?

Most clangd developers are paid by employers who enforce this practice

To be fair, LLVM itself does use non-self contained files at times. I'm thinking of things like #include "clang/Basic/TokenKinds.def".

sam-mccall commented 4 years ago

Clangd does not currently work for unity builds (see e.g. #147 (comment)).

Right, but if you have self-contained files, you don't have to always do a unity build with them. Wouldn't you only want to do the unity build as a final heavily-optimized, slow-building, non-incremental release build, but use a regular build for development?

To be fair, LLVM itself does use non-self contained files at times

LLVM does lots of questionable things :-) Tablegen .def format falls more into the "stupid pp tricks" bucket than sensible C++ code we're just having trouble parsing, though.

puremourning commented 4 years ago

but use a regular build for development?

one of the main purposes of jumbo TU is to reduce compile time where it's dominated by IO (particularly includes, etc.). At work i benchmarked speedup comparable to the preamble trick on compiling a libarary as a single jumbo vs -j on a 40 core system. in this case, dominating factor was the SAN

HighCommander4 commented 4 years ago

Wouldn't you only want to do the unity build as a final heavily-optimized, slow-building, non-incremental release build, but use a regular build for development?

No, some projects use unity builds for build performance (since it means common includes only have to be parsed once per unified-TU rather than once per individual-TU).

Some projects take this a step further and only support unity builds, since non-unity builds can easily break due to include-what-you-use violations if devs are doing unity builds locally, and the project may not have the CI resources to test both configurations. Such projects don't even have the option of doing a non-unity build specifically for clangd's consumption.

intractabilis commented 4 years ago

What is unity build?

intractabilis commented 4 years ago

I switched from Eclipse CDT about a year ago

I am switching now. Especially after C++20 CDT became unbearable.

HighCommander4 commented 4 years ago

What is unity build?

It's a technique where groups of source files are combined (by e.g. #include-ing several .cpp files into one file), and the resulting files are compiled, rather than the individual source files. It has several advantages, such as common includes only having to be parsed once per combined source file, more inlining / inter-procedural optimization opportunities, etc. (And also downsides, such as a greater potential to introduce undetected include-what-you-use violations, and incremental builds potentially being slower.)

intractabilis commented 4 years ago

It's a technique where groups of source files are combined (by e.g. #include-ing several .cpp files into one file), and the resulting files are compiled, rather than the individual source files.

Wow. Never've heard of this technique.

abpostelnicu commented 4 years ago

@sam-mccall so just to understand this correctly, for the moment this is very hard to implement because of https://github.com/clangd/clangd/issues/45#issuecomment-640496002. Does what you said earlier also apply in an unified-build environment?

Trass3r commented 3 years ago

(and don't prevent unity builds AFAIK)

They kinda do. If your entire codebase is "shoved" into a single TU, your compile_commands.json contains only that single entry which may or may not work for other directories in that codebase.

I've heard that people work around this problem by running cmake twice (now that cmake actually has built-in support for unity builds). Once with unity off, to get a working compile_commands.json and once with unity on, to compile the project. This is of no use to those who don't use cmake.

Vote for https://gitlab.kitware.com/cmake/cmake/-/issues/19826

Volker-Weissmann commented 3 years ago

My two cents about this:

You should definitely support non-self contained files. The C++ standard allows it and many big projects (e.g OpenFOAM a.k.a. my usecase) do it. And e.g. the rust language server support non-self contained files in rust.
As long as you don't have support for non-self contained files, you need to document it. Claiming that clangd can parse C++ is false advertisement, because non-self contained files are part of the C++ standard. You should replace "clangd understands your C++ code" on your website with "clangd understands your C++ code, if you follow these(link here) coding guidlines." Nobody, except for a very small minority, knows this limitation of clangd, but it is important to consider this limitation when choosing your tool. (For example the guys in r/cpp don't know this: https://www.reddit.com/r/cpp/comments/j57se3/tool_for_go_to_definition/ .)
As long as you don't have support for non-self contained files, you need to faIl LOUDLY. Currently you fail super-silently. Clangd logs ... Updating file .../lib.cpp with command inferred from .../main.cpp which means that clangd nows that lib.cpp is not self-contained. If you notice that a file is not self contained, you need to write an error message in the logfile with a link to this issue. It would be even better if trying to "Go to Definition" in a non-self contained file would show "This is not a self contained file, see https://github.com/clangd/clangd/issues/45" instead of ""No definition found for 'func". Otherwise it will waste a lot of time for everyone trying to figure out why it is not working.

Sincerely, someone who is pissed off, because he just spent 3 hours creating a minimal example of this bug after noticing that "Go to Definition" does not work for one specific function in a big program.

Slightly OT: Do you know a tool, similar to clangd that even works for non-self-contained files?

bstaletic commented 3 years ago

which means that clangd nows that lib.cpp is not self-contained.

Ranting aside, no. That only means that lib.cpp wasn't mentioned in your compile_commands.json. I don't think there is a way to detect that a file is not sef-contained, other than being told by the user.

Slightly OT: Do you know a tool, similar to clangd that even works for non-self-contained files?

The only thing that I know of that works with non-self-contained files in this regard is ycmd, with the legacy libclang based completion engine (i.e. not ycmd + clangd). (Yes, "shameless plug" and all that.) However, I did call it "legacy" for a reason - there are things that clangd suppports, but ycmd+libclang don't.

n00bmind commented 3 years ago

Funny how a request to elevate quality standards, complete with some actionable points (even if not completely on target) is quickly dismissed as "ranting". Says quite a bit about the state of software these days, imo.

Aaanyway, I briefly tried ccls in the past (https://github.com/MaskRay/ccls), and it seemed to at least try in regards to unity builds. The project I did test it on is pretty simple so I cannot really say how thorough that support is though. I had some performance related troubles using it from Vim and so had to look for something else.. might be worth a try in any case.

intractabilis commented 3 years ago

Slightly OT: Do you know a tool, similar to clangd that even works for non-self-contained files?

Ironically Microsoft's language server (https://github.com/Microsoft/vscode-cpptools) works flawlessly even for clang and GCC in Linux.

Volker-Weissmann commented 3 years ago

Slightly OT: Do you know a tool, similar to clangd that even works for non-self-contained files?

Ironically Microsoft's language server (https://github.com/Microsoft/vscode-cpptools) works flawlessly even for clang and GCC in Linux.

It also does not support including *.cpp files: https://github.com/microsoft/vscode-cpptools/issues/6250

Volker-Weissmann commented 3 years ago

Aaanyway, I briefly tried ccls in the past (https://github.com/MaskRay/ccls), and it seemed to at least try in regards to unity builds. The project I did test it on is pretty simple so I cannot really say how thorough that support is though. I had some performance related troubles using it from Vim and so had to look for something else.. might be worth a try in any case.

I'm currently using it. It supports non-self-contained files, but fails in some other cases.

playgithub commented 3 years ago

clangd is better to depend on intermediate result when compiling, I think it should also be tolerant enough for errors, for it is to assist developers, not for building. I don't think it's very hard for compiler experts.

Volker-Weissmann commented 2 years ago

clangd is better to depend on intermediate result when compiling,

That would mean that you have to build it (slow) before you can use clangd. That being said, that wouldn't be that big of a problem.

playgithub commented 2 years ago

clangd is better to depend on intermediate result when compiling,

That would mean that you have to build it (slow) before you can use clangd. That being said, that wouldn't be that big of a problem.

Optimizations can be done, e.g.

the compiler works in background threads
partial data produced by the compiler should be useful to assist developers, and it should support incrementality
higher priority for recent used files
cache in memory
no compiling optimization needed
...

It's just possible, and not hard to achieve the results like Visual Stduio.

playgithub commented 2 years ago

Without using vscode-cpptools, is there a solution for the problem on Linux now?

bstaletic commented 2 years ago

As I've written before, ycmd/libclang supports these use cases, but with libclang we can't do indexing, so no rename, references... ycmd does have a vscode client, but I have no idea how or even if it is maintained. Frankly we still support libclang only because of this missing feature.

sam-mccall commented 2 years ago

A non-update here: this is a hard problem, and we're open to solutions but (AFAIK) none of the active contributors are committed to working on this.

If someone wants to attempt this problem, there's more to do than implementation and it'd be worth discussing design here first.

The design needs to solve some hard constraints:

There needs to be a clear scope and idea of how it'll solve the problems. For instance:
- does it support files with parse context (e.g. snippets of class bodies like ConfigFragment.inc), or just PP context? If parse context, how is this persisted or recreated? (PCH can't do this today). How does this interact with preambles?
- how is the file-that-provides-context identified? Is the existing index actually good/available enough for this?
it needs to not add too much complexity, or at least isolate the complexity to a few places. "Priorities and effort" also means this can't become a maintenance timesink.
it needs to be useful/powerful enough to justify the complexity that adds. (This ~certainly means it needs to be safe to turn on by default eventually)

I'm not certain it's possible to satisfy all these, but it seems like it might be possible.

sam-mccall commented 2 years ago

FWIW my best guess is:

we'd want to build a preamble for the entrypoint file and then parse everything from there. This involves eliminating from many places the assumptions that a) the preamble is part of the file we care about, b) the file we care about is the "main file" as far as SourceManager is concerned.
a significant challenge with this solution is the performance where the fragment file has #includes which are not part of the entrypoint file's preamble
the include graph in the background index or the includer cache in TUScheduler are a reasonable initial step for identifying entrypoints but we'll eventually want something better built by fast include-scanning.

bstaletic commented 2 years ago

it needs to not add too much complexity, or at least isolate the complexity to a few places. "Priorities and effort" also means this can't become a maintenance timesink.

Ycmd's implementation was not a timesink at all. In fact, no maintenance was needed after the initial PR. Our implementation does mean that needed a path to the main file now needs two paths - file path to be parsed and a file path of the cursor location.

puremourning commented 2 years ago

it needs to not add too much complexity, or at least isolate the complexity to a few places. "Priorities and effort" also means this can't become a maintenance timesink.

Ycmd's implementation was not a timesink at all. In fact, no maintenance was needed after the initial PR. Our implementation does mean that needed a path to the main file now needs two paths - file path to be parsed and a file path of the cursor location.

To confirm: the ycmd implementation to support this took me less than an hour. (apropos nothing; ycmd libclang engine is an invisible fraction of complexity compared with clangd)

johnhubbard commented 2 years ago

I just ran into this problem while trying to look at things like wake_up_var() in today's Linux kernel. This symbol is in a file (wait_bit.c) that is included from within another file (build_utility.c), and so VS Code + clangd cannot browse any of the symbols within wait_bit.c.

This is a serious limitation. If you look at the top of build_utility.c (and also look through "git blame build_utility.c"), you'll notice that this is an intentional, recent effort to reduce build times. And that means that my chances of changing this are basically zero--people want it this way.

So now I'm reduced to maintaining local hacks in order to browse a big chunk of the sched unit in the Linux kernel, with a fair chance that the damage will continue to spread, because this is apparently a popular way to speed up builds.

This is an extremely bad situation that will drive all of us back to horrible older techniques, unless clangd loses the "this is unholy and impure and bad software engineering, so we are not interested" attitude. Please reconsider that, because Real Projects are messy and impure!

HighCommander4 commented 2 years ago

unless clangd loses the "this is unholy and impure and bad software engineering, so we are not interested" attitude. Please reconsider that, because Real Projects are messy and impure!

I believe it's been clarified in later comments (particularly this one) that the project is open to contributions to improve support for non-self-contained files.

It's just that there are non-trivial technical challenges to overcome (as detailed in this comment and this one), and someone invested in improving support for these scenarios will need to put in the legwork to solve them.

johnhubbard commented 2 years ago

Very good. In that case, I'll tell my compiler engineering colleagues about this, and see if it sparks any interest.

xulongzhe commented 1 year ago

nearly 4 years past, is there any body work on it? we have lots of legacy code write as non-self-contained style, looking forward this feature, so we can migrate to vscode from VS or CLion as our main IDE. Vscode + Clangd is so good for daily deploping, is three any plan?

mlabbe commented 1 year ago

Since January I have successfully worked around this with a method I have not seen proposed anywhere else, as outlined in this blog post which I wrote at the time. https://www.frogtoss.com/labs/clangd-with-unity-builds.html

simon8233 commented 1 year ago

another workaround is -include="xxx.h" in compile_commands or clangd file

detly commented 9 months ago

Just to add a data point to this that is not about unity builds: I have seen and used this style for DSP and embedded projects, where

there is a C module exposing a small number of API functions in module_a/include/module_a.h
module_a/src/module_a.c contains those function with external linkage, but also extra helper functions with internal linkage (ie. static functions)
we want to unit test the helper functions, because they're sufficiently complex to warrant it

I mention DSP and embedded as a context because it relates to point #2. It seems like one solution to this would be to have the helper functions have external linkage, and have separate private header files for them (or just put them in a separate module). Unit tests could then use that private module/header's API.

But this means that the functions have to appear in the symbol table, and can't be inlined, or can be inlined but need to be duplicated as non-inlined versions, even though they're not meant to be public. It's bloat and pessimisation we want to avoid for DSP/embedded applications.

Another solution that avoids this is to consider that the unit tests for that module form a kind of single-TU project (or subproject), where only the main function (or top-level test runner function) is exposed, and it has the same set of static "helper" functions as the real project. And so your unit tests include the implementation file (module_a/src/module_a.c):

// module_a/test/test_module_a.c
#include "unit_test_harness_whatever.h"
#include "test_module_a.h"
// Note include of implementation file!
#include "module_a/src/module_a.c"

static void test_module_a_helper_x_but_not_y(void) { /* ... */ }

static void test_module_a_helper_annoying_corner_case(void) { /* ... */ }

void test_module_a_all(void) {
    RUN_TEST(test_module_a_helper_x_but_not_y);
    RUN_TEST(test_module_a_helper_annoying_corner_case);
}

Yes it's niche, yes it looks jarring and unidiomatic at first, but it's valid, it makes sense and works in these contexts, and I've seen it in real, professional, high-quality C code bases. (It also tests that the implementation file is appropriately self-contained!)

There are other solutions to this as well, eg. #define your linkage keyword to be static for library builds but empty for unit test builds. But that has drawbacks, and the benefits of support from a tool that hadn't been invented yet couldn't really be factored in to the decision (and probably aren't motivating enough to do retroactively).

puremourning commented 9 months ago

Well yes. This fundamentally applies to any TU which is composed of multiple files (including headers, c, cpp, inc, etc). Clangd models TU=file and as the others have said it's a big refactor to fix it.

detly commented 9 months ago

Well yes. This fundamentally applies to any TU which is composed of multiple files

You're not wrong, I just wanted to provide another data point to help avoid some kind of unity-build-specific solution (I have no idea how that would arise, but you never know).

sliedes commented 4 months ago

clangd relies heavily on the preamble optimization: splitting the file in two parts which can be built separately. This is essential for performance (10-1000x speedup on common operations depending on the codebase) and greatly complicates the compilation model. Cross-file preambles will complicate it further.

Ok, I'm trying to understand the problem a bit more deeply. Is this a fair restatement of what you said above?:

Let's say we have CPython, which is a bit of a mess of non-self-contained headers, but basically the external interface is that anyone using it is supposed to #include <Python.h>. Probably that header is approximately the only header in the project that can be assumed to be self-contained.

While it would be possible to devise a mechanism to allow the user to say "do not even try to parse this header as a TU, parse Python.h instead"; but this would necessarily result in abysmal performance at least when someone edits one of those headers, as most of the entire Python.h needs to be reparsed. (For users, of course, I'd assume that they don't edit the headers often. For CPython developers, they might.)

Or would it even affect performance for people just using <Python.h> without going deeper into the private headers?

preamble building is a part of clang, it's complicated and entangled with other features, none of us understand it well enough that making changes is straightforward.

clangd relies in many places on clang's notion of the main file, untangling the concepts of "interesting file" and "entrypoint" is a lot of work, and we don't have great ideas for isolating the complexity

Is this more of a problem of "where do we get that information"? I noticed that there's support for IWYU pragmas for specifying things like "this header is private; if you want its symbols, always include header X instead". That seems very much the same kind of configuration data that would be needed by this, but I think it's currently only used for flagging non-direct includes and automatically adding includes.

Or is it something more nuanced?

(dons flame-retardant suit) Self-contained files are good software engineering practice (and don't prevent unity builds AFAIK). Most clangd developers are paid by employers who enforce this practice - in part to make tools easier to write. We do spend lots of time on things to make clangd more well-rounded in general even outside what $employer cares about directly, but it affects prioritization for sure, especially for features like this that are probably many months of work.

No arguments here. I'm a big enough fan of self-contained headers and even C++ modules that I'd generally be willing to spend a non-trivial amount of time to clean up things. With projects like CPython, I suspect selling a major header file refactoring to CPython upstream might be an uphill battle, even if they see the value in good IDE support...

sliedes commented 4 months ago

FWIW my best guess is:

we'd want to build a preamble for the entrypoint file and then parse everything from there. This involves eliminating from many places the assumptions that a) the preamble is part of the file we care about, b) the file we care about is the "main file" as far as SourceManager is concerned.

a significant challenge with this solution is the performance where the fragment file has #includes which are not part of the entrypoint file's preamble

the include graph in the background index or the includer cache in TUScheduler are a reasonable initial step for identifying entrypoints but we'll eventually want something better built by fast include-scanning.

I now see that some of my previous comment's questions were addressed in this comment, which I somehow missed on the first skim... Sorry about the noise!

clangd / clangd

Support non-self-contained files #45