Quuxplusone / LLVMBugzillaTest

Clangd should infer compile command for headers #35872

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR36899
Status NEW
Importance P enhancement
Reported by Sam McCall (sammccall@google.com)
Reported on 2018-03-26 01:15:15 -0700
Last modified on 2019-05-20 05:06:08 -0700
Version unspecified
Hardware PC Linux
CC klimek@google.com, llvm-bugs@lists.llvm.org, marc-andre.laperle@ericsson.com, simon.marchi@polymtl.ca, tdhutt@gmail.com, zeratul976@hotmail.com
Fixed by commit(s)
Attachments example.tgz (523 bytes, application/x-compressed-tar)
Blocks
Blocked by
See also
If you have a compilation database (e.g. compile_commands.json) that provides
compile commands for headers, then clangd works as expected. However, most
build systems don't provide this information (Bazel, for one, does).
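For build systems that don't, one workaround is to add header entries to compile_commands.json by hand. A hypothetical entry is sketched below; the paths and flags are invented for illustration and would need to match your project:

```json
[
  {
    "directory": "/home/user/project",
    "file": "include/widget.h",
    "command": "clang -x c-header -Iinclude -DPROJECT_DEBUG=1 -fsyntax-only include/widget.h"
  }
]
```

The `-x c-header` (or `c++-header`) flag is what stops clang from guessing the language from the .h extension.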

When there's no compile command provided, we fall back to 'clang $filename'.
Clang treats .h files as C.
But the fallback is also missing a bunch of other information: include paths,
defines, etc. that are likely required to get useful results.
So I don't think just diverging from clang here will actually help many
projects (feel free to try this out by editing compile_commands.json - if this
works for you, it'd be good to know).

What can we do then? A few ideas:
 - we can preprocess all the files from compile_commands.json on startup
(https://reviews.llvm.org/D41911 is the start of this). But we can't guarantee
we get to the file you care about in time, so behavior will be erratic.
 - we can pick a compile command arbitrarily and take its flags, which will get
include paths and defines right if they're uniform across the project
 - we can just refuse to provide any diagnostics/completions where we don't
have a known-good set of flags.

There are edge cases where we'll never do a good job: non-modular headers that
can't be compiled standalone, or headers that get included in different
contexts and behave differently in each. It's probably not worth worrying about
these at all.
Quuxplusone commented 6 years ago

There are edge cases where we'll never do a good job: non-modular headers that can't be compiled standalone, or headers that get included in different contexts and behave differently in each. It's probably not worth worrying about these at all.

I don't think these use cases are that rare. In the GDB source code, for example, it is the convention that each .c file first includes "defs.h", a file that provides some very common types and definitions. Other headers don't include this file; they simply assume that the "CORE_ADDR" type is available, for example. I know this goes against the "include what you use" principle, but that's how some projects work in practice, and it would be nice to support them.
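A single-file sketch of that convention (a hypothetical reconstruction; real GDB headers differ) shows why such headers only compile in a specific order:

```cpp
// --- contents of a defs.h-like file (included first by every .c file) ---
typedef unsigned long CORE_ADDR; // common type every other header assumes

// --- contents of a breakpoint.h-like file: it uses CORE_ADDR without
// including defs.h, so it is not standalone-compilable; it only works
// when the defs.h text precedes it, as described above ---
struct breakpoint {
  CORE_ADDR address;
};

inline CORE_ADDR bp_addr(const breakpoint &b) { return b.address; }
```

Feeding breakpoint.h alone to `clang bar.h` would fail with "unknown type name 'CORE_ADDR'", which is exactly the non-standalone-header case this thread discusses.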

We could analyze headers in the context where they are used. If the header file being opened is included in other files, we can choose one of these files, and do the analysis on that file instead. We would then only emit the diagnostics that are in the opened header file. When doing "findDefinitions", we would also work with the AST of the file including our header file. That would give us a view of the header file as included from a given compilation unit.

Eventually, I'd like it if we could let the user switch between the different "views" of a header file. If foo.h is included in foo.c and bar.c, it would be cool if the user could switch between "foo.h seen as included from foo.c:10" and "foo.h seen as included from bar.c:15", which can give different results.

Quuxplusone commented 6 years ago

This might be a difference between C and C++ - C++ as a language is moving in the direction of modules, which have include-what-you-use semantics. Code bases that want to move towards modules do have strict rules for headers today, and want to see and work with headers under the assumption that they stand on their own.

Quuxplusone commented 6 years ago
It would be nice to support such cases.

It doesn't fit easily into the current model of "a file has a command that can
be used to compile it" - we need to ship around preprocessor state or even a
preamble. So this is probably a fair amount of effort.

Any include-scanning-based method isn't going to cover cases where files
outside the compilation database are added or renamed. (In fact this goes for
implementation files too, not just headers).

So I'm leaning toward *in any case* adding some path-based heuristics for
matching compilation database entries, even if we want an include-scanning
approach. In terms of sequencing, the path heuristics are less complex/invasive
and will show us how much of the problem remains, so they seem like a good
first step.

> Eventually, I'd like it if we could let the user switch between the different
"views" of a header file.

This definitely seems useful in the context of non-modular headers, and is in
some sense needed for correct behavior in that case. How do you see it mapping
onto LSP?
Quuxplusone commented 6 years ago
Experiences/suggestions for heuristic approach on the parallel mail thread:
http://lists.llvm.org/pipermail/cfe-dev/2018-March/057386.html
Quuxplusone commented 6 years ago
(In reply to Manuel Klimek from comment #2)
> This might be a difference between C and C++ - C++ as a language is moving
> in the direction of modules, which have include-what-you-use semantics. Code
> bases that want to move towards modules do have strict rules for headers
> today, and want to see and work with headers under the assumption that they
> stand on their own.

That's true. We definitely want to support older projects and ones that do not
adopt modules or include-what-you-use semantics though.

(In reply to Sam McCall from comment #3)
> Any include-scanning-based method isn't going to cover cases where files
> outside the compilation database are added or renamed. (In fact this goes
> for implementation files too, not just headers).

The index (index-store or otherwise) is going to contain all file dependency
information and I was thinking of watching those files, server side. The only
case that wouldn't work nicely (I think) is when a header is added and an
unresolved include in a source file becomes resolved without any changes to
that source file. But I think it's pretty rare and as soon as the file is
compiled, the index-store's unit will be modified and clangd will be up to date
again. I can think of other ways to mitigate that too (with the assistance of
the client for example).

> So I'm leaning toward *in any case* adding some path-based heuristics for
> matching compilation database entries, even if we want an include-scanning
> approach. In terms of sequencing, the path heuristics are less
> complex/invasive and will show us how much of the problem remains, so they
> seem like a good first step.

Completely agree! The heuristic that is now being done is a huge improvement!
Do you think it's OK if I make a proof-of-concept with the include-scanning
approach for the longer term? Or would you really prefer not to use that approach?
Quuxplusone commented 6 years ago
(In reply to Marc-André Laperle from comment #5)
> (In reply to Manuel Klimek from comment #2)
> > This might be a difference between C and C++ - C++ as a language is moving
> > in the direction of modules, which have include-what-you-use semantics. Code
> > bases that want to move towards modules do have strict rules for headers
> > today, and want to see and work with headers under the assumption that they
> > stand on their own.
>
> That's true. We definitely want to support older projects and ones that do
> not adopt modules or include-what-you-use semantics though.

That makes sense. There's lots of tradeoffs here, and questions of how far you
go down this path. For example:
 - describing the interpretation of a file using a compiler invocation with a set of flags is a fairly simple and transparent model. It can approximate some interesting preprocessor state (e.g. injecting -D and -include flags) but not perfectly, and not everything.
 - mapping each file onto a single interpretation isn't completely accurate, but is easier for clangd, editors, and users to understand in most cases.

So my own bias is to aim for a simple basic design centered around (dare I say
it) modern C++, and make tactical concessions to the other things we see in
practice. But examples will enlighten here.

> (In reply to Sam McCall from comment #3)
> > Any include-scanning-based method isn't going to cover cases where files
> > outside the compilation database are added or renamed. (In fact this goes
> > for implementation files too, not just headers).
>
> The index (index-store or otherwise) is going to contain all file dependency
> information and I was thinking of watching those files, server side. The
> only case that wouldn't work nicely (I think) is when a header is added and
> an unresolved include in a source file becomes resolved without any changes
> to that source file. But I think it's pretty rare and as soon as the file is
> compiled, the index-store's unit will be modified and clangd will be up to
> date again. I can think of other ways to mitigate that too (with the
> assistance of the client for example).
This all sounds good.

(The problem I was alluding to above is creating foo.cpp and starting to edit
it without adding any build rules for it - we have no root, so include scanning
does nothing there).

>
> > So I'm leaning toward *in any case* adding some path-based heuristics for
> > matching compilation database entries, even if we want an include-scanning
> > approach. In terms of sequencing, the path heuristics are less
> > complex/invasive and will show us how much of the problem remains, so they
> > seem like a good first step.
>
> Completely agree! The heuristic that is now being done is a huge
> improvement! Do you think it's OK if I make a proof-of-concept with the
> include-scanning approach for the longer term? Or you really prefer not
> using that approach?

Definitely in favor of this.
In fact I did a prototype of this myself (https://reviews.llvm.org/D41911, plus
a little ad-hoc code to hook it up to the GlobalCompilationDatabase).

Some things to consider:
 - we should understand how this is going to play with changes, both to source files and to the compilation database itself. Getting clangd to pick up compilecommand changes is a high priority IMO, which I'd like to address soon.
 - you can include-scan LLVM in a minute or so (run PP only), but if you want to store the results in the index it complicates the consistency, as other index data needs a real compile (much slower)
 - there are CompilationDatabase implementations where getCompileCommands is slow (this is currently the only opportunity to prepare genfiles for large code bases). These are probably also the ones where getAllCompileCommands() doesn't return anything, so it's probably ok :)
Quuxplusone commented 6 years ago
(In reply to Sam McCall from comment #6)
> So my own bias is to aim for a simple basic design centered around (dare I
> say it) modern C++, and make tactical concessions to the other things we see
> in practice. But examples will enlighten here.

I think the GDB example is a good one to test with.

> > (In reply to Sam McCall from comment #3)
> (The problem I was alluding to above is creating foo.cpp and starting to
> edit it without adding any build rules for it - we have no root, so include
> scanning does nothing there).

I think it's fair that the user would expect this not to work until the build
system knows about it. But perhaps, as a heuristic, it can try to find a sibling
source file (if any) and use its args.
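That sibling heuristic could be sketched roughly as below. The names (siblingFlags, CommandMap) and the flat in-memory database are invented for illustration; this is not clangd's actual implementation:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flat compilation database: file path -> compile flags.
using CommandMap = std::map<std::string, std::vector<std::string>>;

// Return the directory portion of a Unix-style path ("" if none).
static std::string dirOf(const std::string &Path) {
  auto Slash = Path.find_last_of('/');
  return Slash == std::string::npos ? std::string() : Path.substr(0, Slash);
}

// If NewFile has no compile command, borrow the flags of any known file in
// the same directory, on the assumption that siblings are built alike.
std::vector<std::string> siblingFlags(const CommandMap &DB,
                                      const std::string &NewFile) {
  const std::string Dir = dirOf(NewFile);
  for (const auto &Entry : DB)
    if (Entry.first != NewFile && dirOf(Entry.first) == Dir)
      return Entry.second;
  return {}; // no sibling found; caller falls back to the default command
}
```

A real implementation would also want to prefer the sibling with the closest name (e.g. same stem) and the same language, rather than an arbitrary directory-mate.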

> Definitely in favor of this.
> In fact I did a prototype of this myself (https://reviews.llvm.org/D41911,
> plus a little ad-hoc code to hook it up to the GlobalCompilationDatabase).

Interesting, I'll take a look!

> Some things to consider:
>  - we should understand how this is going to play with changes, both to
> source files and to the compilation database itself. Getting clangd to pick
> up compilecommand changes is a high priority IMO, which I'd like to address
> soon.

Yes, detecting compile command changes is higher priority for sure. In the end,
both kinds of changes would trigger a re-index on the affected file(s).

>  - you can include-scan LLVM in a minute or so (run PP only), but if you
> want to store the results in the index it complicates the consistency, as
> other index data needs a real compile (much slower)

I think a two-pass strategy would make sense, as mentioned in previous
discussions. Not sure it's compatible with the current index-store proposal,
though. I think it's worth experimenting with this.
Quuxplusone commented 6 years ago
Update from offline discussion:

A fully accurate model still based on CompileCommand would be:
 - find an including file
 - transfer its command for compiling the header
 - using VFS overlay, prepend the section of the including file up to the relevant #include

Doing this naively would mess up offsets, so some options we discussed instead:
 - put that prefix in a virtual file, and add -include to the header's compile command.
   (It's unclear how this will interact with preamble handling)
 - truncate the including file after the #include, and use that as the entrypoint
   (But we treat the main-file specially in lots of places)

It's probably possible to refine these ideas into a primitive that translates a
CompileCommand for an including file, and the content up to the #include, into
a CompileCommand for the included file. This would probably involve restoring
the overlaid files support removed in r303635.
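A rough sketch of what such a primitive might look like, under the assumption that the includer's own filename is the last argument of its command. CompileCommand and transferCommand here are simplified stand-ins, not clang's tooling API:

```cpp
#include <string>
#include <vector>

// Minimal stand-in for a compilation database entry.
struct CompileCommand {
  std::string Filename;
  std::vector<std::string> Args;
};

// Given the command for the *including* file and the path of a virtual file
// holding everything that precedes the #include, produce a command for the
// header itself: reuse the includer's flags, prepend the preamble text via
// -include (the first option discussed above), and force header mode so the
// .h extension isn't treated as C by default.
CompileCommand transferCommand(const CompileCommand &Includer,
                               const std::string &HeaderPath,
                               const std::string &PrefixVirtualFile) {
  CompileCommand Out;
  Out.Filename = HeaderPath;
  Out.Args = Includer.Args;
  if (!Out.Args.empty())
    Out.Args.pop_back(); // assumes the includer's filename is the last arg
  Out.Args.push_back("-include");
  Out.Args.push_back(PrefixVirtualFile);
  Out.Args.push_back("-x");
  Out.Args.push_back("c-header");
  Out.Args.push_back(HeaderPath);
  return Out;
}
```

The open questions from the thread remain: how this interacts with preamble handling, and how the virtual prefix file is kept consistent with the includer's current (possibly unsaved) contents.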
Quuxplusone commented 6 years ago

Attached example.tgz (523 bytes, application/x-compressed-tar): Example code

Quuxplusone commented 6 years ago
I made a small example (the attached example.tgz) and will try to explain how I
understand we would treat it.  This way we can confirm that we are all on the
same page, and have a concrete example to talk about.

Let's say you edit the "bar.h" file from the example.  clangd would find one
arbitrary location where this file is included.  In the example, there's only
one inclusion chain:

  from foo.h:2
  from main.c:4

Therefore, we would create a virtual file composed of all the contents of these
files (starting at the bottom of the stack) that precede the inclusion points,
followed by the content of bar.h itself.  So it would give:

~~~
#include <stdio.h>

#define FOO 3
#define BAR 4
static int bar(int y) {
    return y + FOO + BAR;
}
~~~

And that would be the file we would analyze in clangd.  However, to get the
right locations and line numbers in the diagnostics, we could add some line
markers:

~~~
# 1 "main.c"
#include <stdio.h>

#define FOO 3
# 1 "foo.h" 1
#define BAR 4
# 1 "bar.h" 1
static int bar(int y) {
    return y + FOO + BAZ;
}
~~~

Such that if I make a typo and misspell "BAR" as "BAZ" in bar.h, I'll get this
error message, with the right include stack:

In file included from main.c:4:
In file included from foo.h:2:
bar.h:2:19: error: use of undeclared identifier 'BAZ'
        return y + FOO + BAZ;
                         ^

Since the diagnostics will be influenced by which particular inclusion of that
header we choose, I think it's necessary to inform the user of the inclusion
path that led to this diagnostic.
Quuxplusone commented 5 years ago

Worth noting that Qt Creator handles this properly - it even detects if a header file is #included in multiple different contexts: https://doc.qt.io/qtcreator/creator-coding-navigating.html#selecting-parse-context

> Doing this naively would mess up offsets

Can't you use the #line directive to fix that?
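For reference, #line (like the GNU-style `# 1 "bar.h" 1` markers shown earlier) does remap the locations the compiler reports from that point on. A minimal demonstration via __LINE__ and __FILE__:

```cpp
#include <string>

// After a #line directive, the compiler believes it is at the stated
// line and file; diagnostics and __LINE__/__FILE__ follow suit. This is
// the same mechanism as the line markers in the merged-file example above.
#line 42 "bar.h"
inline int reported_line() { return __LINE__; }          // expands to 42
inline std::string reported_file() { return __FILE__; }  // "bar.h"
```

So any error emitted past the directive would point at bar.h:42 and onward, not at the physical position in the merged virtual file.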