Ericsson / CodeCompass

CodeCompass is a software comprehension tool for large scale software written in C/C++ and Java
https://codecompass.net
GNU General Public License v3.0
520 stars 102 forks source link

Change working directory of C++ parser #572

Open andocz opened 2 years ago

andocz commented 2 years ago

This PR makes Clang tool invocations use the working directories of each compilation command. This fixes relative paths in arguments not being handled properly (#571), as previously the tool ran in the default working directory of ..

Callers of SourceManager::getFile in cppparser were changed to provide an absolute path.

Additionally, inaccurate comments in SourceManager::getFile were corrected.

andocz commented 2 years ago

It turns out there's a bug I didn't catch: if the build directory doesn't exist, the parsing fails, because RealFileSystem requires that the chdir target exists. To fix this, I'll have to reimplement RealFileSystem without this check.

whisperity commented 2 years ago

It turns out there's a bug I didn't catch: if the build directory doesn't exist, the parsing fails, because RealFileSystem requires that the chdir target exists. To fix this, I'll have to reimplement RealFileSystem without this check.

I do not understand this. How is it possible to have a build generated with CMake (which creates directory structures) or logged (in which case the build was executed and should have created the directories) but do not have the "build directory" (what does this refer to?) exist?

Is it meaningful to parse a file in such a context? If the directory where the build was executed (is this what "build directory" refers to?) doesn't even exist, how was the build executed? We'll end up having issues with (potentially generated?) files gone missing...

In general, I would prefer if we tried not to reimplement things from LLVM too much, unless really needed. We're way behind the upstream and all these reimplements will cause problems when we try supporting newer releases of Ubuntu even.

Perhaps we could either create the directory, or give a hard error and have the user reconsider what they messed up with their build system...

whisperity commented 2 years ago

Okay, I tried giving a second read to the patch itself but I am still confused. What do we need the chdir for, and why do we try chdiring to that directory? Or is that chdir needed by Clang instead? What about VirtualFileSystem? Can't we create a VirtualFileSystem that always says "Yes, sir, this directory surely exists!" in case it doesn't really exist on the real device, and overlay that on top of the RFS in the VFS stack the ClangTool uses and then make clang believe that directory is there?

andocz commented 2 years ago

How is it possible to have a build generated with CMake (which creates directory structures) or logged (in which case the build was executed and should have created the directories) but do not have the "build directory" (what does this refer to?) exist?

It can happen if the build directory is selectively transferred between computers, like what happens on the CI where we zip up just the source files. The build directory (where CMake was invoked) exists here actually, but the compilation directory ("directory" in compile_commands.json) of some commands doesn't, since directories containing no source files go missing.

We'll end up having issues with (potentially generated?) files gone missing...

We can have the compilation dependencies all intact with only compilation directories containing none missing, which is what happened on the CI.

In general, I would prefer if we tried not to reimplement things from LLVM too much, unless really needed.

Oh, I didn't mean reimplementing it from scratch, just creating a wrapper FileSystem subclass that adjust arguments before passing them on to the methods of a RealFileSystem instance, converting relative paths to absolute ones. It would work like ProxyFileSystem.

Or is that chdir needed by Clang instead?

Yes, it chdirs to the directory of each compilation command before running the tool here. This is what will make future references to relative paths get resolved correctly.

Can't we create a VirtualFileSystem that always says [...]

I don't think that would work. RealFileSystem uses llvm::sys::fs::is_directory to verify that the work dir exists; the ClangTool's stack isn't used at this level.

But even if it worked, I think my "redirecting wrapper class" idea is more straightforward.

whisperity commented 2 years ago

Oh, there is ProxyFileSystem now... neat. Back when I implemented reparse, that wasn't a thing yet. The setCWD would've worked though, because that's also what reparse uses:

https://github.com/Ericsson/CodeCompass/blob/83a6a9a724892f458060e1a39b89a5383d381102/plugins/cpp_reparse/service/src/databasefilesystem.cpp#L248-L261

And in this case, almost none (apart from Clang's own intrinsic headers that install with CodeCompass...) of the directories exist on the server. And the overlay is passed to ClangTool:

https://github.com/Ericsson/CodeCompass/blob/83a6a9a724892f458060e1a39b89a5383d381102/plugins/cpp_reparse/service/src/reparser.cpp#L174-L176

andocz commented 2 years ago

The reason missing directories never caused an issue with reparse is because it doesn't set the working directory properly. It's using the default value of . from FixedCompilationDatabase::loadFromCommandLine. It's a good thing you mentioned reparse, because I realized it's still broken with relative paths.

DatabaseFileSystem doesn't actually use its _currentWorkingDirectory for anything. It passes the paths it receives as-is to getFile. Also, setCurrentWorkingDirectory doesn't work properly if path_ is absolute; the path_ gets appended to the work dir instead of being used in itself. These will have to be corrected.

andocz commented 2 years ago

The reparser properly supports relative paths now. I had to add a new table to the database (BuildDirectory) for this.

I decided to scrap the FakeWorkDirFileSystem, because to make it work properly, it would need to resolve ..s in paths lexically, which is too messy for my liking. (I can think of symlinks as a case when this would produce wrong results.) Now if the build directory is missing, a warning is emitted and / is used as a fallback.

For parsing to fail, a build command needs to have its directory missing and use a relative path. I don't think it's worth going out of our way to recover from this scenario given how unlikely it is. You suggested creating the build directory, but I think the parser shouldn't modify its inputs.

mcserep commented 7 months ago

The PR should also handle relative paths for the file in the compilation command source.

As part of #637, it was discover that in case the directory is /source/, the file is file.cpp and the execution directory of CodeCompass_parser is /cc/, then instead of /source/file.cpp, the non-existent file /cc/file.cpp is being parse, leading to error.