Trass3r closed this issue 4 years ago.
Also it uses loads of memory for such large codebases, around 27GB for clang.
Yeah, I changed it to the 64-bit variants of ftell (which, it turns out, are different on each OS!) just now, thanks.
I have some thoughts about improving performance and reducing memory usage for very large codebases, but haven't gotten to that yet.
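For anyone running into the same limit: plain `ftell` returns a `long`, which overflows for files larger than 2 GB on some platforms. A minimal sketch of the per-OS 64-bit variants, assuming a hypothetical `GetFileSize64` helper (not the project's actual code):

```cpp
#include <cstdio>
#include <cstdint>

// Sketch only: plain ftell returns a long (32-bit on Windows), so it
// overflows for files larger than 2 GB. Each OS has its own 64-bit variant.
static int64_t GetFileSize64(FILE* f)
{
#if defined(_MSC_VER)
    _fseeki64(f, 0, SEEK_END);
    int64_t size = _ftelli64(f);
    _fseeki64(f, 0, SEEK_SET);
#else
    fseeko(f, 0, SEEK_END);
    int64_t size = ftello(f);   // ftello returns off_t; 64-bit with _FILE_OFFSET_BITS=64
    fseeko(f, 0, SEEK_SET);
#endif
    return size;
}
```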
I've encountered similar issues when trying this. In order to get this partially working, I had to rewrite parts of the code. My changes aren't ready to share, as I mainly hacked them in; not sure if I even want to clean it up.
Some elements I found (runStop):
The analyze phase currently takes way too much memory to be usable (a combined file of 40 GB causes memory usage of 220 GB). From what I can see, the current choice of JSON parser won't work at this magnitude: since it's a DOM parser, it first needs the complete file in memory and then translates it into a tree (which takes even more space) before any processing can be done. If you really want to support big files, I think you'll need a SAX (streaming) parser.
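To illustrate the SAX idea (using RapidJSON's streaming `Reader` purely as an example; the project may use a different parser), something like this processes events as they are parsed and never builds a DOM, so memory stays bounded regardless of file size. The handler and file name below are hypothetical:

```cpp
#include <cstdio>
#include "rapidjson/reader.h"
#include "rapidjson/filereadstream.h"

// Hypothetical handler: just counts JSON objects; a real analyzer would
// aggregate trace event timings instead of counting.
struct CountingHandler : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountingHandler>
{
    size_t objectCount = 0;
    bool StartObject() { ++objectCount; return true; }
};

int main()
{
    FILE* f = fopen("capture_file.json", "rb"); // hypothetical input name
    if (!f) return 1;

    char buffer[65536];
    rapidjson::FileReadStream stream(f, buffer, sizeof(buffer));

    CountingHandler handler;
    rapidjson::Reader reader;
    reader.Parse(stream, handler);   // streams through the file chunk by chunk

    fclose(f);
    printf("objects: %zu\n", handler.objectCount);
    return 0;
}
```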
I was thinking that the majority of space in the combined JSON file (or even in a single JSON file) is redundant strings, e.g. the full path of where exactly <vector> was, over and over again.
My plan is to at some point make the "smash all jsons into one huge json" step a bit more intelligent. It could de-duplicate strings and just store their IDs, with a table of ID->string kept elsewhere. Maybe then it would not be as huge.
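A rough sketch of that de-duplication idea, i.e. string interning (names are illustrative, not from the actual code): each unique string is stored once and referenced by a small integer ID everywhere else.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string table: maps each unique string to a stable ID and
// keeps an ID -> string lookup for writing the table out separately.
struct StringTable
{
    std::unordered_map<std::string, uint32_t> idByString;
    std::vector<std::string> strings;   // ID -> string

    uint32_t Intern(const std::string& s)
    {
        auto it = idByString.find(s);
        if (it != idByString.end())
            return it->second;          // already stored, reuse the ID
        uint32_t id = (uint32_t)strings.size();
        strings.push_back(s);
        idByString.emplace(s, id);
        return id;
    }
};
```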
@Trass3r @JVApen I did a bunch of changes (memory handling, threading, ...) that make the thing 2x faster, use 10x less memory, and produce a data file that is 4x smaller in my tests, see #37 -- plan to merge it to the master branch soon.
Merged the above to master; things should be better than before. If you still hit issues on your codebases, please reopen!
The analyzer can't read large json files due to this code:
Maybe it should just use memory-mapped files.
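For reference, a sketch of the memory-mapped-file approach on POSIX (`mmap`); Windows would use `CreateFileMapping`/`MapViewOfFile` instead. This is purely illustrative, not the analyzer's actual code:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    // Map the whole file; the OS pages it in on demand, so even multi-GB
    // files do not need an explicit read buffer or a 32-bit file offset.
    const char* data = (const char*)mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return 1; }

    // ... parse data[0 .. st.st_size) here ...
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap((void*)data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```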