Trass3r closed this issue 4 years ago.
Also it uses loads of memory for such large codebases, around 27GB for clang.
Yeah, I changed it to the 64-bit variants of ftell (which, it turns out, are different on each OS!) just now, thanks.
I have some thoughts about improving performance and reducing memory usage for very large codebases, but haven't gotten to that yet.
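For anyone running into the same limit: plain `ftell` returns a `long`, which overflows for files larger than 2 GB on some platforms. A minimal sketch of the per-OS 64-bit variants, assuming a hypothetical `GetFileSize64` helper (not the project's actual code):

```cpp
#include <cstdio>
#include <cstdint>

// Sketch only: plain ftell returns a long (32-bit on Windows), so it
// overflows for files larger than 2 GB. Each OS has its own 64-bit variant.
static int64_t GetFileSize64(FILE* f)
{
#if defined(_MSC_VER)
    _fseeki64(f, 0, SEEK_END);
    int64_t size = _ftelli64(f);
    _fseeki64(f, 0, SEEK_SET);
#else
    fseeko(f, 0, SEEK_END);
    int64_t size = ftello(f);   // ftello returns off_t; 64-bit with _FILE_OFFSET_BITS=64
    fseeko(f, 0, SEEK_SET);
#endif
    return size;
}
```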
I've encountered similar issues when trying this. In order to get this partially working, I had to rewrite parts of the code. My changes aren't ready to share, as I mainly hacked them in; not sure if I even want to clean it up.
Some elements I found (runStop):
The analyze phase currently takes way too much memory to be usable (a combined file of 40 GB causes memory usage of 220 GB). From what I can see, the current choice of JSON parser won't work at this magnitude: since it's a DOM parser, it first needs the complete file in memory and then translates it into a tree (which takes even more space) before any processing can be done. If you really want to support big files, I think you'll need a SAX (streaming) parser.
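To illustrate the SAX idea (using RapidJSON's streaming `Reader` purely as an example; the project may use a different parser), something like this processes events as they are parsed and never builds a DOM, so memory stays bounded regardless of file size. The handler and file name below are hypothetical:

```cpp
#include <cstdio>
#include "rapidjson/reader.h"
#include "rapidjson/filereadstream.h"

// Hypothetical handler: just counts JSON objects; a real analyzer would
// aggregate trace event timings instead of counting.
struct CountingHandler : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountingHandler>
{
    size_t objectCount = 0;
    bool StartObject() { ++objectCount; return true; }
};

int main()
{
    FILE* f = fopen("capture_file.json", "rb"); // hypothetical input name
    if (!f) return 1;

    char buffer[65536];
    rapidjson::FileReadStream stream(f, buffer, sizeof(buffer));

    CountingHandler handler;
    rapidjson::Reader reader;
    reader.Parse(stream, handler);   // streams through the file chunk by chunk

    fclose(f);
    printf("objects: %zu\n", handler.objectCount);
    return 0;
}
```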
I was thinking that the majority of space in the combined JSON file (or even in a single JSON file) is redundant strings, e.g. the full path of where exactly <vector> was, over and over again.
My plan is to at some point make the "smash all jsons into one huge json" step a bit more intelligent. It could de-duplicate strings and just store their IDs, with a table of ID->string kept elsewhere. Maybe then it would not be as huge.
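A rough sketch of that de-duplication idea, i.e. string interning (names are illustrative, not from the actual code): each unique string is stored once and referenced by a small integer ID everywhere else.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical string table: maps each unique string to a stable ID and
// keeps an ID -> string lookup for writing the table out separately.
struct StringTable
{
    std::unordered_map<std::string, uint32_t> idByString;
    std::vector<std::string> strings;   // ID -> string

    uint32_t Intern(const std::string& s)
    {
        auto it = idByString.find(s);
        if (it != idByString.end())
            return it->second;          // already stored, reuse the ID
        uint32_t id = (uint32_t)strings.size();
        strings.push_back(s);
        idByString.emplace(s, id);
        return id;
    }
};
```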
@Trass3r @JVApen I did a bunch of changes (memory handling, threading, ...) that make the thing 2x faster, use 10x less memory, and produce a data file that is 4x smaller in my tests, see #37 -- plan to merge it to the master branch soon.
Merged the above to master; things should be better than before. If you still hit issues on your codebases, please reopen!
The analyzer can't read large json files due to this code:
Maybe it should just use memory-mapped files.
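For reference, a sketch of the memory-mapped-file approach on POSIX (`mmap`); Windows would use `CreateFileMapping`/`MapViewOfFile` instead. This is purely illustrative, not the analyzer's actual code:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 1; }

    // Map the whole file; the OS pages it in on demand, so even multi-GB
    // files do not need an explicit read buffer or a 32-bit file offset.
    const char* data = (const char*)mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return 1; }

    // ... parse data[0 .. st.st_size) here ...
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap((void*)data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```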