DevMine / srcanlzr

Tool to analyze source code repositories
http://devmine.ch/doc/srcanlzr/
BSD 3-Clause "New" or "Revised" License

Add support for compressed JSON (gz or bz2) in input #1

Open gilliek opened 9 years ago

gilliek commented 9 years ago

Since we are dealing with a huge amount of data, re-parsing all the projects with the source code parsers every time we update the source analyzer is very slow. It therefore makes sense to store the intermediate JSON. However, the JSON files are very large and take up a lot of disk space, so it would be useful to compress them.

rolinh commented 9 years ago

Mmh, what is preventing us from storing the intermediate JSON in a compressed format, then decompressing it and streaming it to srcanlzr directly? Why does it need to be handled by srcanlzr?

gilliek commented 9 years ago

Nothing is preventing us from doing so. The main advantage of doing it directly in Go is performance IMHO. The bzip2 package of the Go standard library (http://golang.org/pkg/compress/bzip2/) provides a decompressing reader that implements io.Reader, and the JSON decoder can read JSON directly from any io.Reader. That way, the JSON is decompressed and decoded in a single streaming pass.

Besides, it only takes a few lines of code to implement that option. Since everything comes from the standard library, it does not require extra testing. So I see no reason not to implement it :)

rolinh commented 9 years ago

Fair enough. It'll be interesting to micro-benchmark using something like bzcat foo.json.bz2 | srcanlzr ... vs having srcanlzr handle it all through bzip2 from the standard library using the reader interface. Just out of curiosity. :)

gilliek commented 9 years ago

Yeah for sure :)

gilliek commented 9 years ago

I bet that the pure Go version will be faster. Even if the Go standard library implementation is much slower than bzcat(1), the bzcat solution has to read the bzipped file and write out the uncompressed JSON, which srcanlzr then has to read again through the pipe, instead of just reading the bzipped file once :)