distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

indexer.py memory bloat in parsing large results.json file #1777

Open dbutenhof opened 4 years ago

dbutenhof commented 4 years ago

The standard JSON package reads the entire file to validate and parse the structure, which can consume a lot of memory.

Research alternatives, including streaming JSON parsers or even custom parsing. The JSON we're reading is just a list of objects; we want to parse and validate each list item, but we could be less discriminating about the outer list if there's a more efficient alternative.
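
For illustration, here is a minimal sketch of the "custom parsing" idea, assuming results.json is a single top-level JSON array of objects; `iter_json_array` is a hypothetical helper, not existing indexer code:

```python
import json


def iter_json_array(path, chunk_size=65536):
    """Yield each element of a top-level JSON array without holding the
    whole file in memory.  Assumes the array elements are JSON objects,
    so raw_decode() cannot succeed on a partially-read element."""
    decoder = json.JSONDecoder()
    with open(path, "r") as f:
        buf = f.read(chunk_size).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]  # skip the opening '[' of the outer list
        while True:
            buf = buf.lstrip()
            if buf.startswith("]"):
                return  # end of the outer list
            if buf.startswith(","):
                buf = buf[1:].lstrip()
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                more = f.read(chunk_size)
                if not more:
                    raise  # truncated or malformed file
                buf += more  # element spans a chunk boundary; keep reading
                continue
            yield obj
            buf = buf[end:]
```

This parses and validates each list item individually while being deliberately "less discriminating" about the outer list, trading strict whole-document validation for bounded memory use.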

npalaska commented 4 years ago

I did some investigation into this issue. There are a couple of points:

  1. We need an iterative streaming parser for JSON files that are too large to load into memory.
  2. However, parsing iteratively would require changing some behavior in our code. It would be tricky to handle code like https://github.com/distributed-system-analysis/pbench/blob/d8f835dd81abf6084c807c5caa507ceb34f9fae6/lib/pbench/server/indexer.py#L133-L134 and https://github.com/distributed-system-analysis/pbench/blob/d8f835dd81abf6084c807c5caa507ceb34f9fae6/lib/pbench/server/indexer.py#L170-L186, where we try to get all the keys at once, or where we take the extracted JSON dictionary and put it into another dictionary (the template body). I think this would mean loading the file into memory again, since we cannot put an iterator object into the template body.
  3. There is a nice Python package called ijson, built on the popular YAJL iterative JSON parsing library (a minimal usage sketch follows this list). Using it would mean relying on a third-party library; otherwise we would have to write our own wrapper similar to ijson, which might require significant effort. Any ideas about implementing such a wrapper are welcome.
  4. Does it make sense to use an SQLite-like database in the future if the JSON files grow quite big? That way we would not have to load everything into memory and could store everything on disk.
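
A minimal sketch of what point 3 could look like with ijson, assuming results.json is a top-level JSON array; `iter_results` is a hypothetical helper, not existing indexer code:

```python
import ijson


def iter_results(path):
    """Stream each element of the top-level JSON array in results.json
    instead of json.load()-ing the whole file at once."""
    with open(path, "rb") as f:
        # "item" is ijson's prefix for the elements of a top-level array.
        for doc in ijson.items(f, "item"):
            yield doc


# Hypothetical usage: validate and index one document at a time,
# keeping only the current element in memory.
# for doc in iter_results("results.json"):
#     validate(doc)
```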

Thoughts?