Currently, we have ~150k files and ~150k directories of JSON files for our code reviews. Parsing these code reviews into CSVs, which then get dumped into the database, takes about 2-3 hours, and unfortunately file IO is a huge part of this. For the bug data, we discovered that parsing goes very fast if we load the data 10 MB at a time and parse it that way (see the sketch at the end of this note). So, here's the big refactoring:
Write a script to collect the production data and place it into chunk files. We've already "chunkified" the JSON files into different directories; the script just needs to collect each of these groups into a single file.
The script should also merge the patchset files in at the appropriate place: a chunk file looks like a JSON array of code reviews, and the patchset data gets included in each review under its own key, which we then parse. A sketch follows.
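Here's a minimal sketch of what the chunkifier could look like, assuming one JSON file per review, patchset files named after their review, and a "patchsets" key; the directory names, file naming, and key name are all assumptions, not the real layout:

```python
import json
from pathlib import Path

REVIEWS_DIR = Path("reviews")      # hypothetical: one JSON file per code review
PATCHSETS_DIR = Path("patchsets")  # hypothetical: patchset files named after their review
OUT_DIR = Path("chunks")
CHUNK_SIZE = 500                   # ~500 reviews per chunk file

def chunkify():
    OUT_DIR.mkdir(exist_ok=True)
    batch, chunk_no = [], 0
    for review_path in sorted(REVIEWS_DIR.rglob("*.json")):
        review = json.loads(review_path.read_text(encoding="utf-8"))
        # Attach the patchset data under its own key so the parser
        # finds everything for a review in one record.
        patchset_path = PATCHSETS_DIR / review_path.name
        if patchset_path.exists():
            review["patchsets"] = json.loads(patchset_path.read_text(encoding="utf-8"))
        batch.append(review)
        if len(batch) >= CHUNK_SIZE:
            out = OUT_DIR / f"chunk_{chunk_no:05d}.json"
            out.write_text(json.dumps(batch), encoding="utf-8")
            batch, chunk_no = [], chunk_no + 1
    if batch:  # flush the final partial chunk
        out = OUT_DIR / f"chunk_{chunk_no:05d}.json"
        out.write_text(json.dumps(batch), encoding="utf-8")
```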
Update the code review parser to parse these new chunk files.
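A sketch of the updated parser loop, assuming the chunk layout above; the CSV columns are illustrative, the real ones live in the existing parser:

```python
import csv
import json
from pathlib import Path

def parse_chunks(chunk_dir="chunks", out_csv="reviews.csv"):
    # Each chunk file is one JSON array of ~500 reviews, so we pay
    # one file read per ~500 reviews instead of one per review.
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review_id", "num_patchsets"])  # illustrative columns
        for chunk_path in sorted(Path(chunk_dir).glob("chunk_*.json")):
            for review in json.loads(chunk_path.read_text(encoding="utf-8")):
                patchsets = review.get("patchsets", [])
                writer.writerow([review.get("id"), len(patchsets)])
```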
Document the new process for scraping code reviews: use the scraper to download everything, then use the chunkifier script to collect the results into groups of about 500 reviews apiece.
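The documented flow might boil down to something like this (module and function names are assumptions, matching the sketches above):

```python
from scraper import scrape_all   # hypothetical: downloads raw review and patchset JSON
from chunkify import chunkify    # the chunkifier sketched above

scrape_all()   # step 1: scrape everything
chunkify()     # step 2: collect into chunk files of ~500 reviews
```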
Keep the verifies running - we don't need any new verifies for this task.
This is also a big task, so I'll make sure the development is done in a separate branch so it doesn't break the daily build. I would also like to take this on.
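For reference, here's roughly what the 10 MB-at-a-time loading we used for the bug data looks like - a minimal sketch, assuming newline-delimited JSON records (the actual record format may differ):

```python
import json

CHUNK_BYTES = 10 * 1024 * 1024  # read ~10 MB per IO call

def iter_records(path):
    """Yield parsed JSON records, reading the file in ~10 MB blocks
    instead of one small read per record."""
    remainder = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            block = f.read(CHUNK_BYTES)
            if not block:
                break
            lines = (remainder + block).split("\n")
            remainder = lines.pop()  # last piece may be a partial record
            for line in lines:
                if line.strip():
                    yield json.loads(line)
    if remainder.strip():
        yield json.loads(remainder)
```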