There are two main problems with the current code base:

- it takes lots of memory
- it is super slow
The memory usage can be fixed by using a data pipeline: rather than analyzing all lines first and only afterwards processing them with the filters and commands, analyze and process each line as it is read. This way, lines can be discarded right away and memory consumption stays low.
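A minimal sketch of that pipeline idea, with hypothetical `analyze` and `pipeline` helpers standing in for the real parser and filter machinery:

```python
def analyze(line):
    # placeholder for the real regex-based line parser
    return {"raw": line, "length": len(line)}

def pipeline(lines, filters):
    # analyze and filter each line as it streams by; rejected lines
    # are dropped immediately instead of accumulating in memory
    for line in lines:
        data = analyze(line)
        if all(f(data) for f in filters):
            yield data

long_enough = lambda d: d["length"] > 3
result = [d["raw"] for d in pipeline(["ab", "abcd", "x", "hello"], [long_enough])]
# result == ["abcd", "hello"]
```

Because `pipeline` is a generator, it works the same on a 4-line list or a 600k-line file handle.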
This means refactoring the commands so that they collect data for each line and, once all lines have been consumed, compute their results (if any need to be computed).
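The command shape after that refactor might look like the following sketch; `CountByUser` and its `process`/`result` methods are hypothetical names, not the project's actual API:

```python
class CountByUser:
    """Toy command: accumulates per-line data, computes its result at the end."""

    def __init__(self):
        self.counts = {}

    def process(self, data):
        # called once per analyzed line; only keeps what the command needs
        user = data.get("user")
        if user is not None:
            self.counts[user] = self.counts.get(user, 0) + 1

    def result(self):
        # called once, after all lines have been fed in
        return dict(self.counts)

cmd = CountByUser()
for data in [{"user": "ana"}, {"user": "bob"}, {"user": "ana"}]:
    cmd.process(data)
# cmd.result() == {"ana": 2, "bob": 1}
```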
The slowness, after profiling with cProfile and snakeviz, seems to be bound to the huge regular expression used to parse each line. Rather than trying to optimize it further (some work was already done on that front), the main win is parallel processing, i.e. multiprocessing.Pool.
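A sketch of how the per-line regex work can be fanned out to worker processes; `LINE_RE`, `parse_line`, and `parse_all` are illustrative stand-ins, not the project's actual code:

```python
import re
from multiprocessing import Pool

# Tiny pattern standing in for the project's huge line-parsing regex
LINE_RE = re.compile(r"(?P<ip>\S+) (?P<verb>\S+)")

def parse_line(line):
    """Parse one line; return a dict of captured fields, or None on no match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

def parse_all(lines, chunksize=1000):
    """Run parse_line across a pool of worker processes."""
    with Pool() as pool:
        # imap streams results back lazily, so this composes with the
        # memory-saving pipeline instead of materializing a giant list
        yield from pool.imap(parse_line, lines, chunksize=chunksize)

if __name__ == "__main__":
    hits = [d for d in parse_all(["1.2.3.4 GET", "garbage"]) if d]
    print(hits)
```

`chunksize` matters here: handing each worker batches of lines amortizes the inter-process overhead, which would otherwise eat the gains on a cheap-per-line task like regex matching.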
As an example, a 600k-line file takes just too much memory to run on my beefy laptop. Once the memory usage refactoring is done, it takes almost no memory but around 6 minutes to process. After the multiprocessing change, those 6 minutes drop to 2 minutes.
Coverage increased (+0.06%) to 100.0% when pulling 30ed36ff8e97136a9b5a535255781aef599f2e17 on yield-lines into 468bedc8de5c6d01f8efb5e6150d1c1ff88e0af6 on master.