gforcada / haproxy_log_analysis

HAProxy log analyzer
https://pypi.org/project/haproxy_log_analysis
GNU General Public License v3.0
89 stars 35 forks

Parallel processing and low memory usage #35

Closed gforcada closed 4 years ago

gforcada commented 4 years ago

There are two main problems with the current code base: high memory usage and slow processing.

The memory usage can be fixed by using a data pipeline: rather than analyzing all lines first and only afterwards processing them with the filters and commands, analyze and process each line as it is read. This way, lines can be discarded right away and memory consumption stays low.

This means refactoring the commands so that they collect the data for each line and, once all lines have been consumed, compute the results (if any need to be computed).
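The streaming refactor described above could be sketched roughly as follows. This is a hypothetical illustration, not the library's actual API: `parse_lines` and `CounterCommand` are made-up names standing in for the real parser and command classes.

```python
def parse_lines(path):
    """Yield parsed lines one at a time so memory usage stays flat,
    instead of building a list of every line in the log file."""
    with open(path) as log_file:
        for raw_line in log_file:
            parsed = raw_line.strip()  # stand-in for the real log parsing
            if parsed:  # unusable lines are discarded right away
                yield parsed


class CounterCommand:
    """A command refactored for the pipeline: collect data per line,
    then compute the final result only after all lines are consumed."""

    def __init__(self):
        self.count = 0

    def collect(self, line):
        # called once per line while streaming
        self.count += 1

    def results(self):
        # called once, after the whole file has been processed
        return self.count
```

A driver would then simply loop over the generator, feeding each line to every active command, so at no point is more than one line held in memory.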

The slowness, after profiling with cProfile and snakeviz, seems to be bound to the huge regular expression used to parse each line. Rather than trying to optimize that further (some work was already done on that front), the main win is to use parallel processing, i.e. multiprocessing.Pool.
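The parallel-parsing idea could look roughly like this. It is a minimal sketch, assuming a worker function that applies the parsing regex to one line; the pattern here is a toy stand-in for the real (much larger) HAProxy regex, and the function names are illustrative.

```python
import re
from multiprocessing import Pool

# Toy stand-in for the real HAProxy log regex.
LINE_RE = re.compile(r"(?P<ip>\S+) (?P<status>\d{3})")


def parse_line(raw_line):
    """Apply the expensive regex to a single line (the per-item work unit)."""
    match = LINE_RE.match(raw_line)
    return match.groupdict() if match else None


def parse_in_parallel(lines, processes=4):
    """Spread the regex work across worker processes.

    A large chunksize keeps inter-process overhead low when there are
    many small work items, e.g. hundreds of thousands of log lines.
    """
    with Pool(processes=processes) as pool:
        return [r for r in pool.map(parse_line, lines, chunksize=1000) if r]
```

One design note: because the work per line is pure CPU (regex matching) with no shared state, it parallelizes cleanly across processes, which sidesteps the GIL in a way threads would not.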

As an example, a 600k-line log file takes too much memory to process on my beefy laptop. Once the memory-usage refactoring is done, it takes almost no memory but around 6 minutes to process. With multiprocessing, those 6 minutes drop to about 2 minutes.

coveralls commented 4 years ago


Coverage increased (+0.06%) to 100.0% when pulling 30ed36ff8e97136a9b5a535255781aef599f2e17 on yield-lines into 468bedc8de5c6d01f8efb5e6150d1c1ff88e0af6 on master.

gforcada commented 4 years ago

Released as 4.0.0 :tada: