darold / squidanalyzer

Squid Analyzer parses Squid proxy access log and reports general statistics about hits, bytes, users, networks, top URLs, and top second level domains. Statistic reports are oriented toward user and bandwidth control.
http://squidanalyzer.darold.net/

Squidanalyzer very slow #198

Closed szaszg closed 5 years ago

szaszg commented 5 years ago

Thanks for the commit no. 842ade0.

When I tried to check it, I found that the git code is extremely slow compared with 6.6 (in the changelog: 6.6 - Sun May 7 16:38:14 CEST 2017).

The "test" file (by the way, I just wanted to reparse all my logs but got stuck at the first one) contains 3_247_966 lines (503_274_650 bytes).

The "old" sa parses it in about 5 min:

DEBUG: the log statistics gathering took:162 wallclock secs (158.71 usr 1.49 sys + 2.38 cusr 0.25 csys = 162.83 CPU)
DEBUG: generating HTML output took:62 wallclock secs (52.18 usr + 2.24 sys = 54.42 CPU)
DEBUG: total execution time:224 wallclock secs (210.89 usr 3.73 sys + 2.38 cusr 0.25 csys = 217.25 CPU)

But the "new" code takes more than half a day to parse the same file!! (I started it yesterday, and it has not finished yet...)

DEBUG: the log statistics gathering took:42525 wallclock secs (42420.47 usr 30.10 sys + 2.61 cusr 0.42 csys = 42453.60 CPU)

This is on the same HW/SW, with the same config and the same log file.

The config is very similar to the default. I only have a few excludes:

NETWORK 10.1.250.0/24
NETWORK 127.0.0.1/8
URI ..kaspersky.com.
URI ..kaspersky-labs.com.
URI ..windowsupdate.com.
URI ..microsoft.com.
URI ..f-secure.com.

and three network aliases:

EPROG 10.1.70.0/24
CT 10.1.80.0/24
OVSZK 10.1.60.0/24
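For readers unfamiliar with what the NETWORK excludes and network aliases above express, here is a minimal sketch of the matching logic. SquidAnalyzer itself is written in Perl; this Python version is for illustration only, and the names `EXCLUDED_NETWORKS`, `NETWORK_ALIASES`, `is_excluded` and `alias_for` are hypothetical, not taken from the project.

```python
import ipaddress

# CIDR values copied from the comment above; the list/dict layout is an assumption.
EXCLUDED_NETWORKS = ["10.1.250.0/24", "127.0.0.1/8"]
NETWORK_ALIASES = {
    "EPROG": "10.1.70.0/24",
    "CT": "10.1.80.0/24",
    "OVSZK": "10.1.60.0/24",
}


def parse_network(cidr):
    # strict=False accepts entries such as 127.0.0.1/8, whose host bits are set.
    return ipaddress.ip_network(cidr, strict=False)


# Parse every network once, up front, rather than once per log line.
EXCLUDED = [parse_network(cidr) for cidr in EXCLUDED_NETWORKS]
ALIASES = {name: parse_network(cidr) for name, cidr in NETWORK_ALIASES.items()}


def is_excluded(client_ip):
    """Return True if the client address falls inside an excluded network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in EXCLUDED)


def alias_for(client_ip):
    """Return the alias of the first network containing the address, or None."""
    addr = ipaddress.ip_address(client_ip)
    for name, net in ALIASES.items():
        if addr in net:
            return name
    return None
```

A membership test like this involves no regular expressions, which is consistent with the later observation in this thread that a single network exclude adds only a modest cost.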

szaszg commented 5 years ago

So, here is the end of the "new" run:

DEBUG: generating HTML output took:11638 wallclock secs (11602.09 usr + 12.26 sys = 11614.35 CPU)
DEBUG: total execution time:54163 wallclock secs (54022.56 usr 42.36 sys + 2.61 cusr 0.42 csys = 54067.95 CPU)
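To put the two runs side by side, the wallclock figures quoted so far can be compared with a few lines of Python. This is only a sketch: the DEBUG entries below are shortened copies of the ones in this thread (the CPU breakdown in parentheses is dropped), and the helper names are hypothetical.

```python
import re

# Matches entries such as "DEBUG: total execution time:224 wallclock secs".
TIMING = re.compile(r"DEBUG: (.+?)(?: took)?:\s*(\d+) wallclock secs")

# Shortened copies of the DEBUG entries from the "old" (6.6) run.
OLD_RUN = (
    "DEBUG: the log statistics gathering took:162 wallclock secs "
    "DEBUG: generating HTML output took:62 wallclock secs "
    "DEBUG: total execution time:224 wallclock secs"
)
# Shortened copies of the DEBUG entries from the "new" (git) run.
NEW_RUN = (
    "DEBUG: the log statistics gathering took:42525 wallclock secs "
    "DEBUG: generating HTML output took:11638 wallclock secs "
    "DEBUG: total execution time:54163 wallclock secs"
)


def wallclock_seconds(log_text):
    """Map each phase label to its wallclock duration in seconds."""
    return {label: int(secs) for label, secs in TIMING.findall(log_text)}


def slowdown(old, new):
    """Return new/old duration ratios for the phases present in both runs."""
    return {label: new[label] / old[label] for label in old if label in new}


for label, factor in slowdown(wallclock_seconds(OLD_RUN), wallclock_seconds(NEW_RUN)).items():
    print(f"{label}: {factor:.1f}x slower")
```

By these numbers the total run is roughly 240 times slower than with 6.6, and both the parsing phase and the HTML generation phase are affected.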

darold commented 5 years ago

Thanks for the report. I have reverted this commit and applied a new one, 5b00641, that might solve the performance issue. Let me know if it still fixes the regexp issue.

szaszg commented 5 years ago

Hello, sorry, but IMHO the last commit is not the cause of the slowness (with the new 5b00641 commit, it is still very slow).

The problem is somewhere around the regexp handling, because without any regexp (without excludes and network aliases) the execution time is just a little longer than the old sa with excludes + network aliases:

DEBUG: total execution time:234 wallclock secs (225.50 usr 5.17 sys + 2.37 cusr 0.20 csys = 233.24 CPU)

Old (with excludes and aliases): 217 sec
New (without any regexp): 234 sec
New (with only one network exclude, not a regexp): 268 sec
New (with one URI regexp): very slow... and because you left out the \Q \E around the URI in grep, it crashes again :(

The 'old' code is from around 2018-03.
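The \Q \E remark refers to Perl's quoting of regex metacharacters inside an interpolated pattern. The same pitfall can be sketched in Python, where `re.escape` plays the role of \Q...\E. The function names and URIs below are hypothetical and are not taken from the SquidAnalyzer code.

```python
import re


def contains_uri_unsafe(uri, known_uris):
    # The raw URI is used as a pattern, so characters such as "(" or "?"
    # are interpreted as regex syntax instead of literal text.
    pattern = re.compile(uri)
    return any(pattern.search(known) for known in known_uris)


def contains_uri_safe(uri, known_uris):
    # re.escape quotes every metacharacter, like \Q...\E does in Perl.
    pattern = re.compile(re.escape(uri))
    return any(pattern.search(known) for known in known_uris)


# Hypothetical URIs of the kind that appear in proxy logs.
known = ["http://example.com/a(b", "http://example.com/index.php?id=1"]

try:
    contains_uri_unsafe("http://example.com/a(b", known)
except re.error as exc:
    # An unbalanced parenthesis makes the pattern invalid.
    print("unsafe lookup failed:", exc)

print(contains_uri_safe("http://example.com/a(b", known))
```

Even when the unquoted pattern compiles, it can silently match the wrong thing: in `index.php?id=1` the `?` makes the preceding `p` optional rather than matching a literal question mark.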

darold commented 5 years ago

Hi,

Could you please try the latest development code and let me know how this new implementation performs? I have rewritten the way SquidAnalyzer checks URIs, as well as the cache.
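The thread does not show how the rewrite works, so the following is only a generic sketch of one common way to keep per-line regex cost low: compile the exclusion patterns once into a single alternation and memoize the result per host. It is not the SquidAnalyzer implementation. The patterns are assumed forms built from the domains mentioned earlier, and `is_excluded_host` is a hypothetical name.

```python
import re
from functools import lru_cache

# Assumed pattern forms; the exact patterns used in the thread are not recoverable.
EXCLUDE_PATTERNS = [
    r"\.kaspersky\.com$",
    r"\.windowsupdate\.com$",
    r"\.microsoft\.com$",
]

# Compile one combined alternation once, instead of one regex per pattern per log line.
EXCLUDE_RE = re.compile("|".join(f"(?:{p})" for p in EXCLUDE_PATTERNS))


@lru_cache(maxsize=None)
def is_excluded_host(host):
    """Return True if the host matches any exclusion pattern (cached per host)."""
    return EXCLUDE_RE.search(host) is not None


# Repeated hosts, which are common in proxy logs, are answered from the cache.
for host in ["dnl-01.kaspersky.com", "example.org", "example.org"]:
    print(host, is_excluded_host(host))
```

The design trade-off is memory for speed: an unbounded cache grows with the number of distinct hosts, so a bounded `maxsize` may be preferable for very large logs.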

szaszg commented 5 years ago

It is now far better (with the same test file as before):

DEBUG: total execution time:263 wallclock secs (252.59 usr 4.70 sys + 2.35 cusr 0.28 csys = 259.92 CPU)

Now 260 sec vs. 217 sec.

Thanks a lot :)

darold commented 5 years ago

Thanks for the feedback, I'm closing the issue.