allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Duplicate entries with piped data in real-time HTML report #1825

Open jjlin opened 4 years ago

jjlin commented 4 years ago

I'm running goaccess 1.4 via the official Docker image.

Command:

tail -F /goaccess/logs/access_log | goaccess --no-global-config --config-file=/goaccess/goaccess.conf

(access_log is rotated daily via rotatelogs.)

Conf file:

date-format %d/%b/%Y
time-format %H:%M:%S
log-format %h %^ %^ [%d:%t %^] "%r" %s %b "%R" "%u"

output /var/www/stats/index.html
real-time-html true
db-path /goaccess/logs/persist
persist true
restore true
keep-last 180

I'm seeing duplicate entries after restoring persisted data. Some examples are shown below. I just started integrating goaccess a few days ago, so I've never had a fully working configuration. Also, the dashboard shows Total Requests: 117,118, but Valid Requests: 22,384 and Failed Requests: 0. Total should be the sum of valid and failed, right?

Time distribution

[Screenshot: Annotation 2020-06-23 224000]

Requested files

[Screenshot: Annotation 2020-06-23 224124]

jjlin commented 4 years ago

This looks similar to #1806 actually, and might have the same root cause. Feel free to close this one as duplicate if you think it's appropriate.

allinurl commented 4 years ago

Great question, please take a look at this post and notes about piping data for incremental processing.

Total requests, in your case, counts how many requests have been parsed overall; valid requests, on the other hand, counts how many have been processed. The two numbers can vary greatly when piping the same data over and over, as it will parse all those entries, but duplicates won't be counted towards the valid count.
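
To make the distinction concrete, here is a toy model of the two counters. It is only a sketch: the `replay` function and the exact-line `seen` set are hypothetical, and GoAccess's real duplicate detection for piped data may work differently (for example, by comparing timestamps). It only shows how total can grow while valid stays flat when the same lines are fed in repeatedly.

```python
def replay(lines, times):
    """Feed the same batch of log lines `times` times.

    Returns (total, valid): total counts every line parsed,
    valid counts only lines not seen before.
    Deduplicating on the exact line text is an assumption made
    for illustration, not GoAccess's actual mechanism.
    """
    seen = set()
    total = 0
    valid = 0
    for _ in range(times):
        for line in lines:
            total += 1          # every parsed line counts towards total
            if line in seen:
                continue        # duplicate: parsed but not processed
            seen.add(line)
            valid += 1          # first occurrence counts as valid
    return total, valid


batch = ["GET /a", "GET /b", "GET /c"]
print(replay(batch, 3))  # (9, 3)
```

With three distinct lines piped three times, total reaches 9 while valid stays at 3.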

Let me know if that helps.

This looks slightly different than #1806.

jjlin commented 4 years ago

From your link, I can see how there might be some potential double-counting, though I think this would be quite limited in my case since tail -F will only replay at most 10 duplicate lines by default. The difference between total (~117K) and valid (~22K) is pretty large though, and I only stopped/restarted goaccess a few times. Does the total include the entries restored from persisted data or something?
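
A quick back-of-the-envelope check of that argument, using the numbers from the dashboard. The function name `restarts_needed` is made up for this sketch, and it assumes each restart replays exactly the 10 lines that `tail` prints by default and nothing else.

```python
def restarts_needed(total, valid, lines_per_restart=10):
    """Number of restarts required to explain the total/valid gap
    if each restart only replayed `lines_per_restart` duplicate lines."""
    gap = total - valid
    # Ceiling division without importing math.
    return -(-gap // lines_per_restart)


total_requests = 117_118   # "Total Requests" from the dashboard
valid_requests = 22_384    # "Valid Requests" from the dashboard

print(total_requests - valid_requests)                   # 94734
print(restarts_needed(total_requests, valid_requests))   # 9474
```

Under that assumption it would take thousands of restarts to produce the observed gap, so a handful of restarts with 10 replayed lines each cannot account for it.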

Also, the fact that the charts are showing duplicate entries shouldn't be related to processing duplicate log entries, right? For example, the time distribution chart shows 3 entries for 13 and 2 entries each for 14 and 15. It isn't obvious unless you look closely, but this results in multiple points being plotted for those values.
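
For anyone who wants to check a panel for this symptom programmatically, a minimal sketch follows. The `repeated_labels` helper is hypothetical, and the input list simply restates the counts described above (three rows for hour 13, two each for 14 and 15); it is not taken from GoAccess output.

```python
from collections import Counter


def repeated_labels(labels):
    """Return the labels that appear more than once, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n > 1}


# Hour labels as described for the time distribution panel.
hours = ["13"] * 3 + ["14"] * 2 + ["15"] * 2
print(repeated_labels(hours))  # {'13': 3, '14': 2, '15': 2}
```

In a healthy panel each hour would appear once, so this function would return an empty dict.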

0bi-w6n-K3nobi commented 4 years ago

That is weird. The accounting via pipe/STDIN is always exact. The tail -F command works correctly on Linux/Unix and won't produce duplicate lines.