allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Scrambled browsers and OS aggregations #1806

Closed luchaos closed 4 years ago

luchaos commented 4 years ago

GoAccess - 1.4.

log-format COMBINED
html-prefs {"theme":"darkPurple","perPage":20,"layout":"vertical","showTables":true,"visitors":{"plot":{"chartType":"bar"}}}
agent-list false
with-output-resolver false
http-method yes
http-protocol yes
no-query-string true
anonymize-ip true
all-static-files false
ignore-crawlers false
crawlers-only false
ignore-panel REQUESTS_STATIC
ignore-panel NOT_FOUND
ignore-panel HOSTS
ignore-panel REFERRERS
ignore-panel REFERRING_SITES
ignore-panel KEYPHRASES
keep-last 30
real-os true
db-path /srv/stats/storage/data/tmp/

It looks like data is getting scrambled after a few iterations of running goaccess incrementally. Maybe I'm reading the reports incorrectly, but should it even be possible for the total to exceed 100%? In the screenshots below you can see Chrome versions being attributed to the Unknown and Crawlers lists, Android Nougat escaping the Android list, etc. It happens a lot for the user agents in the custom browsers list, too, so I assume there is some mismatch happening at a lower level? Maybe it's just the HTML output?

I'm running multiple incremental report generations at nginx logrotate time (daily) to have custom user agents aggregated vertically depending on use case. The assumption has been that there is no other way to distinguish between multiple applications within a single user agent string; that's usually not how user agents are interpreted, but it's relevant in my particular use case's domain. The logs in question aren't even that large, and the script usually completes fairly quickly. db-path is being overridden by the individual calls, as can be seen here:

#!/usr/bin/env sh
goaccess --no-progress --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/web.list --db-path=/srv/stats/storage/data/prod/web/ /var/log/nginx/web/access.log -o /srv/stats/public/prod-web.html --restore --persist --html-report-title="Web" --enable-panel=REQUESTS_STATIC --enable-panel=NOT_FOUND
goaccess --no-progress --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/api.list --db-path=/srv/stats/storage/data/prod/api/ /var/log/nginx/web-api/access.log -o /srv/stats/public/prod-api.html --restore --persist --html-report-title="API" --enable-panel=NOT_FOUND
goaccess --no-progress --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/clients.list --db-path=/srv/stats/storage/data/prod/connect-clients/ /var/log/nginx/web-connect/access.log -o /srv/stats/public/prod-connect-clients.html --restore --persist --html-report-title="Connect | Clients" --enable-panel=REQUESTS_STATIC --enable-panel=NOT_FOUND
goaccess --no-progress --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/integration.list --db-path=/srv/stats/storage/data/prod/connect-integration/ /var/log/nginx/web-connect/access.log -o /srv/stats/public/prod-connect-integration.html --restore --persist --html-report-title="Connect | Integration" --enable-panel=REQUESTS_STATIC --enable-panel=NOT_FOUND
goaccess --no-progress --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/cores.list --db-path=/srv/stats/storage/data/prod/connect-cores/ /var/log/nginx/web-connect/access.log -o /srv/stats/public/prod-connect-cores.html --restore --persist --html-report-title="Connect | Cores" --enable-panel=REQUESTS_STATIC --enable-panel=NOT_FOUND
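For context, hooking a script like this into nginx's daily rotation is typically done from logrotate's prerotate hook, so the day's entries are flushed into the persistent DBs before the log file is rotated away. A rough sketch (the script path and rotation settings are assumptions, not taken from this thread):

```conf
# /etc/logrotate.d/nginx (sketch)
/var/log/nginx/*/access.log {
    daily
    rotate 30
    compress
    sharedscripts
    prerotate
        # parse the current logs into the persistent GoAccess DBs
        # before they are rotated away
        /srv/stats/bin/generate-reports.sh
    endscript
    postrotate
        # tell nginx to reopen its log files
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}
```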

Screenshot 2020-06-08 at 22 06 06 Screenshot 2020-06-08 at 22 07 06

luchaos commented 4 years ago

Additionally, as soon as the scrambling starts, there are errors in the browser's console:

Error: <g> attribute transform: Expected number, "translate(NaN,0)". o @ prod-connect-integration.html:1

After a complete reset and running goaccess restore/persist for each gunzipped log file, there are no errors and no scrambled browsers or OSs. Update: turns out they are still there, just not as many.

allinurl commented 4 years ago

Thanks for sharing this. Interesting. Just to confirm: you are not running it in real time? And are you getting percentages over 100% for single-level panels (i.e. those with no sub-data within a particular item: requests, time distribution, etc.)?

luchaos commented 4 years ago

Correct - those are all "static", based on incremental reads, no real-time.

There are other panels within those 5 reports where the totals don't match as can be seen in the screenshots/recordings below.

The labels seem to refer to duplicate series within the charts. Some static files appear in the list multiple times, with percentages too low to ever total 100%. This happens in the Requested Files and Static Requests panels, not in the Browsers or OS panels (though the data is still scrambled/duplicated there).

I noticed that the date range in the upper right corner is not shown if the Time Distribution panel has not been enabled. Might this be related to an unfortunate combination of panels, which dictate how data is aggregated in the reports? E.g. the Hosts and HTTP Protocol panels have been disabled in all of them; is it missing particular dimensions needed to build a clean data set because of that?

Screen Recording 2020-06-09 at 18 57 58

Screenshot 2020-06-09 at 18 37 38

Screenshot 2020-06-09 at 19 07 52

luchaos commented 4 years ago

Just to be sure, I ran a restore (with and without all the previously disabled panels enabled): goaccess --config-file=/srv/stats/conf/goaccess.conf --browsers-file=/srv/stats/conf/browsers/web.list --db-path=/srv/stats/storage/data/prod/web/ /var/log/nginx/web/access.log --restore --enable-panel REQUESTS_STATIC --enable-panel NOT_FOUND --enable-panel HOSTS --enable-panel REFERRERS --enable-panel REFERRING_SITES --enable-panel KEYPHRASES

After removing all db data and reading/persisting each single gunzipped file in order from oldest to newest, the browser errors that used to come up right after opening the page are gone, but opening a Browsers dropdown shows the duplicates and the errors again. Some of the panels show more accurate data, others less accurate (e.g. HTTP Status Codes is at 132%).

The duplicates persist regardless of HTML or terminal output, so I assume the scrambling happened in the database after all.

scheb commented 4 years ago

@luchaos Did you find a solution to this problem in the meantime?

I have the same issue. It happened to me twice in the last few days, since I started using GoAccess about a week ago. I'm using --restore + --persist for incremental updates. At some point browsers, OS, and referer URLs get scrambled (maybe some other things too, but these are the obvious ones). They show duplicates, and browsers/OS are listed under the wrong category (e.g. Firefox user agents are shown under Chrome).

I started the report by reading in all the archived server logs. Then I have a cronjob running every hour to update the HTML report by reading from the current server log file.

The command I'm using for incremental update looks like this:

goaccess /var/log/nginx/example.org.access.log --persist --restore --db-path /var/www/usage-reports/example.org/ --ignore-referer=*example.org -o /var/www/usage-reports/example.org/index.html
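For reference, the hourly update described above could be driven by a cron entry along these lines (schedule, user, and file locations are illustrative, not taken from this thread):

```conf
# /etc/cron.d/goaccess-report (sketch)
# Every hour, fold new entries from the live log into the persistent DB
# and regenerate the HTML report.
0 * * * * www-data goaccess /var/log/nginx/example.org.access.log --persist --restore --db-path /var/www/usage-reports/example.org/ --ignore-referer=*example.org -o /var/www/usage-reports/example.org/index.html
```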

Removing all the *.db files and re-reading all logs from scratch solves the issue for me. Then, incremental updates are working fine, until some point when the problem re-appears.
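A reset along those lines might look roughly like this (paths follow the command above; whether your rotated logs are gzipped, and their naming, will vary):

```shell
#!/usr/bin/env sh
# Sketch: wipe the on-disk GoAccess state and rebuild it from all logs.
DB=/var/www/usage-reports/example.org
rm -f "$DB"/*.db

# Feed the rotated (gzipped) logs plus the current one through stdin,
# persisting the rebuilt state and regenerating the report.
zcat /var/log/nginx/example.org.access.log.*.gz \
    | cat - /var/log/nginx/example.org.access.log \
    | goaccess - --persist --db-path "$DB" -o "$DB"/index.html
```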

I've added logging now for the incremental updates, maybe I can see something from the log file the next time the issue occurs.

luchaos commented 4 years ago

@scheb unfortunately not :/

I have no idea when exactly the scrambling happens; as you said, it runs OK for some time after a reset. In the end I didn't need the reports to be that clean/precise, I just wanted a rough idea, but I'd still love to have a clean dataset at some point.

I tried a few different variants to see if it's caused by any combination of parameters, to no avail. It has to be something about how the aggregation into the db files works; it might even be a simple off-by-one error. E.g. it happens in the time distribution chart, too, where two data points are duplicated (05, 06) while two are missing (07, 08):

Screenshot 2020-09-28 at 22 04 43

I haven't found time yet to look at the code. Logging the incremental updates sounds good! I'd be highly interested in your findings :)

scheb commented 4 years ago

Well, the scrambling happened again, but my logs don't show anything other than successful parsings. I guess I'll keep a backlog of logs then and recreate the report from scratch every time.

jjgr86 commented 4 years ago

I have the same issue when I use the --persist --restore options.

allinurl commented 4 years ago

@jjgr86 @luchaos I deployed a fix that could potentially address this issue when using --persist --restore. Could you please try building from development and let me know if it fixes the issue on your end?

Please note that unfortunately you may need to parse your log from scratch as it was not persisting to disk a hash table that keeps track of some sequences so you may not be able to recover from your existing db files.
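For anyone following along, building from development is roughly the standard autotools dance (assuming build prerequisites are installed; configure flags depend on your setup):

```shell
# Sketch: build GoAccess from the development branch
git clone https://github.com/allinurl/goaccess.git
cd goaccess
autoreconf -fiv
./configure --enable-utf8    # add e.g. GeoIP support here if you need it
make
sudo make install
```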

jjgr86 commented 4 years ago

@allinurl I solved it on my end. Part of my configuration file, /etc/goaccess/goaccess.conf:

real-time-html true
daemonize true
log-file /var/log/apache2/4443.access.log
log-file /var/log/apache2/8181.access.log
log-file /var/log/apache2/80.access.log
output /var/www/report/index.html
db-path /tmp/goaccess
persist true
restore true

Steps that worked for me to solve it:

pkill goaccess
rm /var/www/report/index.html
rm /tmp/goaccess/*.db

I comment out in /etc/goaccess/goaccess.conf:

#daemonize true
#log-file /var/log/apache2/4443.access.log
#log-file /var/log/apache2/8181.access.log
#log-file /var/log/apache2/80.access.log

zcat 80.access.log-20201003.gz 8181.access.log-20201003.gz 4443.access.log-20201003.gz | goaccess -f 80.access.log-20201004 8181.access.log-20201004 4443.access.log-20201004 --persist -o /var/www/report/index.html -

I uncomment:

daemonize true
log-file /var/log/apache2/4443.access.log
log-file /var/log/apache2/8181.access.log
log-file /var/log/apache2/80.access.log
output /var/www/report/index.html

I run goaccess.

The error occurred when I ran with --persist without commenting out the log-file entries I had configured; when goaccess ran again, it duplicated some records.

Screenshot (1)

allinurl commented 4 years ago

@jjgr86 thanks for posting these findings. Just to be sure, did you build from development?

jjgr86 commented 4 years ago

> @jjgr86 thanks for posting these findings. Just to be sure, did you build from development?

No, I didn't; I used the version I already had installed, from the Debian 10.6 backports repository.

jjgr86 commented 4 years ago

@allinurl I also noticed that statistics stop being recorded when the apache2 logs are rotated, so I modified the file /etc/logrotate.d/apache2,

adding a pkill goaccess command in prerotate and a goaccess command in postrotate in apache2's log rotation configuration file, /etc/logrotate.d/apache2.
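Translated into logrotate terms, that hook would look something like this (a sketch only; the exact goaccess invocation depends on the daemonized configuration shown earlier):

```conf
# /etc/logrotate.d/apache2 (relevant excerpt, sketch)
/var/log/apache2/*.log {
    sharedscripts
    prerotate
        # stop the daemonized goaccess before the logs are rotated
        pkill goaccess || true
    endscript
    postrotate
        # restart goaccess; with daemonize/restore set in the config,
        # it backgrounds itself and picks up from the persisted db-path
        goaccess --config-file=/etc/goaccess/goaccess.conf
    endscript
}
```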

Maybe this belongs in another thread, but I'll leave this comment here!

allinurl commented 4 years ago

@jjgr86 Thanks and feel free to build from development, it will be very useful to know if the latest changes work fine on your end.

That issue with rotating logs has been fixed upstream. It will be pushed in the upcoming release. Stay tuned!

luchaos commented 4 years ago

> @jjgr86 @luchaos I deployed a fix that could potentially address this issue when using --persist --restore. Could you please try building from development and let me know if it fixes the issue on your end?
>
> Please note that unfortunately you may need to parse your log from scratch as it was not persisting to disk a hash table that keeps track of some sequences so you may not be able to recover from your existing db files.

Thanks @allinurl ! Built goaccess from source and removed existing db data. I'll let it run for a few days and get back to you with results.

allinurl commented 4 years ago

@luchaos looking forward to it :)

luchaos commented 4 years ago

@allinurl tl;dr: Thank you! Works perfectly now.

Running the project's logrotate script, which runs 5 different --restore --persist routines, several times manually, and having it run twice at logrotate time, has so far resulted in no errors whatsoever. Even after mixing those manual and automated runs there was no desync/scrambling; the aggregations are clean. Going by the previous version's results, an error would have shown itself by now.

I'd consider this issue resolved, at the latest once the fix lands in an upcoming release. This fix made me happier than you might assume :) Thank you for this great tool!

allinurl commented 4 years ago

@luchaos That's great to hear. Thanks for the update. I'll be pushing out the upcoming release in the next few days. Stay tuned.

Closing this, feel free to reopen it if needed.