froxlor / Froxlor

The server administration software for your needs - The official Froxlor development Git repository
http://www.froxlor.org
GNU General Public License v2.0

show the WHOLE webserver traffic in froxlors statistics #884

Open rseffner opened 4 years ago

rseffner commented 4 years ago

Summary

At the moment only outgoing webserver traffic with known HTTP status codes is counted in Froxlor's traffic statistics. With a change to Apache's LogFormat and by parsing the AWStats data file, we should be able to count ALL the traffic going through the webserver.

System information

Steps to reproduce

  1. look into apache.conf for a LogFormat containing %O
  2. look into the awstatsMMYYYY-DOMAIN.TLD.txt data file between BEGIN_DOMAIN and END_DOMAIN and compare with the values between BEGIN_TIME and END_TIME
  3. look into Froxlor's source to see which part of the AWStats data file is interpreted for the traffic statistics

Expected behavior

  1. log the sum of outgoing AND incoming web traffic (use %S instead of %O in Apache's LogFormat)
  2. sum "bandwidth" AND "not viewed bandwidth" from the _TIME part of the AWStats data file (instead of "bandwidth" from _DOMAIN)

Actual behavior

  1. only outgoing traffic is logged by Apache
  2. from the AWStats data file, Froxlor counts only traffic with known/defined HTTP status codes that does not come from robots ("bandwidth", not "not viewed bandwidth")

AWStats splits the traffic data into viewed and not viewed traffic. AWStats's explanation of not viewed traffic is "Not viewed traffic includes traffic generated by robots, worms, or replies with special HTTP status codes."
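The second expected-behavior item could be sketched roughly as follows. This is only an illustration, not Froxlor's code: the function name `sum_time_bandwidth` is hypothetical, and the column order of the _TIME rows (hour, pages, hits, bandwidth, not viewed pages, not viewed hits, not viewed bandwidth) is an assumption that should be checked against the comment header of a real data file.

```python
def sum_time_bandwidth(data_file_text):
    """Return (viewed, not_viewed) bandwidth in bytes from the TIME section.

    Assumed row layout (verify against your own AWStats data file):
    hour pages hits bandwidth not_viewed_pages not_viewed_hits not_viewed_bandwidth
    """
    viewed = 0
    not_viewed = 0
    in_time = False
    for line in data_file_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "BEGIN_TIME":
            in_time = True
        elif fields[0] == "END_TIME":
            break
        elif in_time and len(fields) >= 7:
            viewed += int(fields[3])      # assumed "bandwidth" column
            not_viewed += int(fields[6])  # assumed "not viewed bandwidth" column
    return viewed, not_viewed


# Made-up sample data, only to show the shape of the section.
SAMPLE = """\
BEGIN_TIME 3
0 10 20 1000 1 2 300
1 5 8 500 0 0 0
2 0 0 0 4 4 200
END_TIME
"""

viewed, not_viewed = sum_time_bandwidth(SAMPLE)
total_bytes = viewed + not_viewed
```

Whether this total really covers all traffic is exactly what the rest of this thread is about, so treat it as a starting point for comparison rather than a fix.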

rseffner commented 4 years ago

There are also _FILETYPES and _DOWNLOADS sections in the AWStats data file. For example, the PDF row in _FILETYPES has a value of 12.536.762, while the sum of all PDF files listed in the _DOWNLOADS section is 53.230.834.

The sum of _FILETYPES equals the sum of _DOMAIN and the sum of the _TIME column "bandwidth". As we learned from AWStats, we have to add the _TIME column "bandwidth not viewed" to catch traffic from robots, malware and with special HTTP return codes.

Another point seems to be that we also have to add the sum of the _DOWNLOADS section to get the WHOLE traffic (because it differs from _FILETYPES, which equals the _DOMAIN/_TIME bandwidth).

Why is there no TOTAL line in AWStats?

tobyX commented 3 years ago

I also stumbled over this because I wondered why my Nextcloud domain shows very little traffic. The reason is that if you use WebDAV, this traffic won't show up in _DOMAIN but in _LOGIN. I was not able to find any documentation for the AWStats data file; how do you read it correctly? At the moment I'm thinking about reading the Apache log directly to calculate the traffic. I found a little Perl snippet which does it quite well and fast:
cat access.log | perl -nE '/\[.+\] ".+" \d+ (\d+)/; $sum += $1; END {$sum = $sum / 1024 / 1024; printf("%.3f MB", $sum)}'
Something like that in PHP for Froxlor should also work, I think.
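For comparison, the same idea can be sketched in Python. This is not the Perl snippet's exact behavior and not Froxlor code: the regular expression, the function name `sum_log_bytes` and the sample lines are illustrative assumptions, and the log is assumed to be in Apache's common or combined format, where the size field is the number directly after the status code.

```python
import re

# '[timestamp] "request" status size'; the request may contain escaped quotes.
LINE_RE = re.compile(r'\[[^\]]+\] "(?:[^"\\]|\\.)*" (\d{3}) (\d+|-)')


def sum_log_bytes(lines):
    """Sum the size field of every log line that matches the assumed format."""
    total = 0
    for line in lines:
        match = LINE_RE.search(line)
        # A "-" size means no bytes were logged for that request.
        if match and match.group(2) != "-":
            total += int(match.group(2))
    return total


# Made-up sample lines in common/combined log format.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Oct/2021:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2021:13:55:37 +0200] "GET /missing HTTP/1.1" 404 196 "-" "curl/7.68.0"',
    '203.0.113.5 - - [10/Oct/2021:13:55:38 +0200] "GET /cached HTTP/1.1" 304 -',
]

total_bytes = sum_log_bytes(SAMPLE_LINES)
print(f"{total_bytes / 1024 / 1024:.3f} MB")
```

For a real file, passing an open file handle works the same way, for example `sum_log_bytes(open("access.log", encoding="utf-8", errors="replace"))`. What the sum means depends on the LogFormat: with %b it is the response body size without headers, while %O or %S from mod_logio would include more.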

At the moment Froxlor's traffic calculation is totally unreliable and has many issues (systemd normally rotates the logs before Froxlor can process them, plus the two problems mentioned here).

tobyX commented 3 years ago

I added a crude implementation that counts traffic directly in the log files, and logged both this and what was found in AWStats. The two differ wildly; most of the time the AWStats value is only half of the direct count. But I suspect that the "BEGIN_TIME" part counts all traffic. I will try to confirm this, and if it is so, I will open a pull request to change the counting in Froxlor.

But I think a better idea would be to change the system completely. I made some major changes to my Apache logs: for example, I rotate every day and the rotated logs are suffixed with the date, which makes it very easy for everybody to find the correct logs. So my suggestion would be to do something like this in Froxlor as well, let systemd rotate the logs at midnight, and then do the calculation of HTTP traffic at a later time (to reduce load) by just looking at yesterday's file. The traffic would then also be assigned to yesterday; currently the traffic from yesterday is written to the DB with today's date, which is confusing. Is there interest in such a big change?
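A minimal sketch of the "look at yesterday's file" part, under assumptions: the file naming scheme `DOMAIN-access.log-YYYYMMDD`, the log directory and the function name `yesterdays_log` are all hypothetical and only illustrate the idea of date-suffixed rotated logs.

```python
from datetime import date, timedelta
from pathlib import Path


def yesterdays_log(log_dir, domain, today=None):
    """Return (day, path) for the rotated log that covers yesterday.

    The naming scheme "<domain>-access.log-YYYYMMDD" is an assumption,
    not an existing Froxlor or logrotate default.
    """
    today = today or date.today()
    day = today - timedelta(days=1)
    return day, Path(log_dir) / f"{domain}-access.log-{day:%Y%m%d}"


day, path = yesterdays_log("/var/log/apache2", "example.com", today=date(2021, 3, 1))
# 'day' is the date the traffic would be booked on, not the date the job runs.
print(day.isoformat(), path)
```

Returning the date together with the path keeps the two in sync, so the traffic is stored under the day it actually occurred on, regardless of when the calculation runs.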

d00p commented 3 years ago

Well, we decided years ago to leave the traffic calculation to projects that are made for that (webalizer, awstats). So the "main" problem here would be a wrong/incorrect transfer of the webalizer/awstats values into the froxlor database for display to admins/customers.

tobyX commented 3 years ago

Ok, then I will check whether my assumption about the TIME section is correct, and if it is, I will send a pull request with a fix.

d00p commented 3 years ago

any news on that @tobyX ?

tobyX commented 3 years ago

Sorry, I did let my code run for some months, but the numbers never added up at all and sometimes were even negative, and I didn't find where the error is. And then other pressing issues came up... I will try to do it again and find out what went wrong.

d00p commented 3 years ago

So, I've just checked on this a little deeper:

As we learned from AWStats, we have to add the _TIME column "bandwidth not viewed" to catch traffic from robots, malware and with special HTTP return codes.

Wrong, the _TIME column only shows the viewed traffic; when adding up the values, it's exactly the same as _DOMAIN.

From what I've read, we need to add up the viewed traffic and the not viewed traffic. So I checked in the data file: the not viewed parts are _ROBOT, _WORMS and _ERRORS, but when adding these up, I get more than awstats shows for "not viewed traffic". Also, when adding up the _DOMAIN entries and dividing by 1024, I'm still getting more KB than awstats shows for "viewed traffic"... no idea where awstats gets these numbers from in its own data file... I might be missing something.
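To make this kind of cross-check reproducible, the per-section sums could be computed with a small helper like the one below. It is a sketch only: `sum_section_column` is a hypothetical name, the sample data is made up, and the bandwidth column index for each section (3 for DOMAIN, 2 for ROBOT, WORMS and ERRORS) is an assumption that has to be verified against the comment headers in a real data file.

```python
def sum_section_column(text, section, column):
    """Sum one whitespace-separated column of a BEGIN_<section>/END_<section> block."""
    total = 0
    inside = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == f"BEGIN_{section}":
            inside = True
        elif fields[0] == f"END_{section}":
            break
        elif inside and len(fields) > column:
            total += int(fields[column])
    return total


# Made-up sample; real files contain many more sections and rows.
SAMPLE = """\
BEGIN_DOMAIN 2
de 10 20 4096
us 5 8 1024
END_DOMAIN
BEGIN_ROBOT 1
googlebot 7 2048 20210301120000 1
END_ROBOT
BEGIN_WORMS 0
END_WORMS
BEGIN_ERRORS 2
404 3 512
301 1 256
END_ERRORS
"""

# Assumed position of the bandwidth column in each section.
BANDWIDTH_COLUMN = {"DOMAIN": 3, "ROBOT": 2, "WORMS": 2, "ERRORS": 2}

viewed = sum_section_column(SAMPLE, "DOMAIN", BANDWIDTH_COLUMN["DOMAIN"])
not_viewed = sum(
    sum_section_column(SAMPLE, name, BANDWIDTH_COLUMN[name])
    for name in ("ROBOT", "WORMS", "ERRORS")
)
print(f"viewed: {viewed / 1024:.1f} KB, not viewed: {not_viewed / 1024:.2f} KB")
```

Printing the sums per section next to the values from the awstats report would at least show which section the discrepancy comes from.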

Any ideas?

d00p commented 2 years ago

We've integrated 'goaccess' into the next major version of froxlor, where it will also be the new default.