Closed rtista closed 3 years ago
GET /api/auth/token
I'm guessing from a different protocol? e.g., GET /api/auth/token HTTP/1.0 vs GET /api/auth/token HTTP/1.1 vs GET /api/auth/token HTTP/2, etc.
Let me know if that's the reason. Also, it would help if you could grep for those entries and see if you find something different. Feel free to post those log entries here.
@rtista... My bet is that you have 2 options:
- Add %H to your log-format so the parser reads the HTTP request protocol, i.e. HTTP/1.0, HTTP/1.1 or HTTP/2.0;
- Ignore the request protocol with the option --http-protocol=no.
Please evaluate which option is best for you.
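As a sketch of what those two options might look like in a goaccess configuration file (the log-format line is adapted from the format posted later in this thread, with %r split into %m %U %H so the protocol is parsed explicitly — adjust to your actual log):

```
# Option 1: split "%r" into method/URL/protocol so %H is parsed
log-format %h - %^ [%d:%t %^] "%m %U %H" %s %b "%R" "%u" %T

# Option 2: keep "%r" and ignore the request protocol
http-protocol no
```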
I probably should have posted these before, but here they are; these are the configurations I'm using :man_facepalming:
config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
db-path /tmp/project/
persist true
restore true
keep-last 30
# Log configurations
time-format %T
date-format %d/%b/%Y
log_format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" %T
Here's an example log entry in case I'm missing something in the log_format :smile:
$IP_ADDRESS - - [24/Sep/2020:00:21:53 +0100] "POST /api/common/ping HTTP/1.1" 200 51 "-" "Laminas_Http_Client" 0.000
Also:
@rtista, please, try to get last version from Github. I had similar problem that was reported in issue #1933. I believe the patch a6e1825 solved all.
GET /api/auth/token
I'm guessing from a different protocol? e.g., GET /api/auth/token HTTP/1.0 vs GET /api/auth/token HTTP/1.1 vs GET /api/auth/token HTTP/2, etc. Let me know if that's the reason. Also, it would help if you could grep for those entries and see if you find something different. Feel free to post those log entries here.
Also, at each of these changes I always erased the BTREE database directories :smile: The token request is not the only request for which this happens, there are 2/3 more from what I could see.
I executed the following search in a history of logs (about 60 days), and I only found these results:
# zgrep "GET /api/auth/token" nginx_access*.gz | sed -e "s/\?.* HTTP/ HTTP/" | cut -d "\"" -f 2 | sort | uniq -c
2 GET /api/auth/token HTTP/1.0
68631131 GET /api/auth/token HTTP/1.1
@rtista, please, try to get last version from Github. I had similar problem that was reported in issue #1933. I believe the patch a6e1825 solved all.
Thank you @0bi-w6n-K3nobi I cannot do it right now, but I'll test it out tomorrow and I'll comment the results here!
I've executed a simple test with the same configurations but with the latest master branch build. In my test I used the exact same log file and the same process; by that I mean the zcat and sort to generate a new file, parsing this file with sed to replace ids, uids, etc. with the word $var, and then feeding this file to goaccess. The following happened after executing the same process twice for the same log:
After first execution:
After second execution:
It seems as if the last-timestamp-based log processing is not working, or am I doing something wrong?
I am also re-running the analysis for my history of logs so I can safely assume that the problem I first reported is fixed as it seems to be.
Are you able to run it the same way but with a smaller dataset? Would you be able to attach that dataset so I can try to reproduce exactly what you are doing and how you are running goaccess? Thanks.
Hello @allinurl I tested it with a smaller test dataset and the latest master branch version and I found that the timestamp of the latest request does not seem to be respected.
Here's the test dataset:
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.002
10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.024
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.002
10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.003
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "okhttp" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004
Running my process the first time, it generates the following:
The second time, all the values are doubled:
My process is the following:
# Sort by date into auxiliary file
zcat -f test.log | sort -k 4 > test-parsed.log
# Remove variations and unwanted HTTP methods
sed -E \
-e '/ "(OPTIONS|HEAD) \//d' \
-e '/ "(GET|POST|HEAD) \/ /d' \
-e '/\] "\\x/d' \
-e 's|/ HTTP| HTTP|' \
-e 's;/[0-9]+(/|\?| );/$var\1;g' \
-e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' -i test-parsed.log
# Persist data in BTREEs
/usr/local/bin/goaccess --process-and-exit -p test.conf -o reports/test.html
# Restore and generate report
/usr/local/bin/goaccess -p test.conf -o reports/test.html
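To sanity-check what the sed normalization above does, here is a quick one-liner on a made-up request line (the URL is hypothetical; only the two substitution rules are exercised):

```shell
# Numeric path segments collapse into the literal token $var,
# exactly as in the normalization step above.
echo '"GET /api/item/12345/detail?x=1 HTTP/1.1"' \
  | sed -E \
      -e 's;/[0-9]+(/|\?| );/$var\1;g' \
      -e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g'
# prints: "GET /api/item/$var/detail?x=1 HTTP/1.1"
```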
I'm using the following configuration:
config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
db-path /tmp/project/
persist true
restore true
keep-last 1
# Log configurations
time-format %T
date-format %d/%b/%Y
log_format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" "%^" "%v" %T
Please tell me if you have any doubt or need any assistance! And thank you very much for the help!
@rtista I must be missing something because I keep getting 10 regardless of how many times I run
./goaccess test-parsed.log --process-and-exit -p test.conf -o reports/test.html
Could you please rebase again from upstream and make sure the output of goaccess -V
shows 1.4.1? You could try uninstalling the previous version by running:
$ cd goaccess/
$ ./configure --enable-utf8
# make uninstall
@allinurl I've tested it with a rebase from upstream and goaccess -V shows 1.4.1. I've uninstalled my previous version and installed it again as 1.4.1 and I'm still getting the same data duplication problem. I believe that I've found the problem however.
It seems that if I make any modification to an already processed file, it becomes a new and valid file for GoAccess, and so it processes all its data once again. I think this is it, because if I comment out the ZCAT and SED commands, GoAccess no longer reprocesses the file no matter how many times I execute it.
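That hypothesis is easy to confirm outside of GoAccess: GNU sed -i edits "in place" by writing a temporary file and renaming it over the original, so the file's inode changes on every run (a small check, assuming GNU sed and coreutils on Linux):

```shell
# Show that sed -i gives the file a new inode, which makes it look
# like a brand-new log to any inode-based change detection.
tmp=$(mktemp)
echo 'GET /api/auth/token HTTP/1.1' > "$tmp"
before=$(stat -c %i "$tmp")
sed -i 's/token/token2/' "$tmp"    # rewrites via a temp file + rename
after=$(stat -c %i "$tmp")
echo "inode before=$before, after=$after"    # the two values differ
rm -f "$tmp"
```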
You can confirm it by using the following script:
# Read logfile from script arguments
LOGFILE=$1
# Check for log file
[[ ! -f $LOGFILE ]] && echo "Please provide a log file as first argument." && exit 1
# Do not override given log file
PARSED_LOG="parsed.log"
# Concatenate log files into a single file
zcat -f $LOGFILE | sort -k 4 > $PARSED_LOG
# Process log variations
sed -E \
-e '/ "(OPTIONS|HEAD) \//d' \
-e '/ "(GET|POST|HEAD) \/ /d' \
-e '/\] "\\x/d' \
-e 's|/ HTTP| HTTP|' \
-e 's;/[0-9]+(/|\?| );/$var\1;g' \
-e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' -i $PARSED_LOG
# Import log data into BTREE databases
goaccess --process-and-exit -p test.conf $PARSED_LOG
# Generate HTML from BTREE databases
goaccess -p test.conf -o report.html
And executing it like so:
bash test.sh test.log
If you execute this script 2 times for the same test.log (provided in previous comments), you should obtain a report with duplicated requests (20 in total). However, if you replace the ZCAT and SED related lines with PARSED_LOG=$LOGFILE
you should never get duplicated requests in the report again.
That's right. When parsing a file, it checks the inode, and since it's a new file, it adds the entries as dups. Can you try to simply pipe the data directly into goaccess (note the additional -
instead of the file name)? Let me know if that works for you. Thanks again!
# Concatenate log files into a single file
zcat -f $LOGFILE | sort -k 4 | sed -E \
-e '/ "(OPTIONS|HEAD) \//d' \
-e '/ "(GET|POST|HEAD) \/ /d' \
-e '/\] "\\x/d' \
-e 's|/ HTTP| HTTP|' \
-e 's;/[0-9]+(/|\?| );/$var\1;g' \
-e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' | goaccess - --process-and-exit -p test.conf
Yes, I believe now I can. That was my first attempt when developing this log analysis system; however, there was huge RAM consumption while the data was being piped into GoAccess, which would get it killed by the OOM killer. I believe you have since fixed that problem, as it no longer seems to occur.
However, now I'm facing a problem in which the whole log file is not processed. I'm processing a log of 692,571 lines and only 81,608 total requests are shown in the report. Any idea of what may be causing this? I even removed the SED and the SORT from the command but it still occurs.
Even if I test with test.log (the previously discussed one), only 4 requests appear in the Total Requests section. :cry:
Also the duplication problem seems to persist, every time I execute the analysis again for the same test.log, GoAccess adds three requests to the Total Requests section.
Hi @rtista.
Did you take a look at the --keep-last
parameter?
If I'm right, the default value is 7
.
Try it with zero, i.e. --keep-last=0
and see the result. See the manual for more details here.
Hello @0bi-w6n-K3nobi, for my last comment's tests, I've used a value of 10 for keep-last, however, the log file I'm working with contains entries of a single day (yesterday) and so it should not be due to the --keep-last parameter. The values for the test.log are of Oct 21 and so keep-last 10 should have no influence whatsoever.
hummm... The problem must be too obvious, so we can't see it. It seems stupid to ask you... Are you sure that report.html is generated every time? Try removing the report.html file before running.
Yes, the report.html is clearly being generated else I would not see increasing numbers in total requests when reprocessing the same log.
Ok ... let's go very slowly this time.
First of all, it is important to highlight what @allinurl has already said: "When parsing a file, it checks the inode..." If you create a new file every time, it will be processed as if it were a new/other file every time.
On the other hand, if you use - [pipe] as the input, then every file will be considered the same. In this case it is equivalent to saying that all of them are the same file with inode == 0. For that case, you might think it is the opposite of the 1st one.
For example, consider the following case:
cat LOG-a | goaccess - ...
where LOG-a goes until 13:00.
cat LOG-b | goaccess - ...
where LOG-b starts at 08:00.
In this case, all LOG-b records from 08:00 until 13:00 will be ignored.
However, if you run:
goaccess LOG-a ...
goaccess LOG-b ...
GoAccess will count all records from both LOGs, differently from the case above.
And finally, if you do:
cat LOG-a > LOG-c
goaccess LOG-c ...
GoAccess will duplicate the LOG-a entries, as LOG-c has the same content but a different inode.
This all happens because GoAccess uses inode + timestamp [date + time] as a marker to avoid reprocessing records from the same source.
That's all @rtista. I hope my explanations were clear.
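The LOG-a/LOG-b pipe scenario can be modelled in a couple of lines of awk (a toy sketch: the persisted marker is just a shell variable here, and the timestamps are simplified to HHMM numbers):

```shell
# Pretend a previous pipe run ended at 13:00 and that timestamp was
# persisted as the marker; records at or before it are then ignored.
last_ts=1300

printf '%s\n' 0800 1200 1330 1400 \
  | awk -v last="$last_ts" '$1 + 0 > last + 0'
# prints only the records after 13:00: 1330 and 1400
```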
@0bi-w6n-K3nobi the way you explained it is just the way it worked in my mind. I just thought that the inode-based strategy would also take the timestamp into account in order to provide a safer approach, but it doesn't, and that's fine. My first problem is fixed by understanding that the inode-based approach does not take the timestamp into account if the inode does not match.
However, a second problem arose from @allinurl's test case. Piping the test.log into GoAccess generates a report of 4 Total Requests, when it should show 10 Total Requests. Besides this, if I execute the test.sh script multiple times, 3 more requests are added to the Total Requests, which should be impossible since it should be using the timestamp-based approach. I'm leaving the details below so you can try to reproduce. I'm using the latest build from the master branch.
test.log:
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.002
10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.024
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.002
10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.003
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "okhttp" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004
test.conf
config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
persist true
restore true
time-format %T
date-format %d/%b/%Y
log_format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" "%^" "%v" %T
db-path data/test/
keep-last 10
test.sh:
#!/bin/bash
[[ ! -d data/test ]] && mkdir -p data/test/
cat test.log | goaccess --process-and-exit -p test.conf -
goaccess -p test.conf -o test.html
@rtista Thanks for posting this. I'm looking into this and I'll post back asap.
Thank you very much @allinurl, take your time! I'm just trying to be of service in any way I'm able! ;)
Hi @allinurl.
The timestamp strategy is the problem: the second record with the same timestamp will always be ignored. Remember that PIPE is the same as a file with inode == 0.
Of course, the same processing will be correct if you disable the --restore and --persist options.
We will have to think of something more elaborate. Who knows, maybe a simplified CRC of record + timestamp for the case where the input is PIPE, i.e. inode == 0?
See you soon.
@rtista @0bi-w6n-K3nobi That's right. The issue occurs with if (!glog->inode && lp.ts >= glog->lp.ts). When attempting to restore from disk with no inode (a pipe), we compare timestamps, but the comparison is exclusive. This was implemented per this comment:
The timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this.
We could change it to be inclusive, with the risk of having dups, keep it the way it is, or, like you said, get more elaborate with e.g., crc + ts. Thoughts?
By the way, the checksum will probably need to cover multiple lines, since two or more consecutive requests could potentially be exactly the same.
@allinurl and @0bi-w6n-K3nobi are you sure that that explains the behavior I described?
When I first import the dataset (test.log) via pipe, I'm getting 4 Total Requests analyzed, and these requests match the first request for each timestamp:
Requested Files Panel:
First request for each timestamp in the dataset (test.log):
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004
When I import the same dataset once more, the Total Requests is incremented by 3, resulting in the addition of the following requests:
These 3 requests correspond to the first for each timestamp (21/Oct/2020:23:26:58 +0100, 21/Oct/2020:23:26:59 +0100, 21/Oct/2020:23:27:00 +0100), excluding the first timestamp (21/Oct/2020:23:26:57 +0100). Could we maybe be saving the timestamp of the first request instead of that of the last request?
Also, explain something to me, when we import from pipe (with --restore and --persist), we:
Or:
?
Because if I'm importing the dataset for the first time, all 10 dataset lines should be processed into the database, not only 4. What I mean is that the timestamp check should only verify that the request timestamp is exclusively larger than the timestamp of the last request in the database, and not compare timestamps between requests of the same piped data.
Good morning, @rtista. The same here!
For the first run, I got:
and for the second:
Yeah, you are right, @allinurl: a checksum will probably have problems with consecutive requests that are exactly the same. Soon after the post above I concluded this too. And a checksum would add some overhead to request processing.
I thought of a marker something like timestamp + repeated timestamp count, but I think it can be difficult to implement. I'll be more clear: using the LOG above, the markers would be:
timestamp1 + 1 -- 10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 ...
timestamp1 + 2 -- 10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 ...
timestamp1 + 3 -- 10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it ...
timestamp1 + 4 -- 10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 ...
timestamp2 + 1 -- 10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 ...
timestamp2 + 2 -- 10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 ...
timestamp2 + 3 -- 10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 ...
timestamp3 + 1 -- 10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it ...
timestamp3 + 2 -- 10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it ...
timestamp4 + 1 -- 10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 ...
With this algorithm, when processing the current file it becomes possible to process all records/requests, because records with the same timestamp are also counted. But, again, I think it can be difficult to implement, because we would always have to calculate the number of repetitions for the current timestamp.
PS: timestamp1 + 1 isn't an arithmetic add but a composite key with two values. Or it could be something like timestamp + 1/1000 as an arithmetic add.
Good morning @0bi-w6n-K3nobi :smile:
Why would we need to do this? Can't we just allow all timestamps that are larger than the one from the last request in the database?
Comparing records of the same log against each other creates the restriction that the log must be sorted by timestamp, which requires extra processing when concatenating several logs from the same application (generated on different machines). We only need to compare each piped record's timestamp against the timestamp of the last inserted log record, as users just want to reprocess the same log without adding duplicates.
If the problem is not in the comparison but in the insertion into the database, we could probably generate a key from the timestamp and the piped data iteration (the index of the record in the file). I believe that would already provide enough entropy without becoming too complex, albeit being a little under-engineered.
@rtista, good afternoon.
I agree about the restriction of processing records of the same log. But regarding the database, it's more complicated.
The implemented solution aims at not reprocessing records from the same LOG, and in particular at any problem that exists when it is rotated [e.g. logrotate]. Therefore, it is impossible/impracticable to use the index of the record in the file.
In addition, we will have to find a solution together with @allinurl.
Hi @allinurl.
Perhaps now is a good time to implement an --ignore-timestamp option, as a workaround so @rtista can process her LOGs, and soon elaborate a more complete/useful solution for the timestamp marker.
See you soon.
@0bi-w6n-K3nobi and @allinurl, ignoring the timestamp is not entirely what I want or need; there are two distinct problems here:
From my understanding, the solutions to the problems should be:
Hi @rtista,
your conclusions/observations are very common sense.
Hi @allinurl,
I believe that we can use @rtista's ideas, mainly for the PIPE input. In the case of a common LOG, it would probably be necessary to take into account what has already been implemented in patch 2e14dc3.
We can then summarize these ideas as:
a) with the --restore option, load the value of the last marker timestamp + inode (BD_TS) for each LOG and PIPE;
b) when processing a LOG/PIPE, use the global value BD_TS, and not the value of the last record;
c) with the --persist option, save the highest value of timestamp into the BD_TS field when finishing the processing.
These conditions, which I think should be implemented at least for the PIPE input, would allow/accomplish:
- processing several LOGs, either from different VHOSTs or from different servers;
- the BD_TS value restriction applying only to the initial timestamp of the records, and not to the order or final value of these timestamps;
- no comparison of timestamps with each other in the same LOG/PIPE.
These conclusions lead to a consideration, or consequence: the user becomes more responsible for what he is doing. For example, it should be correct to feed several LOGs of the same day as a single PIPE input.
But all of this leads to greater freedom, and not to ignoring requests that came before.
See you soon.
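Points (a)-(c) can be prototyped in a few lines of shell (a toy model: a state file stands in for BD_TS, and records are bare numeric timestamps; this is not GoAccess code):

```shell
# a) load the persisted marker once at startup,
# b) compare every record against that fixed session-start value,
# c) persist the highest timestamp only when the run finishes.
state=$(mktemp); echo 0 > "$state"
records=$(mktemp); printf '%s\n' 10 20 30 > "$records"

process() {
  last=$(cat "$state")
  awk -v last="$last" -v statefile="$state" '
    BEGIN { max = last + 0 }
    $1 + 0 > last + 0 { print }          # compare against session-start value
    { if ($1 + 0 > max) max = $1 + 0 }
    END { print max > statefile }        # persist only at the end of the run
  ' "$records"
}

process   # first run prints all three records
process   # second run prints nothing: everything is at/below the marker
rm -f "$state" "$records"
```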
@rtista Thanks for sending those ideas. I think the issue right now, at least for the initial parse, is the use of --restore: that creates the conflict, as you are not restoring anything at that point. Thus, I believe that is why @0bi-w6n-K3nobi was suggesting --ignore-timestamp?
@0bi-w6n-K3nobi, can I ask what BD_TS means here?
Hi @allinurl.
BD_TS is the value of the marker timestamp + inode that has already been saved to the B-Tree hash on disk with the --persist option.
I used this term only to be clearer: it must be compared with the timestamp of the current request, and not with the timestamp of the previous record.
I tried to compile the ideas from @rtista above.
I think the issue right now, at least for the initial parse is the use of --restore, that creates the conflict as you are not restoring anything at that point.
@allinurl I can confirm this, without the --restore option the first import does indeed import all of the 10 requests.
The second import, however, keeps importing only the first request for each timestamp that is larger than the first request's timestamp, rather than larger than the last request's timestamp.
@0bi-w6n-K3nobi I'm trying to understand your post a bit more. Could you explain what the purpose of the global timestamp + inode value would be?
Hi @allinurl.
I will try. I will elaborate and post here as soon as possible.
Hi @allinurl.
Well... when I said global, it was just a way of expressing myself. The correct term would be per session/running.
So, here we go. When GoAccess runs with the --restore option and the PIPE input -- our main difficulty -- I believe the behaviour should be:
1. load the marker timestamp + inode -- where PIPE is equal to saying inode == 0 -- ok, this part is ready;
2. compare against the loaded value -- and not against the timestamp value of the previous record;
3. with the --persist option, when finishing the processing, save the highest timestamp value found in the current PIPE/LOG to disk, into the marker field.
I call this process, by nickname, the per session/running marker.
As I said before, it leads to some consequences:
- the restriction applies only to the initial timestamp of the requests that will be processed, and not to the order or highest value of these timestamps, and...
- the whole PIPE input is processed without problems -- i.e. all requests will be computed.
For a physical LOG [not PIPE], I believe we will still need to add some complex rules like your patch 2e14dc3, according to the reasons that led to that patch.
I hope that was clearer. Otherwise give me feedback on this. I see you soon.
Well... @allinurl, it seems pretty easy: just move the call to ht_insert_last_parse (0, glog->lp), per file and PIPE, to the end, when saving to disk -- and not on each processed request. This is the challenge.
I see you soon.
Hey guys, just to say that I agree with @0bi-w6n-K3nobi's implementation. It seems to solve my issues and requires less pre-processing when feeding piped data from several files, as this data no longer needs to be previously ordered by timestamp (a very time-consuming process for large amounts of logs), only newer than the last persisted timestamp.
@rtista glad to hear that helped. @0bi-w6n-K3nobi, sorry, I guess I'm not seeing the fix. Could you please elaborate a bit more on how moving ht_insert_last_parse (0, glog->lp) to the end (of?) will address this issue when piping data in? Thanks
By the way, just to be clear, I think the current implementation works fine when processing a log. I think the issue we're having is when feeding from a pipe since there's no last line or size of the pipe.
Hi @allinurl.
Let's take a closer look at the code below, in parse.c:
static int
should_restore_from_disk (GLog * glog) {
GLastParse lp = { 0 };
if (!conf.restore)
return 0;
lp = ht_get_last_parse (glog->inode); ** 1 **
/* restoring from disk, and no inode (probably a pipe), compare timestamps then
* (exclusive) */
if (!glog->inode && lp.ts >= glog->lp.ts) ** 2 **
return 1;
...
static int
pre_process_log (GLog * glog, char *line, int dry_run) {
...
glog->lp.ts = mktime (&logitem->dt); ** 3 **
if (should_restore_from_disk (glog))
goto cleanup;
count_process (glog);
...
cleanup:
free_glog (logitem);
ht_insert_last_parse (0, glog->lp); ** 4 **
return ret;
}
...
static int
read_log (GLog ** glog, const char *fn, int dry_run) {
...
/* insert the inode of the file parsed and the last line parsed */
if ((*glog)->inode) {
(*glog)->lp.line = (*glog)->read;
ht_insert_last_parse ((*glog)->inode, (*glog)->lp); ** 5 **
}
As you noted, this code works well for a normal LOG, i.e. not a PIPE.
The first observation/problem with this code is at line 4 (remembering here that the first param is the inode). The problem is that all LOGs, and even the PIPE input, will update last_parse under entry 0 in the HASH for each request processed. But entry 0 matches the PIPE input too.
Line 1 will retrieve last_parse by the inode entry in the HASH, and if it is a PIPE then that will be entry 0. Oops... but at 4, entry 0 was saved for each record.
So, if we are processing the PIPE input, line 2 will only process requests with a timestamp greater than the last_parse loaded at line 1. But that value has been updated for each record at line 4. The problem is HERE!
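The "first request per timestamp" symptom falls directly out of that per-record update. A few lines of awk model the logic at markers 2 and 4 (a sketch, not the real parser; the input is the seconds field of the 10 sample records):

```shell
# Model of the buggy flow: the stored marker is updated after every
# record, so a record survives only when its timestamp is strictly
# greater than the previous record's, i.e. the first per timestamp.
printf '%s\n' 57 57 57 57 58 58 58 59 59 60 \
  | awk '{ if (stored >= $1 + 0) next;   # marker 2: "restore", skip
           print                         # record gets processed
           stored = $1 + 0 }'            # marker 4: update per record
# keeps only 4 of the 10 records: 57 58 59 60
```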
To process per session, as I mentioned above, we have to:
- separate the global last_parse from the PIPE last_parse, i.e. entry == 0 -- i.e. change line 4;
- save the PIPE last_parse at the same point as if it were a physical LOG -- i.e. include it at line 5.
In other words, the code would be:
```c
  ...
  ht_insert_last_parse (UINT_MAX, glog->lp);             /* (4) */
  ...
--if ((*glog)->inode) {   /* remove this, to include PIPE saving too */
    (*glog)->lp.line = (*glog)->read;
    ht_insert_last_parse ((*glog)->inode, (*glog)->lp);  /* (5) */
--}
```
This ensures that the value of `lp.ts` at line (2) was loaded from the previous session, not from the previous record.

Therefore, applying this modification, requests from the PIPE input will only be ignored if their timestamp is less than the previous session's, saved under `last_parse` HASH entry `0`.
Now, is it clearer? Ok, there may be some side effect that I haven't thought of.
See you soon.
Hi @allinurl. Good morning.
Soon after the post above, I thought to myself... what is the purpose of line (4)? What was it meant to solve? I talked about it right after proposing the modification above, so line (4) will definitely have its meaning questioned.
However, I think it is even more complex to implement saving the highest timestamp value of the session to the HASH on disk. This is what we want to implement for the PIPE input.
Here, I will try to describe the process that should be implemented.
We could bring back the old idea of the flag `--ignore-timestamp`. But now I am thinking of something more elaborate, and the most appropriate name would be the flag `--ordered-timestamp`.
For this new flag, we would have the associated variable `conf.ordered_timestamp`, under the following conditions:

- `ordered-timestamp = yes` -- `conf.ordered_timestamp = 1`;
- `ordered-timestamp = no` -- `conf.ordered_timestamp = 0`;
- `ordered-timestamp` unset -- `conf.ordered_timestamp = -1`.

And if `ordered-timestamp` is unset, i.e. `conf.ordered_timestamp == -1`, we can assume a default behaviour based on the input type:

- LOG input -- `ordered-timestamp = yes`;
- PIPE input -- `ordered-timestamp = no`.
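As a minimal sketch of that default resolution (the helper name `effective_ordered_timestamp` and its parameters are hypothetical, not part of GoAccess):

```c
/* Hypothetical helper (not GoAccess code): resolve the tri-state
 * conf.ordered_timestamp (1 = yes, 0 = no, -1 = unset). "is_pipe"
 * stands for glog->inode == 0. */
static int
effective_ordered_timestamp (int ordered, int is_pipe) {
  if (ordered != -1)
    return ordered;          /* an explicit yes/no from the flag wins */
  return is_pipe ? 0 : 1;    /* default: LOG input = yes, PIPE input = no */
}
```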
Then we can make the following changes to the code:
```c
static int
should_restore_from_disk (GLog * glog) {
  GLastParse lp = { 0 };
  ...
  if (!glog->inode) {
    if (conf.ordered_timestamp == 1 && lp.ts > glog->lp.ts)
      return 1;
    else
      return 0;
  }
  ...
  if (conf.ordered_timestamp == 0 || glog->lp.ts >= lp.ts)
    return 0;
  ...
  else if (glog->size < lp.size && (conf.ordered_timestamp == 0 || glog->lp.ts == lp.ts))
    return 0;
  ...
  else if (glog->size > lp.size && glog->read >= lp.line && (conf.ordered_timestamp == 0 || glog->lp.ts == lp.ts))
    return 0;
```
One observation to make here -- please pay attention to the symbols `>` and `>=`: I changed the comparisons to `lp.ts > glog->lp.ts` and `glog->lp.ts >= lp.ts`. I believe it is better to include requests with the same timestamp than to ignore them. Keep in mind a server that handles many requests within the same second.
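To make the same-second point concrete, a tiny sketch (the predicate names are mine, not GoAccess's): the old exclusive check skips a record whose timestamp equals the restored one, while the proposed comparison lets it through.

```c
/* Illustrative predicates, not GoAccess code: return 1 if the record
 * would be skipped given the restored timestamp. */
static int
skip_old (long restored_ts, long record_ts) {
  return restored_ts >= record_ts;   /* ">=" drops same-second requests */
}

static int
skip_new (long restored_ts, long record_ts) {
  return restored_ts > record_ts;    /* ">" keeps same-second requests */
}
```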
Continuing the changes, to find and save the highest timestamp value...

```c
static int
pre_process_log (GLog * glog, char *line, int dry_run) {
  ...
cleanup:
  free_glog (logitem);

  session_lp = ht_get_last_parse_session (glog->inode);
  if (session_lp.ts < glog->lp.ts) {
    session_lp.ts = glog->lp.ts;
    ht_insert_last_parse_session (glog->inode, session_lp);
  }
  ...
```
```c
static int
read_log (GLog ** glog, const char *fn, int dry_run) {
  ...
  if ((*glog)->inode) {
    session_lp = ht_get_last_parse_session ((*glog)->inode);
    (*glog)->lp.line = (*glog)->read;
    (*glog)->lp.ts = session_lp.ts;
    ht_insert_last_parse ((*glog)->inode, (*glog)->lp);
  } else {
    session_lp = ht_get_last_parse_session (0);
    (*glog)->lp.line = 0;   /* ignore # of lines? */
    (*glog)->lp.ts = session_lp.ts;
    ht_insert_last_parse (0, (*glog)->lp);
  }
```
If you don't want to build a new HASH function, you can use this trick (note there must be no space between the macro name and its parameter list, otherwise the preprocessor defines an object-like macro):

```c
#define ht_insert_last_parse_session(_inode, _lp)  ht_insert_last_parse (INT_MAX + (_inode), _lp)
#define ht_get_last_parse_session(_inode)          ht_get_last_parse (INT_MAX + (_inode))
```

But do not forget to delete these values before saving to disk.
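A toy sketch of why the offset keeps the two key spaces apart (the store and the `put`/`get` names here are illustrative, not GoAccess code): session entries live at `INT_MAX + inode`, so they cannot collide with plain inode keys, including the pipe's key `0`, as long as real inodes stay below `INT_MAX`.

```c
#include <limits.h>
#include <stdint.h>

/* Toy key/value store standing in for the last-parse HASH
 * (illustrative only, not GoAccess code). */
typedef struct { uint64_t key; long ts; } Entry;
static Entry store[64];
static int nstore = 0;

static void
put (uint64_t key, long ts) {
  for (int i = 0; i < nstore; i++)
    if (store[i].key == key) { store[i].ts = ts; return; }
  store[nstore].key = key;
  store[nstore++].ts = ts;
}

static long
get (uint64_t key) {
  for (int i = 0; i < nstore; i++)
    if (store[i].key == key)
      return store[i].ts;
  return 0;
}

/* Same offset trick as the macros above: disjoint key ranges. */
#define put_session(_inode, _ts)  put ((uint64_t) INT_MAX + (_inode), (_ts))
#define get_session(_inode)       get ((uint64_t) INT_MAX + (_inode))
```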
Now, I think I have addressed the whole problem. Then that's it.

-------
PS: just a note
-------

An observation, through an example of course: what would be the behaviour of the program, due to line (5), if in the 1st execution we do:

`goaccess --restore --persist log-a log-b`

and in the 2nd run:

`goaccess --restore --persist log-b log-a`

We are assuming that the LOGs are static, unchanged. Oops...
OK, everything is fine if you run GOACCESS with the LOGs always in the same order, or with a single LOG.
Doing something more elaborate, independent of the order of the LOGs, would require changing the logic of the whole program: it would have to use a line delta [# of lines within the individual LOG] instead of the current global line count.

-------
End of note
-------

Phew! See you soon.
Hi @allinurl.
Regarding the note above, if you want to develop/think about it...
We would have to use individual values for each LOG, instead of the global values of `glog->size` and `glog->read`. Perhaps you could add 2 more variables to the `GLog` structure, since the program processes one LOG at a time, and also use a session HASH function and save the value at the end, in `read_log`.
See you soon.
Hi @0bi-w6n-K3nobi Thanks for all the details on this. First of all, are we trying to fix the issue where some duplicate values are not counted because the last timestamp is equal to the currently read timestamp (chronologically ordered data, e.g., 29/Oct/2020, 30/May/2020, 31/Oct/2020) or are we addressing the issue where some values are not counted because the pipe or log contains entries that are not chronologically ordered, e.g., 31/Oct/2020, 10/May/2020, 30/Oct/2020?
Also, I want to ask you, you mentioned the following:
So, if we are processing the PIPE input -- The line 2 will only process requests with timestamp greater than last_parse loaded on the line 1 . But this value has been updated for each record in line 4 . The problem is HERE!
True, the timestamp value is updated for each record as you noted, but I still fail to see how moving it to `read_log()` would solve the issue. `glog->lp` would still contain the last parsed value (the highest timestamp, assuming chronologically ordered data).
Sorry, I may be missing something from your post.
Hi @allinurl.
Yeah, you are right. If we are processing only the PIPE input, the code leads to this.
In the post, I tried to give a more complete and complex solution.
I will try to debug and understand the problem with this particular case, and then add the necessary rectifications/notes.
Thanks for showing me the exact point/problem.
See you soon.
Hi @allinurl.
After thinking a little about the code, I think I can explain why it does not have the expected behaviour. Yeah, I didn't even need to DEBUG for that. ;)
First, let's use a visual resource. We can run the program with the following call:

`cat test.log | awk '{$7 = NR""$7; print $0}' | goaccess -p test.conf -o test.html`

If you prefer, you can run it in 2 steps as described above; the results are the same in both cases. Doing this, we can see exactly which requests are processed.
So, come on. Look carefully at the post with the source code and its line numbering. The first time we run the program, we will have, for the processed requests:
```
 #   loaded hash TS/line (1)   valid/processed   saved hash ts/line (4)
 1   ---/0                     yes               21/Oct/2020:23:26:57
 2   21/Oct/2020:23:26:57      no                ...:57
 3   ...:57                    no                ...:57
 4   ...:57                    no                ...:57
 5   ...:57                    yes               21/Oct/2020:23:26:58
 6   21/Oct/2020:23:26:58      no                ...:58
 7   ...:58                    no                ...:58
 8   ...:58                    yes               21/Oct/2020:23:26:59
 9   21/Oct/2020:23:26:59      no                ...:59
10   ...:59                    yes               21/Oct/2020:23:27:00
```
This is a simple execution table, request after request, with the TIMESTAMP values loaded at line (1) and saved at line (4).
Remember that the HASH on disk is empty when starting this 1st execution, and that the TIMESTAMP saved at line (4) is that of the request itself.
You can check the result in the `test.html` file. The Requested Files (URLs) panel will show requests # 1, 5, 8 and 10.
So far the behaviour is exactly what you expect from it.
Is it clear so far? Now, suspense... what will happen? :)
Let's go to the processing of the 2nd run:
```
 #   loaded hash TS/line (1)   valid/processed   saved hash ts/line (4)
 1   21/Oct/2020:23:27:00      no                21/Oct/2020:23:26:57
 2   21/Oct/2020:23:26:57      no                ...:57
 3   ...:57                    no                ...:57
 4   ...:57                    no                ...:57
 5   ...:57                    yes               21/Oct/2020:23:26:58
 6   21/Oct/2020:23:26:58      no                ...:58
 7   ...:58                    no                ...:58
 8   ...:58                    yes               21/Oct/2020:23:26:59
 9   21/Oct/2020:23:26:59      no                ...:59
10   ...:59                    yes               21/Oct/2020:23:27:00
```
Again, looking at the `test.html` file and its panel, we will see: 2x requests # 5, 8 and 10, and 1x request # 1.
Is it clear? It is clear that line (4) is the problem. Moving it to `read_log()` will prevent this behaviour.
Then, the TIMESTAMP used at line (2) will always be the value from the previous execution, loaded from the HASH saved on disk, and not from the previous request.
I hope I have now clarified the problem. See you soon.
PS: what are we addressing?

1. The issue where some values are not counted for the PIPE input -- by moving line (4) to `read_log()`;
2. The issue where the log contains entries that are not chronologically ordered -- by creating the flag `--ordered-timestamp`, with a default behaviour [flag unset] based on the input type, physical LOG or PIPE.
Patch applied that solves the PIPE issue with missing count and restoring from disk. Please let me know how it goes by building from upstream. I'll be releasing v1.4.1 pretty soon. Thanks again!
Hello! I'm currently trying to automate report generation for a project with several deployed nodes. These nodes write to different logs, and I'm generating the report from those logs. Each log corresponds to one day of accesses, and they are rotated daily around 3 AM.
My script receives several log files, concatenates them, sorts them by the date field, and then uses SED to remove variable fields from the request URLs, such as IDs, hashes and others, replacing them with the word "$var". A single log file is generated from this process, which I then feed to GoAccess, first persisting the data and then updating the report HTML with the data from the last log file.
The problem is that, after loading 4/5 days of logs, the behaviour in the following image shows up in the "Requested Files" panel:
As you can see, there are several "GET /api/auth/token" entries when only one should appear. Does anyone know why this happens? Is it a bug, or am I doing something wrong?