allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Duplicated request entries on "Requested Files" panel #1945

Closed rtista closed 3 years ago

rtista commented 3 years ago

Hello! I'm currently trying to automate report generation for a project with several deployed nodes. These nodes write to different logs, and I'm generating the report from those logs. Each log covers one day of accesses, and they are rotated daily around 3 AM.

# Concatenate log files into a single file
zcat -f $LOGS | sort -k $SORT > $PARSED_LOG

# Process log variations
bash $BASEDIR/process-log-variations.sh $PARSED_LOG

# Import log data into BTREE databases
goaccess --process-and-exit -p $conf $PARSED_LOG

# Generate HTML from BTREE databases
goaccess -p $conf -o $REPDIR/report.html

My script receives several log files, concatenates them, sorts them by the date field, and then, using SED, removes variable fields such as IDs, hashes, and others from the request URLs, replacing them with the word "$var". A single log file is generated by this process, which I then feed to GoAccess, first persisting the data and then updating the report HTML with the data from the last log file.

The problem is that, after loading 4 or 5 days of logs, the behavior shown in the following image occurs in the "Requested Files" panel:

[screenshot: "Requested Files" panel showing several duplicate "GET /api/auth/token" entries]

As you can see, there are several "GET /api/auth/token" entries when only one should appear. Does anyone know why this happens? Is it a bug or am I doing something wrong?

allinurl commented 3 years ago

I'm guessing the GET /api/auth/token entries come from different protocols? e.g., GET /api/auth/token HTTP/1.0 vs GET /api/auth/token HTTP/1.1 vs GET /api/auth/token HTTP/2, etc.

Let me know if that's the reason. Also, it would help if you could grep for those entries and see if you find something different. Feel free to post those log entries here.

0bi-w6n-K3nobi commented 3 years ago

@rtista... My bet is that you have 2 options:

  • Add %H to your log-format, to parse the HTTP request protocol, i.e., HTTP/1.0, HTTP/1.1 or HTTP/2.0;
  • Ignore the request protocol with the option --http-protocol=no.

Please evaluate which option is best for you.

rtista commented 3 years ago

@rtista... My bet is that you have 2 options:

  • Add %H to your log-format, to parse the HTTP request protocol, i.e., HTTP/1.0, HTTP/1.1 or HTTP/2.0;
  • Ignore the request protocol with the option --http-protocol=no.

Please evaluate which option is best for you.

I probably should have posted these before, but here they are; these are the configurations I'm using :man_facepalming:

config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
db-path /tmp/project/
persist true
restore true
keep-last 30

# Log configurations
time-format %T
date-format %d/%b/%Y
log-format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" %T

Here's an example log entry in case I'm missing something in the log-format :smile:

$IP_ADDRESS - - [24/Sep/2020:00:21:53 +0100] "POST /api/common/ping HTTP/1.1" 200 51 "-" "Laminas_Http_Client" 0.000

Also:

  1. The log-format was generated from the nginx log_format directive using the nginx2goaccess.sh script.
  2. I'm using persist and restore to incrementally add data to the in-disk BTREE databases.
  3. I've already disabled http-protocol as I thought it might be the cause, but the problem persists. I am doing it via the configuration file, however; I'm not sure if the behavior differs.

0bi-w6n-K3nobi commented 3 years ago

@rtista, please try the latest version from GitHub. I had a similar problem that was reported in issue #1933. I believe the patch a6e1825 solved it all.

rtista commented 3 years ago

I'm guessing the GET /api/auth/token entries come from different protocols? e.g., GET /api/auth/token HTTP/1.0 vs GET /api/auth/token HTTP/1.1 vs GET /api/auth/token HTTP/2, etc.

Let me know if that's the reason. Also, it would help if you could grep for those entries and see if you find something different. Feel free to post those log entries here.

  1. I first thought it could be something related to the HTTP protocol, so I disabled it in GoAccess, but the problem persisted.
  2. I then replaced, using SED, all entries in the log with HTTP/1.1 so that wouldn't happen, fed the log to GoAccess, and it still happened.
  3. From there I went to check the logs, using grep and cat -v, but the only difference I found was that some had the query string. I disabled it in GoAccess and the problem persisted. I then used SED to get rid of it, processed the log with GoAccess, and the problem still persisted.

Also, at each of these changes I always erased the BTREE database directories :smile: The token request is not the only request for which this happens; there are 2 or 3 more from what I could see.

I executed the following search in a history of logs (about 60 days), and I only found these results:

# zgrep "GET /api/auth/token" nginx_access*.gz | sed -e "s/\?.* HTTP/ HTTP/" | cut -d "\"" -f 2 | sort | uniq -c
      2 GET /api/auth/token HTTP/1.0
68631131 GET /api/auth/token HTTP/1.1

rtista commented 3 years ago

@rtista, please try the latest version from GitHub. I had a similar problem that was reported in issue #1933. I believe the patch a6e1825 solved it all.

Thank you @0bi-w6n-K3nobi! I cannot do it right now, but I'll test it out tomorrow and post the results here!

rtista commented 3 years ago

I've executed a simple test with the same configurations but with the latest master branch build. In my test, I used the exact same log file and the same process, by which I mean the zcat and sort to generate a new file, parsing this file with SED to replace IDs, UIDs, etc. with the word $var, and then feeding this file to goaccess. The following happened after executing the same process twice for the same log:

  1. The global counters were doubled:

After the first execution: [screenshot]

After the second execution: [screenshot]

  1. The "Requested Files" panel does not show duplicated requests, however, it shows the double of the request count per request.

It seems as if the last-timestamp-based log processing is not working, or am I doing something wrong?

I am also re-running the analysis on my history of logs so I can safely confirm that the problem I first reported is fixed, as it seems to be.

allinurl commented 3 years ago

Are you able to run it the same way but with a smaller dataset? Would you be able to attach that dataset so I can try to reproduce exactly what you are doing and how you are running goaccess? Thanks.

rtista commented 3 years ago

Hello @allinurl, I tested it with a smaller test dataset and the latest master branch version, and I found that the timestamp of the latest request does not seem to be respected.

Here's the test dataset:

10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.002
10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.024
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.002
10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.003
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "okhttp" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004

Running my process the first time generates the following: [screenshot]

The second time, all the values are doubled: [screenshot]

My process is the following:

# Sort by date into auxiliary file
zcat -f test.log | sort -k 4 > test-parsed.log

# Remove variations and unwanted HTTP methods
sed -E \
    -e '/ "(OPTIONS|HEAD) \//d' \
    -e '/ "(GET|POST|HEAD) \/ /d' \
    -e '/\] "\\x/d' \
    -e 's|/ HTTP| HTTP|' \
    -e 's;/[0-9]+(/|\?| );/$var\1;g' \
    -e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' -i test-parsed.log

# Persist data in BTREEs
/usr/local/bin/goaccess test-parsed.log --process-and-exit -p test.conf -o reports/test.html

# Restore and generate report
/usr/local/bin/goaccess -p test.conf -o reports/test.html

I'm using the following configuration:

config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
db-path /tmp/project/
persist true
restore true
keep-last 1

# Log configurations
time-format %T
date-format %d/%b/%Y
log-format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" "%^" "%v" %T

Please tell me if you have any doubts or need any assistance! And thank you very much for the help!

allinurl commented 3 years ago

@rtista I must be missing something, because I keep getting 10 regardless of how many times I run

./goaccess test-parsed.log --process-and-exit -p test.conf -o reports/test.html

[screenshot: report output for 2020-10-22]

Could you please rebase again from upstream and make sure the output of goaccess -V shows 1.4.1? You could try uninstalling the previous version by running:

$ cd goaccess/
$ ./configure --enable-utf8
# make uninstall

rtista commented 3 years ago

@allinurl I've tested it with a rebase from upstream, and goaccess -V shows 1.4.1. I've uninstalled my previous version and reinstalled it as 1.4.1, and I'm still getting the same data duplication problem. I believe I've found the problem, however.

It seems that if I make any modification to an already processed file, it becomes a new and valid file for GoAccess, which then processes all its data once again. I think this is it, because if I comment out the ZCAT and SED commands, GoAccess no longer reprocesses the file no matter how many times I execute it.

You can confirm it by using the following script:

# Read logfile from script arguments
LOGFILE=$1

# Check for log file
[[ ! -f $LOGFILE ]] && echo "Please provide a log file as first argument." && exit 1

# Do not override given log file
PARSED_LOG="parsed.log"

# Concatenate log files into a single file
zcat -f $LOGFILE | sort -k 4 > $PARSED_LOG

# Process log variations
sed -E \
    -e '/ "(OPTIONS|HEAD) \//d' \
    -e '/ "(GET|POST|HEAD) \/ /d' \
    -e '/\] "\\x/d' \
    -e 's|/ HTTP| HTTP|' \
    -e 's;/[0-9]+(/|\?| );/$var\1;g' \
    -e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' -i $PARSED_LOG

# Import log data into BTREE databases
goaccess --process-and-exit -p test.conf $PARSED_LOG

# Generate HTML from BTREE databases
goaccess -p test.conf -o report.html

And executing it like so:

bash test.sh test.log

If you execute this script twice for the same test.log (provided in previous comments), you should obtain a report with duplicated requests (20 in total). However, if you replace the ZCAT- and SED-related lines with PARSED_LOG=$LOGFILE, you should never get duplicated requests in the report again.

allinurl commented 3 years ago

That's right. When parsing a file, it checks the inode, and since it's a new file, it adds the entries as dups. Can you try simply piping the data directly into goaccess (note the additional - instead of the file name)? Let me know if that works for you. Thanks again!


# Concatenate log files into a single file
zcat -f $LOGFILE | sort -k 4 | sed -E \
    -e '/ "(OPTIONS|HEAD) \//d' \
    -e '/ "(GET|POST|HEAD) \/ /d' \
    -e '/\] "\\x/d' \
    -e 's|/ HTTP| HTTP|' \
    -e 's;/[0-9]+(/|\?| );/$var\1;g' \
    -e 's;/([0-9]+[a-zA-Z\-]|[a-zA-Z\-]+[0-9])[a-zA-Z0-9\-]*(/|\?| );/$var\2;g' | goaccess - --process-and-exit -p test.conf

rtista commented 3 years ago

Yes, I believe now I can. That was my first attempt when developing this log analysis system; however, there was huge RAM consumption when the data was piped into GoAccess, which would get it killed by the OOM killer. I believe you have since fixed that problem, as it no longer seems to occur.

However, now I'm facing a problem in which the whole log file is not processed. I'm processing a log of 692,571 lines and only 81,608 total requests are shown in the report. Any idea what may be causing this? I even removed the SED and the SORT from the command, but it still occurs.

Even if I test with test.log (the previously discussed one), only 4 requests appear in the Total Requests section. :cry:

Also, the duplication problem seems to persist: every time I execute the analysis again for the same test.log, GoAccess adds three requests to the Total Requests section.

0bi-w6n-K3nobi commented 3 years ago

Hi @rtista.

Did you take a look at the --keep-last parameter? If I'm right, the default value is 7. Try it with zero, i.e., --keep-last=0, and see the result. See the manual for more details here.

rtista commented 3 years ago

Hello @0bi-w6n-K3nobi, for my last comment's tests I used a value of 10 for keep-last; however, the log file I'm working with contains entries from a single day (yesterday), so it should not be due to the --keep-last parameter. The values in test.log are from Oct 21, so keep-last 10 should have no influence whatsoever.

0bi-w6n-K3nobi commented 3 years ago

Hmmm... the problem must be too obvious for us to see. It seems stupid to ask, but... are you sure that report.html is being generated every time? To be safe, remove the report.html file before running.

rtista commented 3 years ago

Yes, report.html is clearly being generated, or else I would not see increasing numbers in total requests when reprocessing the same log.

0bi-w6n-K3nobi commented 3 years ago

Ok ... let's go very slowly this time.

First of all, it is important to highlight what @allinurl has already said: "When parsing a file, it checks the inode ..." If you create a new file every time, it will be processed as a new/different file every time.

On the other hand, if you use - [ pipe ] as input, then every file will be considered the same. It is equivalent to saying that all of them are the same file, with inode == 0. For that case, you might think of it as the opposite of the 1st one.

For example, consider the following case: cat LOG-a | goaccess ..., where LOG-a goes until 13:00; then cat LOG-b | goaccess ..., where LOG-b starts at 08:00. In this case, all LOG-b records from 08:00 until 13:00 will be ignored.

However, if you run goaccess LOG-a ... and then goaccess LOG-b ..., GoAccess will count all records from both LOGs, unlike the case above.

And finally, if you do cat LOG-a > LOG-c and then goaccess LOG-c ..., GoAccess will duplicate LOG-a's entries, as LOG-c has the same content but a different inode.

This all happens because GoAccess uses inode + timestamp [date + time] as a marker so as not to reprocess records from the same source.
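
To make that marker concrete, here is a minimal sketch in C; the names are invented for illustration and this is not GoAccess's actual code:

/* Minimal sketch (illustrative names, not GoAccess code) of the
 * inode + timestamp marker: a source is only "the same" when the inode
 * matches, and then records at or before the saved timestamp are skipped.
 * Piped input always carries inode == 0. */
#include <stdio.h>
#include <time.h>
#include <sys/types.h>

struct marker {
  ino_t  inode; /* 0 for piped input */
  time_t ts;    /* last timestamp persisted by the previous run */
};

/* Return 1 when a record from `inode` stamped `ts` should be skipped. */
static int
is_dup (const struct marker *saved, ino_t inode, time_t ts) {
  if (inode != saved->inode)
    return 0;             /* different inode: treated as a brand-new file */
  return ts <= saved->ts; /* same source: drop records at/before the marker */
}

int
main (void) {
  struct marker saved = { 0, 1603319217 }; /* pipe marker from a prior run */
  printf ("%d\n", is_dup (&saved, 0, 1603319217)); /* 1 -> skipped */
  printf ("%d\n", is_dup (&saved, 0, 1603319218)); /* 0 -> processed */
  return 0;
}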

That's all, @rtista. I hope my explanations were clear.

rtista commented 3 years ago

@0bi-w6n-K3nobi the way you explained it is just the way it worked in my mind. I just thought that the inode-based strategy would also take the timestamp into account in order to provide a safer approach, but it doesn't, and that's fine. My first problem is resolved by understanding that the inode-based approach does not consider the timestamp when the inode does not match.

However, a second problem arose from @allinurl's test case. Piping the test.log into GoAccess generates a report of 4 Total Requests, when it should show 10 Total Requests. Besides this, if I execute the test.sh script multiple times, 3 more requests are added to the Total Requests, which should be impossible since it should be using the timestamp-based approach. I'm leaving the details below so you can try to reproduce. I'm using the latest build from the master branch.

test.log:

10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.002
10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.024
10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.002
10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.5.5 curl/7.64.0 PHP/7.2.22" "-" "test.example.com" 0.003
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "okhttp" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004

test.conf

config-dialog false
hl-header true
json-pretty-print false
no-color false
no-column-names false
no-csv-summary false
no-progress true
no-tab-scroll false
no-parsing-spinner true
with-mouse false
no-global-config true
agent-list false
with-output-resolver false
http-method yes
http-protocol no
no-query-string true
no-term-resolver false
444-as-404 false
4xx-to-unique-count false
all-static-files true
double-decode false
ignore-crawlers false
crawlers-only false
ignore-panel REFERRERS
ignore-panel KEYPHRASES
real-os true
sort-panel REQUESTS,BY_CUMTS,DESC
static-file .css
static-file .js
static-file .jpg
static-file .png
static-file .gif
static-file .ico
static-file .jpeg
static-file .pdf
static-file .csv
static-file .mpeg
static-file .mpg
static-file .swf
static-file .woff
static-file .woff2
static-file .xls
static-file .xlsx
static-file .doc
static-file .docx
static-file .ppt
static-file .pptx
static-file .txt
static-file .zip
static-file .ogg
static-file .mp3
static-file .mp4
static-file .exe
static-file .iso
static-file .gz
static-file .rar
static-file .svg
static-file .bmp
static-file .tar
static-file .tgz
static-file .tiff
static-file .tif
static-file .ttf
static-file .flv
static-file .json
static-file .tpl
persist true
restore true
time-format %T
date-format %d/%b/%Y
log-format %h - %^ [%d:%t %^] "%r" %s %b "%R" "%u" "%^" "%v" %T
db-path data/test/
keep-last 10

test.sh:

#!/bin/bash
[[ ! -d data/test ]] && mkdir -p data/test/

cat test.log | goaccess --process-and-exit -p test.conf -

goaccess -p test.conf -o test.html

allinurl commented 3 years ago

@rtista Thanks for posting this. I'm looking into this and I'll post back asap.

rtista commented 3 years ago

Thank you very much @allinurl, take your time! I'm just trying to be of service in any way I'm able! ;)

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

The timestamp strategy is the problem. Any subsequent record with the same timestamp will always be ignored. Remember that a PIPE is the same as a file with inode == 0. Of course, the processing will be correct if you disable the --restore and --persist options.

We will have to think of something more elaborate. Perhaps a simplified CRC of the record + timestamp for the case where the input is a PIPE, i.e., inode == 0?

See you soon.

allinurl commented 3 years ago

@rtista @0bi-w6n-K3nobi That's right. The issue occurs with if (!glog->inode && lp.ts >= glog->lp.ts). When attempting to restore from disk with no inode (a pipe), we compare timestamps, but it's exclusive. This was implemented per this comment.

The timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this.

We could change it to be inclusive with the risk of having dups, keep it the way it is, or, like you said, get more elaborate with e.g., crc + ts. Thoughts?
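
For reference, the two policies can be sketched side by side; these are illustrative helpers, not the actual should_restore_from_disk() code (an excerpt of that appears later in this thread):

/* Sketch of the two comparison policies for piped input (inode == 0).
 * "Exclusive" skips records stamped exactly at the restored marker;
 * "inclusive" keeps them, at the risk of duplicates on a re-run. */
#include <stdio.h>
#include <time.h>

/* Mirrors the current check: skip when lp.ts >= glog->lp.ts. */
static int
skip_exclusive (time_t restored_ts, time_t record_ts) {
  return record_ts <= restored_ts;
}

/* The inclusive variant: skip only records strictly older than the marker. */
static int
skip_inclusive (time_t restored_ts, time_t record_ts) {
  return record_ts < restored_ts;
}

int
main (void) {
  /* a record stamped exactly at the restored marker (100) */
  printf ("exclusive: %d, inclusive: %d\n",
          skip_exclusive (100, 100), skip_inclusive (100, 100)); /* 1, 0 */
  return 0;
}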

allinurl commented 3 years ago

By the way, the checksum will probably need to cover multiple lines, since two or more consecutive requests could potentially be exactly the same.
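
One way to picture the idea is to hash the last few raw lines together, so two identical consecutive requests still yield a usable resume marker because the preceding history differs. A minimal sketch, using djb2 purely for brevity; none of this is GoAccess code:

/* Sketch: checksum the last N raw lines together rather than a single line. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t
djb2 (uint32_t h, const char *s) {
  while (*s)
    h = ((h << 5) + h) + (unsigned char) *s++;
  return h;
}

/* Fold `n` consecutive lines into one checksum. */
static uint32_t
tail_checksum (const char *lines[], size_t n) {
  uint32_t h = 5381;
  for (size_t i = 0; i < n; i++)
    h = djb2 (h, lines[i]);
  return h;
}

int
main (void) {
  /* checksum over the last two lines seen, e.g. two identical requests */
  const char *tail[] = {
    "10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] \"POST /api/dont/do/it HTTP/1.1\" 200 7",
    "10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] \"POST /api/dont/do/it HTTP/1.1\" 200 7",
  };
  printf ("%08x\n", tail_checksum (tail, 2));
  return 0;
}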

rtista commented 3 years ago

@allinurl and @0bi-w6n-K3nobi, are you sure that explains the behavior I described?

When I first import the dataset (test.log) via pipe, I get 4 Total Requests analyzed, and these requests match the first request for each timestamp:

Requested Files panel: [screenshot]

First request for each timestamp in the dataset (test.log):

10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.003
10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 HTTP/1.1" 200 34 "-" "-" "-" "test.example.com" 0.002
10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it HTTP/1.1" 200 7 "-" "GuzzleHttp/6.3.3 curl/7.29.0 PHP/7.0.33" "-" "test.example.com" 0.027
10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 HTTP/1.1" 200 34 "-" "GuzzleHttp/6.3.3 curl/7.67.0 PHP/7.2.31" "-" "test.example.com" 0.004

When I import the same dataset once more, Total Requests is incremented by 3, adding the following requests: [screenshot]

These 3 requests correspond to the first for each timestamp (21/Oct/2020:23:26:58 +0100, 21/Oct/2020:23:26:59 +0100, 21/Oct/2020:23:27:00 +0100), excluding the first timestamp (21/Oct/2020:23:26:57 +0100). Could we maybe be saving the timestamp of the first request instead of that of the last request?

Also, explain something to me, when we import from pipe (with --restore and --persist), we:

  1. Retrieve the last imported request timestamp from the database
  2. Process all records whose timestamps are greater than that timestamp

Or:

  1. Retrieve the last imported request timestamp from the database
  2. Process the first record whose timestamp is greater than that timestamp (retrieved from the DB), update the last timestamp to the timestamp of this just-processed request, and continue processing

?

Because if I'm importing the dataset for the first time, all 10 dataset lines should be processed into the database, not only 4. I think what I mean is that the timestamp check should only verify that the request timestamp is strictly greater than the timestamp of the last request in the database, and not compare timestamps between requests of the same piped data.
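
In other words, something like the following sketch (illustrative names, not a patch): load the restored timestamp once per run and compare every piped record against that snapshot only:

/* Sketch: per-run snapshot of the restored timestamp. */
#include <stdio.h>
#include <time.h>

static time_t session_ts; /* restored once from the on-disk DB at startup */

static void
begin_session (time_t restored_ts) {
  session_ts = restored_ts;
}

/* Keep any record strictly newer than the previous run's last timestamp,
 * no matter how records within this run compare to each other. */
static int
should_process (time_t record_ts) {
  return record_ts > session_ts;
}

int
main (void) {
  begin_session (100); /* pretend the DB's last timestamp was 100 */
  printf ("%d %d\n", should_process (100), should_process (101)); /* 0 1 */
  return 0;
}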

0bi-w6n-K3nobi commented 3 years ago

Good morning, @rtista. The same here!

For the first run, I got: [screenshots]

and for the second: [screenshots]

0bi-w6n-K3nobi commented 3 years ago

Yeah. You are right, @allinurl. A checksum will probably have problems with consecutive requests that are exactly the same; soon after my post above I concluded this too. And a checksum would add some overhead to request processing.

I thought of a marker something like: timestamp + repeated timestamp count,

but I think it can be difficult to implement. To be clearer, using the LOG above, the markers would be:

timestamp1 + 1 -- 10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/59123452634389 ...
timestamp1 + 2 -- 10.0.5.124 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/53453457649990 ...
timestamp1 + 3 -- 10.0.6.123 - - [21/Oct/2020:23:26:57 +0100] "POST /api/dont/do/it ...
timestamp1 + 4 -- 10.0.5.123 - - [21/Oct/2020:23:26:57 +0100] "GET /api/do/this/please/34554492634389 ...
timestamp2 + 1 -- 10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/45345345634345 ...
timestamp2 + 2 -- 10.0.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/53454572022944 ...
timestamp2 + 3 -- 10.12.5.123 - - [21/Oct/2020:23:26:58 +0100] "GET /api/do/this/please/5345476549334 ...
timestamp3 + 1 -- 10.13.5.123 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it ...
timestamp3 + 2 -- 10.13.5.122 - - [21/Oct/2020:23:26:59 +0100] "POST /api/dont/do/it ...
timestamp4 + 1 -- 10.0.5.126 - - [21/Oct/2020:23:27:00 +0100] "GET /api/do/this/please/34534594140034 ...

With this algorithm, when processing the current file, it becomes possible to process all records/requests, because records with the same timestamp are also counted.

But, again, I think it can be difficult to implement, because we would always have to track the number of repetitions for the current timestamp.
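
For what it's worth, a minimal sketch of that composite marker, with invented names (an illustration of the idea, not a proposed patch):

/* Sketch: timestamp + repetition count as a composite key. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct ts_marker {
  time_t   ts;  /* record timestamp, 1-second resolution */
  uint32_t seq; /* 1-based position among records sharing this ts */
};

static void
advance (struct ts_marker *m, time_t ts) {
  if (ts == m->ts)
    m->seq++;   /* repeated second: bump the repetition count */
  else {
    m->ts = ts; /* new second: reset the counter */
    m->seq = 1;
  }
}

int
main (void) {
  struct ts_marker m = { 0, 0 };
  time_t sample[] = { 57, 57, 58 }; /* toy one-second timestamps */
  for (int i = 0; i < 3; i++) {
    advance (&m, sample[i]);
    printf ("(%lld, %u)\n", (long long) m.ts, m.seq); /* (57,1) (57,2) (58,1) */
  }
  return 0;
}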

0bi-w6n-K3nobi commented 3 years ago

PS: timestamp1 + 1 isn't an arithmetic add but a composite key with two values. Or it could be timestamp + 1/1000 as an arithmetic add.

rtista commented 3 years ago

PS: timestamp1 + 1 isn't an arithmetic add but a composite key with two values. Or it could be timestamp + 1/1000 as an arithmetic add.

Good morning @0bi-w6n-K3nobi :smile:

Why would we need to do this? Can't we just accept all timestamps that are greater than that of the last request in the database?

Comparing the records of the same log between themselves creates the restriction that the log must be sorted by timestamp, which requires extra processing when concatenating several logs from the same application (generated on different machines). We only need to compare each piped record's timestamp against the timestamp of the last inserted log record, since users just want to reprocess the same log without adding duplicates.

If the problem is not in the comparison but in the insertion into the database, we could probably generate a key from the timestamp and the piped-data iteration (the index of the record in the file). I believe that would already provide enough entropy without becoming too complex, albeit being a little under-engineered.

0bi-w6n-K3nobi commented 3 years ago

@rtista, good afternoon.

I agree about the restriction on processing records of the same log. But the database side is more complicated. The implemented solution aims at not reprocessing records from the same LOG, and in particular at handling problems that arise when it is rotated [e.g., by logrotate]. Therefore, it is impossible/impracticable to use the index of the record in the file.

In addition, we will have to find a solution together with @allinurl.

0bi-w6n-K3nobi commented 3 years ago

Hi, @allinurl

Perhaps now is a good time to implement an --ignore-timestamp option, as a workaround for @rtista to process their LOGs, and soon after elaborate a more complete/useful solution for the timestamp marker.

See you soon.

rtista commented 3 years ago

@0bi-w6n-K3nobi and @allinurl, ignoring the timestamp is not entirely what I want or need; there are two distinct problems here:

  1. Importing the log/dataset for the first time seems to import only the first entry for each timestamp.
  2. Importing the same log a second time seems to import only the first entry for each timestamp, after ignoring the first timestamp in the log/dataset.

From my understanding the solutions for the problems should be:

  1. Do not compare timestamps of the same log/dataset between themselves; ensure only that they are greater than that of the last request present in the database. If timestamps are unique database keys, implement a strategy such as the one suggested by @0bi-w6n-K3nobi (timestamp + repetition count).
  2. Ensure that the timestamp of the last request is retrieved from the database, as, per this test, it seems that the timestamp of the first request is the one being retrieved (since only the requests with the first timestamp are being ignored).

0bi-w6n-K3nobi commented 3 years ago

Hi @rtista,

your conclusions/observations make very good sense.

Hi @allinurl,

I believe we can use @rtista's ideas, mainly for the PIPE input. In the case of a regular LOG, it would probably be necessary to take into account what has already been implemented in patch 2e14dc3.

We can then summarize these ideas as:

  a) with the --restore option, load the value of the last marker timestamp + inode (BD_TS) for each LOG and PIPE;
  b) when processing a LOG/PIPE, use the global value BD_TS, and not the value of the last record;
  c) with the --persist option, save the highest timestamp value in the BD_TS field when finishing the processing.

These conditions, which I think should be implemented at least for the PIPE input, would allow/accomplish:

These conclusions lead to a consideration or consequence: the user becomes more responsible for what he is doing. For example, it would be correct if:

But all of them lead to greater freedom, and to not ignoring requests that came before.

See you soon.

allinurl commented 3 years ago

@rtista Thanks for sending those ideas. I think the issue right now, at least for the initial parse, is the use of --restore; that creates the conflict, as you are not restoring anything at that point. Thus, I believe @0bi-w6n-K3nobi was suggesting --ignore-timestamp?

@0bi-w6n-K3nobi, can I ask what BD_TS means here?

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

BD_TS is the value of the marker timestamp + inode that has already been saved in the BTree hash on disk with the --persist option.

I used this term only to be clearer: it must be compared against the timestamp of the current request, not the timestamp of the previous record. I tried to compile the ideas from @rtista above.

rtista commented 3 years ago

I think the issue right now, at least for the initial parse is the use of --restore, that creates the conflict as you are not restoring anything at that point.

@allinurl I can confirm this; without the --restore option, the first import does indeed import all 10 requests.

The second import, however, keeps importing only the first request for each timestamp that is greater than the first request's timestamp, rather than greater than the last request's timestamp.
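
A toy reproduction (plain C, not GoAccess code) makes this visible: when the marker is rewritten after every record, each record is effectively compared against its predecessor instead of against the previous run:

/* Toy model of the per-record marker update described in this thread. */
#include <stdio.h>

int
main (void) {
  /* the seconds field of the 10 test.log records above */
  int ts[] = { 57, 57, 57, 57, 58, 58, 58, 59, 59, 60 };
  int marker = 0; /* pretend nothing was restored */
  for (int i = 0; i < 10; i++) {
    if (ts[i] > marker)
      printf ("processed record %d (ts=%d)\n", i + 1, ts[i]);
    marker = ts[i]; /* the suspect step: marker advances on every record */
  }
  return 0;
}

Run as-is, this prints only records 1, 5, 8 and 10, matching the 4 Total Requests observed above.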

allinurl commented 3 years ago

@0bi-w6n-K3nobi I'm trying to understand your post a bit more. Could you explain what the purpose of the global timestamp + inode value would be?

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

I will try. I will elaborate and post here as soon as possible.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

Well... when I said global, it was just a way of expressing myself. The correct term would be per session/run.

So, here we go. When GOACCESS runs with the --restore option and PIPE input -- our main difficulty -- I believe the behaviour should be:

I call this process by the nickname of the per-session/run marker.

As I said before, this leads to some consequences:

For a physical LOG [not a PIPE], I believe it will still be necessary to add some complex rules, like your patch 2e14dc3, according to the reasons that led to that patch.

I hope that was clearer. Otherwise, give me feedback on this. See you soon.

0bi-w6n-K3nobi commented 3 years ago

Well... @allinurl, it seems pretty easy.

Just move the call to ht_insert_last_parse (0, glog->lp), per file and PIPE, to the end, when saving to disk, and not on each processed request. This is the challenge.

I see you soon.

rtista commented 3 years ago

Hey guys, just to say that I agree with @0bi-w6n-K3nobi 's implementation, it seems to solve my issues and require less pre-processing when feeding piped data from several files, as this data no longer needs to be previously ordered by timestamp (a very time-consuming process for high amounts of logs) only to be larger than the last persisted timestamp.

allinurl commented 3 years ago

@rtista glad to hear that helped. @0bi-w6n-K3nobi, sorry, I guess I'm not seeing the fix; could you please elaborate a bit more on how moving ht_insert_last_parse (0, glog->lp) to the end (of?) will address this issue when piping data in? Thanks

By the way, just to be clear, I think the current implementation works fine when processing a log. I think the issue we're having is when feeding from a pipe since there's no last line or size of the pipe.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

Let's take a closer look at the code below, in parse.c:

static int
should_restore_from_disk (GLog * glog) {
  GLastParse lp = { 0 };

  if (!conf.restore)
    return 0;

  lp = ht_get_last_parse (glog->inode);       **  1  **

  /* restoring from disk, and no inode (probably a pipe), compare timestamps then
   * (exclusive) */
  if (!glog->inode && lp.ts >= glog->lp.ts)   **  2  **
    return 1;
...

static int
pre_process_log (GLog * glog, char *line, int dry_run) {
...

  glog->lp.ts = mktime (&logitem->dt);        **  3  **

  if (should_restore_from_disk (glog))
    goto cleanup;

  count_process (glog);
...

cleanup:
  free_glog (logitem);

  ht_insert_last_parse (0, glog->lp);         **  4  **

  return ret;
}
...

static int
read_log (GLog ** glog, const char *fn, int dry_run) {
...

  /* insert the inode of the file parsed and the last line parsed */
  if ((*glog)->inode) {
    (*glog)->lp.line = (*glog)->read;
    ht_insert_last_parse ((*glog)->inode, (*glog)->lp);  **  5  **
  }

As you noted, this code works well for a normal LOG, i.e., one that is not a PIPE.

The first observation/problem with this code is line 4. Remember that the first param is the inode. The problem is that all LOGs, and even the PIPE input, update last_parse under entry 0 in the HASH for each request processed. But entry 0 matches the PIPE input too.

Line 1 retrieves last_parse using the inode entry in the HASH. And if it is a PIPE, that will be entry 0. Oops... but at line 4, entry 0 was saved for each record.

So, if we are processing the PIPE input, line 2 will only process requests with a timestamp greater than the last_parse loaded at line 1. But this value has been updated for each record at line 4. The problem is HERE!

To process per session, as I mentioned above, we have to:

In other words, the code would be:

...
  ht_insert_last_parse (UINT_MAX, glog->lp);         **  4  **

...
--  if ((*glog)->inode) {  // remove this to include PIPE saving too.
    (*glog)->lp.line = (*glog)->read;
    ht_insert_last_parse ((*glog)->inode, (*glog)->lp);  **  5  **
--  }

This ensures that the value of lp.ts at line 2 was loaded from the previous session, and not from the previous record. Therefore, applying this modification, requests from the PIPE input will only be ignored if their timestamp is less than the previous session's -- saved in last_parse HASH entry 0.

Is it clearer now? Ok, there may be some side effect that I haven't thought of.

See you soon.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl. Good-morning.

Soon after the post above, I thought to myself... what is the purpose of line 4? What did you intend to solve with it? I raise this right after proposing the modification above, because that modification definitely calls the purpose of line 4 into question.

However, I think it is even more complex to implement saving the highest timestamp value in the HASH on disk, which is what we want for the PIPE input. Here, I will try to describe the process that should be implemented.

We could return here to the old idea of the --ignore-timestamp flag. But now I'm thinking of something more elaborate, and the most appropriate name would be --ordered-timestamp.

For this new flag, we would have the associated variable conf.ordered_timestamp, under the following conditions:

And we can assume that if ordered-timestamp is unset, i.e., conf.ordered_timestamp == -1, the default behaviour of the program depends on the input type:

Then we can make the following changes to the code:

static int
should_restore_from_disk (GLog * glog) {
  GLastParse lp = { 0 };
...
  if (!glog->inode) {
    if(conf.ordered_timestamp == 1 && lp.ts > glog->lp.ts)
      return 1;
    else
      return 0;
  }
...
  if (conf.ordered_timestamp == 0 || glog->lp.ts >= lp.ts)
    return 0;
...
  else if (glog->size < lp.size && (conf.ordered_timestamp == 0 || glog->lp.ts == lp.ts))
    return 0;
...
  else if (glog->size > lp.size && glog->read >= lp.line && (conf.ordered_timestamp == 0 || glog->lp.ts == lp.ts))
    return 0;

Here, one observation to make -- please pay attention to the symbols > and >=: I changed the comparisons to lp.ts > glog->lp.ts and glog->lp.ts >= lp.ts. I believe it is better to include requests with the same timestamp than to ignore them. Keep in mind a server that handles many requests in the same second.

Continuing the changes, to find and save the highest timestamp value...

static int
pre_process_log (GLog * glog, char *line, int dry_run) {
...
cleanup:
  free_glog (logitem);

  session_lp = ht_get_last_parse_session (glog->inode);
  if (session_lp.ts < glog->lp.ts) {
    session_lp.ts = glog->lp.ts;
    ht_insert_last_parse_session (glog->inode, session_lp);
  }
...

static int
read_log (GLog ** glog, const char *fn, int dry_run) {
...
  if ((*glog)->inode) {
    session_lp = ht_get_last_parse_session ((*glog)->inode);
    (*glog)->lp.line = (*glog)->read;
    (*glog)->lp.ts = session_lp.ts;
    ht_insert_last_parse ((*glog)->inode, (*glog)->lp);
  } else {
    session_lp = ht_get_last_parse_session (0);
    (*glog)->lp.line = 0;     /* Ignore # of lines ??? */
    (*glog)->lp.ts = session_lp.ts;
    ht_insert_last_parse (0, (*glog)->lp);
  }

If you don't want to build a new HASH function, you can use this trick:

#define ht_insert_last_parse_session(_inode, _lp)  ht_insert_last_parse (INT_MAX + _inode, _lp)
#define ht_get_last_parse_session(_inode)           ht_get_last_parse (INT_MAX + _inode)

But do not forget to delete these values before saving to disk.

Now, I think I have addressed the whole problem. That's it.

------- PS : Just a note -------

An observation, through an example, of course: what would be the behaviour of the program, due to line 5, if in the 1st execution we do:

goaccess --restore --persist log-a log-b

and in the 2nd run then:

goaccess --restore --persist log-b log-a

We are assuming that the LOGs are static, unchanged. Oops... OK. Everything is fine if you run GOACCESS with the LOGs always in the same order, or with a single LOG. Doing something more elaborate, independent of the order of the LOGs, would require changing the logic of the whole program to use a line delta [the # of lines within each individual LOG] instead of the current global line #.

------- End a note -------

Phew! See you soon.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

For the note above, if you want to develop/think about it... we would have to use individual values for each LOG, instead of the global values of glog->size and glog->read.

Perhaps you could use 2 more variables in the GLog structure, since the program processes one LOG at a time, and also use a session HASH function and save the value at the end, in read_log.

See you soon.

allinurl commented 3 years ago

Hi @0bi-w6n-K3nobi. Thanks for all the details on this. First of all, are we trying to fix the issue where some duplicate values are not counted because the last timestamp is equal to the currently read timestamp (chronologically ordered data, e.g., 29/Oct/2020, 30/Oct/2020, 31/Oct/2020), or are we addressing the issue where some values are not counted because the pipe or log contains entries that are not chronologically ordered, e.g., 31/Oct/2020, 10/May/2020, 30/Oct/2020?

Also, I want to ask you, you mentioned the following:

So, if we are processing the PIPE input, line 2 will only process requests with a timestamp greater than the last_parse loaded at line 1. But this value has been updated for each record at line 4. The problem is HERE!

True, the timestamp value is updated for each record as you noted, but I still fail to see how moving it to read_log() would solve the issue. glog->lp would still contain the last parsed value (the highest timestamp, assuming chronologically ordered data).

Sorry, I may be missing something from your post.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl

Yeah. You are right. If we are processing only the PIPE input, the code leads to this. In my post, I tried to give a more complete and complex solution.

Yeah. I will try to debug and understand the problem with this case in particular, and then I will add the necessary rectifications/notes.

Thanks for showing me the exact point/problem.

See you soon.

0bi-w6n-K3nobi commented 3 years ago

Hi @allinurl.

After thinking a little about the code, I think I can explain why it does not behave as expected. Yeah, I didn't even have to debug for that. ;)

First, let's use a visual resource for this. We can run the program with the following call:

cat test.log | awk '{$7 = NR""$7; print $0}' | goaccess -p test.conf -o test.html

If you prefer, you can run it in 2 steps as described above; the results are the same in both cases. If we do this, we see exactly which requests are processed.

So, here we go. Look carefully at the post with the source code and line numbering. The first time we run the program, the requests will be processed as follows:

 #   load hash TS/line ** 1 **   Valid/Processed   save hash ts/line **  4  **
 1      ---/0                         yes             21/Oct/2020:23:26:57
 2      21/Oct/2020:23:26:57          no                            ...:57
 3                    ...:57          no                            ...:57
 4                    ...:57          no                            ...:57
 5                    ...:57          yes             21/Oct/2020:23:26:58
 6      21/Oct/2020:23:26:58          no                            ...:58
 7                    ...:58          no                            ...:58
 8                    ...:58          yes             21/Oct/2020:23:26:59
 9      21/Oct/2020:23:26:59          no                            ...:59
10                    ...:59          yes             21/Oct/2020:23:27:00

This is a simple execution table, request after request, with the TIMESTAMP values loaded for the check at line 2, and saved at line 4. Remember that the HASH on disk is empty when starting this 1st execution, and that the TIMESTAMP saved at line 4 is that of the request itself. You can check the result in the test.html file. The Requested Files (URLs) panel will show requests # 1, 5, 8 and 10. So far the behaviour is exactly what you are expecting from it.

Is it clear so far? Now, suspense ... What will happen? :)

Let's go to processing, running the 2nd time:

 #   load hash TS/line ** 1 **   Valid/Processed   save hash ts/line **  4  **
 1      21/Oct/2020:23:27:00          no              21/Oct/2020:23:26:57
 2      21/Oct/2020:23:26:57          no                            ...:57
 3                    ...:57          no                            ...:57
 4                    ...:57          no                            ...:57
 5                    ...:57          yes             21/Oct/2020:23:26:58
 6      21/Oct/2020:23:26:58          no                            ...:58
 7                    ...:58          no                            ...:58
 8                    ...:58          yes             21/Oct/2020:23:26:59
 9      21/Oct/2020:23:26:59          no                            ...:59
10                    ...:59          yes             21/Oct/2020:23:27:00

Again, looking at the test.html file and its panel, we will see: 2x requests # 5, 8 and 10, and 1x request # 1.

Is it clear? It is clear that line 4 is the problem. Moving it to read_log() will prevent this behavior; then the TIMESTAMP loaded for line 2 will always be the value from the previous execution, from the HASH saved on disk, and not from the previous request.

I hope I have now clarified the problem. See you soon.

0bi-w6n-K3nobi commented 3 years ago

PS: What are we addressing here?

  • The issue where some values are not counted because of the PIPE: moving line 4 to read_log().
  • The issue where the log contains entries that are not chronologically ordered: creating the --ordered-timestamp flag, with the default behaviour [flag unset] depending on the input type, physical LOG or PIPE.

allinurl commented 3 years ago

Patch applied that solves the PIPE issue with missing counts when restoring from disk. Please let me know how it goes by building from upstream. I'll be releasing v1.4.1 pretty soon. Thanks again!