allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Duplicated request entries on "Requested Files" panel #1945

Closed: rtista closed this issue 3 years ago

rtista commented 4 years ago

Hello! I'm currently trying to automate report generation for a project with several deployed nodes. Each node writes to its own log, and I'm generating the report from those logs. Each log covers one day of accesses, and the logs are rotated daily around 3 AM.

# Concatenate log files and sort them into a single file
zcat -f $LOGS | sort -k $SORT > $PARSED_LOG

# Process log variations
bash $BASEDIR/process-log-variations.sh $PARSED_LOG

# Import log data into BTREE databases
goaccess --process-and-exit -p $conf $PARSED_LOG

# Generate HTML from BTREE databases
goaccess -p $conf -o $REPDIR/report.html

My script receives several log files, concatenates them, sorts them by the date field, and then uses sed to remove variable fields such as IDs, hashes, and others from the request URLs, replacing them with the token "$var". This produces a single log file which I then feed to GoAccess, first persisting the data and then updating the HTML report with the data from the latest log file.
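
For context, the contents of process-log-variations.sh are not shown in this thread; below is a minimal sketch of the kind of normalization it could perform, assuming GNU sed. The patterns are illustrative guesses, not the actual rules the script uses.

#!/usr/bin/env bash
# Hypothetical sketch of process-log-variations.sh -- rewrites the
# concatenated log in place, collapsing variable URL segments into the
# literal token "$var" so requests to the same endpoint aggregate
# under a single entry in GoAccess.
set -euo pipefail

PARSED_LOG="$1"

# Numeric path segments, e.g. /api/users/12345 -> /api/users/$var
sed -i -E 's#/[0-9]+([/ ?])#/$var\1#g' "$PARSED_LOG"

# Long hex segments (hashes/tokens), e.g. /auth/ab12cd34ef567890ab -> /auth/$var
sed -i -E 's#/[0-9a-fA-F]{16,}([/ ?])#/$var\1#g' "$PARSED_LOG"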

The problem is that, after loading 4 or 5 days of logs, the "Requested Files" panel shows the behavior in the following image:

[screenshot: "Requested Files" panel listing several duplicate "GET /api/auth/token" entries]

As you can see, there are several "GET /api/auth/token" entries when only one should appear. Does anyone know why this happens? Is it a bug, or am I doing something wrong?

0bi-w6n-K3nobi commented 4 years ago

Hi @allinurl .

I will look at the code and test it against the log from @rtista as soon as possible.

GitHub is weird here. When I preview the patch, it shows the parse.c code all wrong, mixing it with other files. But when I view the entire file, it is correct ...

Glad I can help.

rtista commented 4 years ago

Hey @allinurl ! Thank you very much for the great work! I'll test it out as soon as possible and will report back my findings here ;)

0bi-w6n-K3nobi commented 4 years ago

Great, @allinurl !

It worked like a charm. I ran the tests with the log I mentioned above.

Your idea for the is_likely_same_log function is very good. I believe it is better than a per-line CRC, or at least less expensive. Thanks for crediting me for the PIPE-input case, i.e. moving the save point for ht_insert_last_parse. I think the biggest job was yours.

Do you plan to implement something similar to the --ordered-timestamp flag in the future? And a line counter per file instead of a global one? -- See the note in the comment above so you understand what I'm talking about.

And finally, I believe that these comments need some cosmetic adjustment:

parse.h

typedef struct GLog_ {
...
  uint32_t inode;
  uint64_t bytes;               /* bytes read */
  uint64_t size;                /* bytes read */ ???

parse.c

??? maybe add some comment about this one ???
static int
clean_old_data_by_date (uint32_t numdate) {

static int
should_restore_from_disk (GLog * glog) {
...
  /* If our current line is greater or equal (zero indexed) to the last parsed
   * line and have equal timestamps, then keep parsing then */  ??? where is timestamp ???
  if (glog->inode && is_likely_same_log (glog, lp)) {
    if (glog->size > lp.size && glog->read >= lp.line)

??? maybe add some comment about this one ???
static void
process_invalid (GLog * glog, GLogItem * logitem, const char *line) {

Well ... I think the code is ready for version 1.4.1. You did a good job.

See you later :)

0bi-w6n-K3nobi commented 4 years ago

PS: OK, I know: the code does not require the file to be ordered, whether for PIPE input or a physical log. It just compares the TIMESTAMP with the previous data loaded from disk. But maybe the user wants to force ordering, or no ordering at all ...

rtista commented 3 years ago

Hey @allinurl ! Sorry about the delay, but I presented my master's thesis this week and it was a crazy week! :D I have just tested the new patch and it works wonderfully!! Thank you very much for your work! It's awesome! I'll be closing the issue now! ;)

allinurl commented 3 years ago

Thanks @0bi-w6n-K3nobi and @rtista for letting me know. Stay tuned, I expect the new release to be out this coming week.