Closed: 0bi-w6n-K3nobi closed this issue 3 years ago.
I found where the problem is: in the procedure perform_tail_follow, in goaccess.c.
...
See the following piece of code:
size2 = file_size (fn);
...
if (!fseeko (fp, *size1, SEEK_SET))
parse_tail_follow (fp);
fclose (fp);
*size1 = size2;
If the LOG file grows quickly, the parse_tail_follow routine can read beyond size2.
On the next loop iteration, it will process again from position size2 onwards.
So it will process the same records more than once.
The correct approach would be:
if (!fseeko (fp, *size1, SEEK_SET))
size3 = parse_tail_follow (fp);
fclose (fp);
*size1 += size3;
Here, the parse_tail_follow routine must return the number of bytes it actually read.
Tip: when implementing this, the same counter could also be used to update the size in the pipe case.
I also found a second problem, which appears when the CPU is overloaded.
Look at main in goaccess.c:
/* main processing event */
time (&start_proc);
if ((ret = parse_log (&glog, NULL, 0))) {
end_spinner ();
goto clean;
}
...
if (conf.process_and_exit) {
}
/* stdout */
else if (conf.output_stdout) {
standard_output ();
}
First, the program processes all LOGs from the beginning to EOF.
Right after, it executes standard_output, which calls process_html.
In process_html:
...
for (i = 0; i < conf.filenames_idx; ++i) {
if (conf.filenames[i][0] == '-' && conf.filenames[i][1] == '\0')
size1[i] = 0;
else
size1[i] = file_size (conf.filenames[i]);
}
This should be OK. But it's not. Consider the case of several large LOGs, which bring the CPU to overload: it takes considerable time for the program to reach this point again for the same LOG after the initial EOF. The program then loses the piece of this LOG written while the others were being processed, before control returns to this one in the TAIL mechanism, shown below:
while (1) {
if (conf.stop_processing)
break;
for (i = 0; i < conf.filenames_idx; ++i)
perform_tail_follow (&size1[i], conf.filenames[i]); /* 0.2 secs */
Here I propose a simple and efficient solution. I tried to make a minimal intervention, in a simple and very obvious way. Please consider this patch, or at least evaluate the result. I have already tested it and put it into production, and it is working correctly.
The problem: basically, it is caused by the large time interval between reading the LOG and checking its size. There is data loss between two points: after the LOGs are read to EOF, and before the TAIL routine starts. And there is reprocessing between two other points: when the LOG size is taken, and when the TAIL routine runs, since the read advances beyond that size.
Solution: minimize these time intervals and make the count of bytes already read more accurate.
Implementation:
Remove the local file variable size1 in the various routines, replacing it with a global variable glog->filesizes;
Add a count of the bytes last read from the current LOG, glog->size.
PS:
This implementation is also useful for the STDIN/PIPE case;
it would need to be implemented in the routine get_log_sizes as well, to be more accurate in real time.
Thank you for reporting this and working out a patch. I'm looking into this. I'll post back asap.
I can merge this. Do you want me to commit the patch from my end, or are you able to submit a PR? Let me know. Thanks
You can apply it; please just add the credits.
Would it be difficult for me to be a collaborator too? Or at least exchange some ideas about the project? Thanks.
Awesome. Are you able to share your email, or at least the GitHub noreply email, so I can add it to the commit for proper credit?
By all means, I'm happy with any collaboration, please feel free to share any ideas either here or shoot me an email to hello [at] goaccess.io! Thank you, sir!
The pleasure is mine. My email is 71027865+0bi-w6n-K3nob [at] users.noreply.github.com
I have used the software extensively. At my company, I am deploying it in real time across 4 large domains, each with roughly 10 ~ 30 million hits per day.
Yes. I will certainly share ideas and solutions. See you later.
Merged. Thanks again.
I would certainly be interested to hear more how goaccess is working for you and the load it's handling :)
Great. I would like to share that with you too, and tell you about my company and my experiences.
So, where should I write about it? At hello [at] goaccess.io?
Yep, hello [at] goaccess.io is perfect!
@0bi-w6n-K3nobi I noticed there's an issue with the patch that was applied for this issue. It happens in perform_tail_follow(), specifically line 807.
For instance, *size1 += glog->bytes; will keep the count from the last iteration and add the new bytes; however, this won't work when the log gets truncated. So before I attempt to fix this, I'm trying to understand a bit more the need for *size1 += glog->bytes; instead of simply assigning the new size, e.g., *size1 = size2;. Please let me know if that makes sense. Thanks
Hello @0bi-w6n-K3nobi. I made some changes to perform_tail_follow(); could you please test them out and let me know how it goes? I want to make sure things are working fine. Thanks!
Hi @allinurl.
Sorry for not replying sooner; I was very busy this week.
This patch was suggested and applied to solve the real-time problem with LOGs.
The problem appears exactly when you simply assign the new size, *size1 = size2, while the LOG is growing suddenly.
To explain better, look carefully at the code below:
size2 = file_size (fn); ** 1 **
...
if (!fseeko (fp, *size1, SEEK_SET))
parse_tail_follow (fp);
fclose (fp); ** 2 **
*size1 += glog->bytes;
In the time interval between the assignment of the value to size2 [at point 1] and the end of the execution of the routine parse_tail_follow [at point 2], the LOG may grow far beyond size2. [We can call this new size size3.]
In this case, the parse_tail_follow routine will process records up to size3, i.e. beyond size2.
And on the next run, if you assign *size1 = size2, it will reprocess the records in the range between size2 and size3.
This is exactly what happened to me: my real-time processing was counting 20% more records than it should. Therefore, this patch did resolve the problem.
I hope I was clear. :)
Thank you sir!
I have some problems with real-time. All of them involve incorrect accounting of hits. I'm using version 1.4 with MaxMindDB and readline.
Preview: I made several attempts to work around the problem, and only via pipe ('-') does real-time behave properly. I believe the problem lies in the TAIL routine for each LOG.
Scope: I have 4 NGinX servers. Each has 14 logs for the site I want to monitor, one for the accesses to each subdomain. This is the project I need to implement.
1st attempt: I ran GoAccess directly on one of the servers, over NGinX, reading the LOGs directly without any filtering.
2nd attempt: I made a script that concatenates the LOGs and sends them via TAIL and SOCAT. I also introduced a filter via AWK and a wait for each batch of records, thus generating a single LOG that grows slowly until it synchronizes in real time.