fstab / grok_exporter

Export Prometheus metrics from arbitrary unstructured log data.
Apache License 2.0
897 stars 152 forks source link

grok_exporter exits when monitored file is moved away #124

Open tass11 opened 4 years ago

tass11 commented 4 years ago

We run grok_exporter with the following path config (sensitive samples removed):

input:
  type: file
  fail_on_missing_logfile: false
  paths:
  - "/path/to/logs/*.log"

Regularly, after certain events, a subdir is created in the logs directory, and all log files in the monitored directory are moved to that subdir. Sometimes at this point grok_exporter would exit, with RC=-1 reporting this problem to syslog:

Aug 20 08:28:09 hostname grok_exporter[10133]: error reading log lines: watch list inconsistent: received a file system event for an unknown directory
Aug 20 08:28:09 hostname systemd[1]: grok-exporter.service: main process exited, code=exited, status=255/n/a

Looks like grok_exporter is receiving an inotify with unexpected path, which triggers immediate exit.

Such behavior is very undesired, because exporter drops all the collected state and jeopardizes stats collected and reported to monitoring.

Is there a workaround to it?

Could you please fix grok_exporter to silently ignore and/or drop the file from watchlist, when receiving inotify from unknown dir, instead of jumping outside of window?

fstab commented 4 years ago

This should not happen.

Before I add a fix ignoring the event, I'd like to try to understand what causes this. Do you see any way how I could reproduce this?

I assume you are running on Linux, as you mention inotify.

I tried creating a subdirectory and moving the logs to the subdirectory a couple of times, but the error didn't occur. Is there any way I can reproduce this on my machine?

tass11 commented 4 years ago

Could not reproduce it reliably in lab conditions, but it happens consistently under certain circumstances: 1) This is Oracle Cloud (OCI) mounted volume, not physical HW. Don't know the exact details, the FS is ext4. 2) The amount of log files watched is approx 500. On lesser loads, this never happened. 3) Logs are moved to a subdir by use of the following command: find ..path../logs/ -maxdepth 1 -type f -exec mv -n -t ..same..path../logs/subdir {} +" And both . and the subdir have exact same owner and group and exact same 755 perms.

Funny thing is, I saw the exact same error before, executing similar scenario on NFS volume. Then I simply attributed the problem to NFS and pushed to have logs stored "locally". Now, I see the same problem on ext4.

I am willing to help with chasing down the problem, please ask for details, I'll try to provide answers.

tass11 commented 4 years ago

And one more detail, if it helps. It is not related to simultaneous writes to files being moved, I have certain proof that the last update was in some minutes time before the move and the error occurred. Not even same minute, solid time interval.

fstab commented 4 years ago

Thanks for the details. I'll set up a similar scenario with a network drive over the weekend and see if I can reproduce this.

fstab commented 4 years ago

Set-up for reproducing is up and running, but so far the error didn't reproduce. I'll keep it running for a few days.

Meanwhile I'll look into not terminating grok_exporter when the error occurs.

tass11 commented 4 years ago

Thank you for your efforts. I'd not be so optimistic about reproducing the issue, though.

Anyways, looking forward for the fix. Re-syncing the dir contents upon receiving such error would be ideal, but simply ignoring it would do.

fstab commented 4 years ago

Is there any chance you can build grok_exporter from source and run this in your environment? I added an improved error message on the master branch to figure out which file system event causes the issue. I'm hoping that if we can figure out which event causes the issue, we can handle that specific event.

BTW my reproducer still runs as expected without any error.

tass11 commented 4 years ago

The reproducer should have fired by now. Something in the scenario must be missing.

We are running lots of ephemeral environments, so there is no telling which one reproduces this problem, if at all. I could analyze the logs once I have spare time, and introduce the master version to the most probable location.

Don't know how to put it, but. If the scenario and the problem fall into "it should not happen"category, then it would probably be safe to ignore it. Could you just do the logging and the ignore parts? I hate being responsible for one heisenbug that I can't even explain, to both sides :)

dmitrytucha commented 3 years ago

Hi! I have the same issue. In the grok_exporter log I see error reading log lines: watch list inconsistent: cannot find directory for file system event "<nil>" with watch descriptor -1 in the list of watched directories [Wd=1 ("/logs/")]

mocart2 commented 3 months ago

Hello, we have same problem too, but in 3 nodes with same config(all the same), only one reproduce this error:

Jul 29 13:44:42 example-srv grok_exporter[3985517]: error reading log lines: watch list inconsistent: cannot find directory for file system event "<nil>: " with watch descriptor -1 in the list of watched directories [Wd=1 ("/opt/myapp/var/log")]
Jul 29 13:44:42 example-srv systemd[1]: grok_exporter.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 29 13:44:42 example-srv systemd[1]: grok_exporter.service: Failed with result 'exit-code'.
Jul 29 13:44:42 example-srv systemd[1]: grok_exporter.service: Consumed 3min 20.565s CPU time.
Jul 29 13:44:42 example-srv systemd[1]: grok_exporter.service: Scheduled restart job, restart counter is at 2.
Jul 29 13:44:42 example-srv systemd[1]: Stopped Prometheus Grok Exporter.
Jul 29 13:44:42 example-srv systemd[1]: grok_exporter.service: Consumed 3min 20.565s CPU time.
Jul 29 13:44:42 example-srv systemd[1]: Started Prometheus Grok Exporter.
Jul 29 13:44:42 example-srv grok_exporter[3993402]: Starting server on http://example-srv:9144/metrics
Jul 29 13:44:43 example-srv grok_exporter[3993402]: error reading log lines: watch list inconsistent: cannot find directory for file system event "<nil>: " with watch descriptor -1 in the list of watched directories [Wd=1 ("/opt/myapp/var/log")]
Jul 29 13:44:43 example-srv systemd[1]: grok_exporter.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 29 13:44:43 example-srv systemd[1]: grok_exporter.service: Failed with result 'exit-code'.
Jul 29 13:44:43 example-srv systemd[1]: grok_exporter.service: Consumed 3.436s CPU time.
Jul 29 13:44:43 example-srv systemd[1]: grok_exporter.service: Scheduled restart job, restart counter is at 3.
Jul 29 13:44:43 example-srv systemd[1]: Stopped Prometheus Grok Exporter.
Jul 29 13:44:43 example-srv systemd[1]: grok_exporter.service: Consumed 3.436s CPU time.
Jul 29 13:44:43 example-srv systemd[1]: Started Prometheus Grok Exporter.
Jul 29 13:44:43 example-srv grok_exporter[3993425]: Starting server on http://example-srv:9144/metrics
Jul 29 13:44:44 example-srv grok_exporter[3993425]: error reading log lines: watch list inconsistent: cannot find directory for file system event "<nil>: " with watch descriptor -1 in the list of watched directories [Wd=1 ("/opt/myapp/var/log")]
Jul 29 13:44:44 example-srv systemd[1]: grok_exporter.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 29 13:44:44 example-srv systemd[1]: grok_exporter.service: Failed with result 'exit-code'.
Jul 29 13:44:44 example-srv systemd[1]: grok_exporter.service: Consumed 1.884s CPU time.
Jul 29 13:44:44 example-srv systemd[1]: grok_exporter.service: Scheduled restart job, restart counter is at 4.
Jul 29 13:44:44 example-srv systemd[1]: Stopped Prometheus Grok Exporter.
Jul 29 13:44:44 example-srv systemd[1]: grok_exporter.service: Consumed 1.884s CPU time.
Jul 29 13:44:44 example-srv systemd[1]: Started Prometheus Grok Exporter.
Jul 29 13:44:45 example-srv systemd[1]: grok_exporter.service: Consumed 1.695s CPU time.
Jul 29 13:44:45 example-srv systemd[1]: grok_exporter.service: Scheduled restart job, restart counter is at 5.
Jul 29 13:44:45 example-srv systemd[1]: Stopped Prometheus Grok Exporter.
Jul 29 13:44:45 example-srv systemd[1]: grok_exporter.service: Consumed 1.695s CPU time.