deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

Need the ability to turn off disk metrics in telegraf #151

Closed. jchauncey closed this issue 7 years ago

jchauncey commented 7 years ago

For some reason you can't collect disk metrics on a CoreOS box.

sstarcher commented 7 years ago

What issue are you seeing? I collect disk stats perfectly fine with telegraf on CoreOS.

jchauncey commented 7 years ago

This might be an issue with my cluster, but I can't seem to run df on it, and telegraf bombs out too.

sstarcher commented 7 years ago

What version of CoreOS are you running? I tend to run stable and I'm currently on 1122.2.0, where df works perfectly fine.

felixbuenemann commented 7 years ago

@sstarcher @jchauncey is probably talking about these errors:

2016/10/25 00:27:10 E! ERROR in input [inputs.disk]: error getting disk usage info: too many levels of symbolic links
2016/10/25 00:27:20 E! ERROR in input [inputs.disk]: error getting disk usage info: too many levels of symbolic links
2016/10/25 00:27:30 E! ERROR in input [inputs.disk]: error getting disk usage info: too many levels of symbolic links

My fix on CoreOS stable for that was to make sure that binfmt_misc was mounted before docker via a docker.service systemd drop-in:

coreos:
  units:
    - name: docker.service
      drop-ins:
        - name: 60-mount-binfmt-misc.conf
          content: |
            [Unit]
            Wants=proc-sys-fs-binfmt_misc.mount
            After=proc-sys-fs-binfmt_misc.mount

My git commit message for that change said:

Add workaround for binfmt_misc automount issues

Without this fix, storage monitoring in telegraf does not work, because the attempt to trigger the automount of /rootfs/proc/sys/fs/binfmt_misc fails and causes a "Too many levels of symbolic links" error.

Unfortunately, that no longer seems to work on CoreOS beta (1185.2.0), which my k8s 1.4.x clusters are based on, as the logs above show.

I haven't yet found the time to attach systrace or sysdig to telegraf to see if it is indeed the same issue.
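
Short of a full trace, a rough way to narrow down which mount point triggers the error is to call statvfs on every entry in the mount table and see which ones fail. The sketch below is only an illustration and is not how telegraf collects disk usage; the function names and the "/rootfs" prefix argument are hypothetical. On an affected node, the binfmt_misc automount would be expected to show up with ELOOP, but I have not verified that.

```python
import errno
import os


def parse_mount_points(mounts_text):
    """Return the mount-point column from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # /proc/mounts escapes spaces in paths as \040.
            points.append(fields[1].replace("\\040", " "))
    return points


def probe_mount_points(points, prefix=""):
    """Call statvfs on each mount point and collect the failures.

    prefix is prepended to every path, e.g. "/rootfs" when the host
    filesystem is bind-mounted into a container (an assumption about
    the setup, not something this script detects).
    Returns a list of (path, errno_name) tuples for paths that failed.
    """
    failures = []
    for point in points:
        path = prefix + point
        try:
            os.statvfs(path)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((path, name))
    return failures


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as fh:
        mount_points = parse_mount_points(fh.read())
    # Print only the mount points whose usage could not be read.
    for path, code in probe_mount_points(mount_points):
        print(f"{path}: {code}")
```

Run on the host it checks the paths as listed in /proc/mounts; from inside a container that has the host filesystem mounted, pass the mount location as prefix instead.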

sstarcher commented 7 years ago

@felixbuenemann we also run the same thing, but we are on stable, which is still working.

bacongobbler commented 7 years ago

Is there anything we can do on our end to fix this or shall we close this as a known issue with certain CoreOS releases?

jchauncey commented 7 years ago

The problem @felixbuenemann talked about was different from the issue I was seeing. I fixed my problem by just asking for metrics on /.