VictoriaMetrics / helm-charts

Helm charts for VictoriaMetrics, VictoriaLogs and ecosystem
https://victoriametrics.github.io/helm-charts/
Apache License 2.0

standard dashboards work incompletely on rke2 with cilium #1376

Open didlawowo opened 5 months ago

didlawowo commented 5 months ago

Describe the bug

I'm using the Helm VM k8s stack chart. Grafana comes with dashboards, but some of them are not working correctly.

[screenshots of the affected dashboard panels]

To Reproduce

Just install the k8s stack chart.

Version

latest

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

dmitryk-dk commented 5 months ago

Hi @didlawowo! Which of the dashboards are you using? VictoriaMetrics has its own dashboards, and you can find them here.

dmitryk-dk commented 5 months ago

If you want to use this dashboard, you should check the metrics that are used in it and probably correct them.

didlawowo commented 5 months ago

I'm using the dashboards provided by the k8s VM stack.

dmitryk-dk commented 5 months ago

I'm using the dashboards provided by the k8s VM stack.

In the k8s stack, VictoriaMetrics exposes the dashboards that I shared before. As far as I can see from the domain, you are using Tailscale, so I think you should check which stack you are using.

dmitryk-dk commented 5 months ago

Hi @didlawowo! I reproduced your bug; I need to check how to fix it.

dmitryk-dk commented 5 months ago

Hi @didlawowo! Can you check vmagent in your installation? It will show you where the problem with scrape targets is. Once you fix it, you should see all the information in your dashboards.

[Screenshot of the vmagent targets page, 2024-03-25]
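
For reference, one way to reach vmagent's scrape targets page is to port-forward it locally; the namespace and service name below are placeholders and depend on your release:

# hypothetical namespace/service names, adjust to your installation
kubectl -n monitoring port-forward svc/vmagent-victoria-metrics-k8s-stack 8429:8429
# then open http://localhost:8429/targets to see the state of every scrape target
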
didlawowo commented 5 months ago

Thank you, but could you be more specific? I'm not sure I understand.

didlawowo commented 5 months ago

I'm using the dashboards provided by the k8s VM stack.

In the k8s stack, VictoriaMetrics exposes the dashboards that I shared before. As far as I can see from the domain, you are using Tailscale, so I think you should check which stack you are using.

The Tailscale service is just for exposing it; it has no impact.

dmitryk-dk commented 5 months ago

Thank you, but could you be more specific? I'm not sure I understand.

Hi! We found a bug, and the dashboard should be updated. It happens because some Kubernetes setups may be missing the image or container label: https://github.com/dotdc/grafana-dashboards-kubernetes/issues/18#issuecomment-1218059507

As a small workaround, you can configure your kubelet scrape with the following configuration and check which panels have no data.

kubelet:
  spec:
    # drop high cardinality label and useless metrics for cadvisor and kubelet
    metricRelabelConfigs:
      - action: labeldrop
        regex: (uid)
      - action: labeldrop
        regex: (id|name)
      - action: drop
        source_labels: [__name__]
        regex: (rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count)
      - target_label: image
        replacement: placeholder
[Screenshot, 2024-03-27 16:47]
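
Assuming the kubelet snippet above goes into the chart's values file, it could be applied roughly like this (the repo alias "vm", release name "vmks", namespace "monitoring" and the file name are placeholders):

helm repo add vm https://victoriametrics.github.io/helm-charts/
# apply the workaround values on top of the existing installation
helm upgrade vmks vm/victoria-metrics-k8s-stack -n monitoring -f kubelet-workaround-values.yaml
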
dmitryk-dk commented 5 months ago

@Haleygo or @zekker6, can you take a look into the issue, please?

didlawowo commented 5 months ago

Nice answer. I'm not sure how to configure the kubelet in rke2.

https://docs.rke2.io/reference/windows_agent_config?_highlight=kubelet&_highlight=conf#windows-rke2-agent-cli-help

I'll take a look.

AndrewChubatiuk commented 5 months ago

Hey @didlawowo, what Kubernetes version are you on? What args are you passing to the rke2 agent now?

didlawowo commented 5 months ago

I'm using rke2 with these parameters:

write-kubeconfig-mode: "0600"
server: https://192.168.1.200:9345
token: 
tls-san:
  - "192.168.1.200"
# Make an etcd snapshot every 6 hours
etcd-snapshot-schedule-cron: "0 */6 * * *"
# Keep 56 etcd snapshots (equals 2 weeks at 6 per day)
etcd-snapshot-retention: 56
etcd-expose-metrics: true
cni:
  - cilium
disable:
  - rke2-ingress-nginx
  - rke2-canal
  - rke2-kube-proxy
disable-cloud-controller: true
disable-kube-proxy: true

v1.27.12+rke2r1

AndrewChubatiuk commented 4 hours ago

Hey @didlawowo, I finally found time to test the case from this issue locally, as we are not using RKE2 at all. I was able to reproduce the issues with scraping kube-scheduler, kube-controller-manager and etcd metrics. All these services required additional configuration to become scrapable by vmagent:

1. In /etc/rancher/rke2/config.yaml I had to add several values:

   etcd-expose-metrics: true
   kube-scheduler-arg:
     - bind-address=0.0.0.0               # haven't checked how to pass the address from pod metadata instead
   kube-controller-manager-arg:
     - bind-address=0.0.0.0               # haven't checked how to pass the address from pod metadata instead

2. Additional values for the k8s-stack chart:

   kubeControllerManager:
     vmScrape:
       spec:
         endpoints:
           - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
             port: http-metrics
             scheme: https
             tlsConfig:
               caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
               serverName: localhost          # maybe I've misconfigured something, but there was an issue until this value was set
               insecureSkipVerify: true       # haven't tried to pass automatically generated certificates on agent nodes
   kubeEtcd:
     service:
       port: 2381
       targetPort: 2381
     vmScrape:
       spec:
         endpoints:
           - port: http-metrics
             scheme: http
   kubeScheduler:
     vmScrape:
       spec:
         endpoints:
           - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
             tlsConfig:
               caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
               serverName: 127.0.0.1
               insecureSkipVerify: true       # haven't tried to pass automatically generated certificates on agent nodes
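
Assuming the two snippets above, a rough sequence for applying them might look like this (release, namespace and file names are placeholders, adjust to your installation):

# 1. update /etc/rancher/rke2/config.yaml on the server node(s), then restart rke2
sudo systemctl restart rke2-server
# 2. apply the extra chart values ("vmks", "monitoring" and the file name are hypothetical)
helm upgrade vmks vm/victoria-metrics-k8s-stack -n monitoring -f rke2-control-plane-values.yaml
# 3. re-open vmagent's /targets page to confirm the kube-scheduler,
#    kube-controller-manager and etcd targets are now up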