apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0

k8s service collection error #8026

Closed 844700118 closed 2 years ago

844700118 commented 2 years ago

Search before asking

Apache SkyWalking Component

OAP server (apache/skywalking)

What happened

1. On the dashboard's k8s module, the "cluster" and "node" sub-modules show data as expected, but the "service" sub-module displays no data.

2. Server error log: [root@k8s-master ~/apache-skywalking-apm-bin-es7]# tail -f logs/skywalking-oap-server.log

......
2021-10-27 19:00:32,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Pod-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Pod#Start listing and watching...
2021-10-27 19:00:32,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Service-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Service#Start listing and watching...
2021-10-27 19:00:33,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Pod-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Pod#Start listing and watching...
2021-10-27 19:00:33,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Service-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Service#Start listing and watching...
2021-10-27 19:00:34,463 - org.apache.skywalking.oap.meter.analyzer.dsl.Expression - 88 [grpcServerPool-1-thread-17] ERROR [] - failed to run "(100 - ((node_memory_SwapFree_bytes * 100) / node_memory_SwapTotal_bytes)).tag({tags -> tags.node_identifier_host_name = 'vm::' + tags.node_identifier_host_name}).service(['node_identifier_host_name'])"
java.lang.IllegalArgumentException: null
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128) ~[guava-28.1-jre.jar:?]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.build(SampleFamily.java:78) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.newValue(SampleFamily.java:487) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.div(SampleFamily.java:193) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily$div$9.call(Unknown Source) ~[?:?]
        at Script1.run(Script1.groovy:1) ~[?:?]
        at org.apache.skywalking.oap.meter.analyzer.dsl.Expression.run(Expression.java:77) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.Analyzer.analyse(Analyzer.java:115) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.MetricConvert.toMeter(MetricConvert.java:73) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.prometheus.PrometheusMetricConverter.toMeter(PrometheusMetricConverter.java:84) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.lambda$onNext$6(OCMetricHandler.java:79) ~[otel-receiver-plugin-8.7.0.jar:8.7.0]
        at java.util.ArrayList.forEach(ArrayList.java:1259) [?:1.8.0_262]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.onNext(OCMetricHandler.java:79) [otel-receiver-plugin-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.onNext(OCMetricHandler.java:61) [otel-receiver-plugin-8.7.0.jar:8.7.0]
        at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:249) [grpc-stub-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.32.1.jar:1.32.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
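Note that the stack trace above comes from the node swap-usage expression, not from the k8s service rules, so it may be a separate symptom. A plausible cause, assuming swap is disabled on the target node, is that `node_memory_SwapTotal_bytes` is zero or absent, so the division produces an empty sample set and `SampleFamily.build` rejects it. A hedged spot check (the exporter address is taken from the scrape config later in this issue):

```shell
# Hypothetical check; 192.168.1.131:9110 is the jvm-node-exporter target from
# the scrape config below. On a node with swap disabled, the exporter
# typically reports both swap gauges as 0:
#   curl -s http://192.168.1.131:9110/metrics | grep '^node_memory_Swap'
# Simulated exporter output for illustration:
sample='node_memory_SwapFree_bytes 0
node_memory_SwapTotal_bytes 0'
# Two zero-valued gauges would make the (SwapFree * 100) / SwapTotal
# expression yield no valid samples, matching the error above.
echo "$sample" | grep -c '^node_memory_Swap'
```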

3. k8s metrics collection (kube-state-metrics) is normal: [root@master131 ~]# kubectl logs -f -n kube-system kube-state-metrics-0

I1027 10:01:11.984341       1 main.go:106] Using default resources
I1027 10:01:12.128159       1 main.go:118] Using all namespace
I1027 10:01:12.128166       1 main.go:139] metric allow-denylisting: Excluding the following lists that were on denylist: 
W1027 10:01:12.128948       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1027 10:01:12.212866       1 main.go:241] Testing communication with server
I1027 10:01:12.303482       1 main.go:246] Running with Kubernetes cluster version: v1.20. git version: v1.20.2. git tree state: clean. commit: faecb196815e248d3ecfb03c680a4507229c2a56. platform: linux/amd64
I1027 10:01:12.303518       1 main.go:248] Communication with server successful
I1027 10:01:12.303837       1 main.go:204] Starting metrics server: [::]:8080
I1027 10:01:12.303864       1 metrics_handler.go:102] Autosharding enabled with pod=kube-state-metrics-0 pod_namespace=kube-system
I1027 10:01:12.303886       1 metrics_handler.go:103] Auto detecting sharding settings.
I1027 10:01:12.303881       1 main.go:193] Starting kube-state-metrics self metrics server: [::]:8081
I1027 10:01:12.304116       1 main.go:64] levelinfomsgTLS is disabled.http2false
I1027 10:01:12.304203       1 main.go:64] levelinfomsgTLS is disabled.http2false
I1027 10:01:12.363206       1 builder.go:190] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
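For the missing "service" panels specifically, it may be worth confirming that kube-state-metrics actually exposes the per-service series the OAP service rule aggregates. A hedged spot check (pod name from the log above; `kube_service_info` is the standard kube-state-metrics series name):

```shell
# Hypothetical check against the running pod (adjust names to your cluster):
#   kubectl -n kube-system port-forward pod/kube-state-metrics-0 8080:8080 &
#   curl -s http://localhost:8080/metrics | grep '^kube_service_info'
# A healthy scrape contains one kube_service_info line per Service; simulated
# sample for illustration:
sample='kube_service_info{namespace="default",service="otel-collector",cluster_ip="10.1.2.3"} 1'
echo "$sample" | grep -c '^kube_service_info'
```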

4. The OpenTelemetry Collector is collecting data normally: [root@master131 ~]# kubectl logs -f otel-collector-7bb5b98564-stvdg

2021-10-27T11:34:43.650Z        info    service/collector.go:262        Starting otelcol...     {"Version": "v0.29.0", "NumCPU": 28}
2021-10-27T11:34:43.657Z        info    service/collector.go:322        Using memory ballast    {"MiBs": 683}
2021-10-27T11:34:43.657Z        info    service/collector.go:170        Setting up own telemetry...
2021-10-27T11:34:43.659Z        info    service/telemetry.go:99 Serving Prometheus metrics      {"address": ":8888", "level": 0, "service.instance.id": "9903e31e-d72f-4222-a2a8-32c94a0836db"}
2021-10-27T11:34:43.659Z        info    service/collector.go:205        Loading configuration...
2021-10-27T11:34:43.662Z        info    service/collector.go:221        Applying configuration...
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:274        Exporter was built.     {"kind": "exporter", "exporter": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:274        Exporter was built.     {"kind": "exporter", "exporter": "logging"}
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:204        Pipeline was built.     {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    builder/receivers_builder.go:230        Receiver was built.     {"kind": "receiver", "name": "prometheus", "datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    service/service.go:137  Starting extensions...
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:53        Extension is starting...        {"kind": "extension", "name": "health_check"}
2021-10-27T11:34:43.662Z        info    healthcheckextension/healthcheckextension.go:41 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Port":0,"TCPAddr":{"Endpoint":"0.0.0.0:13133"}}}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:59        Extension started.      {"kind": "extension", "name": "health_check"}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:53        Extension is starting...        {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    zpagesextension/zpagesextension.go:42   Register Host's zPages  {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    zpagesextension/zpagesextension.go:55   Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"TCPAddr":{"Endpoint":"localhost:55679"}}}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:59        Extension started.      {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    service/service.go:182  Starting exporters...
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:92 Exporter is starting... {"kind": "exporter", "name": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:97 Exporter started.       {"kind": "exporter", "name": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:92 Exporter is starting... {"kind": "exporter", "name": "logging"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:97 Exporter started.       {"kind": "exporter", "name": "logging"}
2021-10-27T11:34:43.662Z        info    service/service.go:187  Starting processors...
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:51 Pipeline is starting... {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:62 Pipeline is started.    {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    service/service.go:192  Starting receivers...
2021-10-27T11:34:43.662Z        info    builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "prometheus"}
2021-10-27T11:34:43.663Z        info    kubernetes/kubernetes.go:282    Using pod service account via in-cluster config {"kind": "receiver", "name": "prometheus", "level": "info", "discovery": "kubernetes"}
2021-10-27T11:34:43.679Z        info    kubernetes/kubernetes.go:282    Using pod service account via in-cluster config {"kind": "receiver", "name": "prometheus", "level": "info", "discovery": "kubernetes"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "static/0", "subs": "[jvm-node-exporter]"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "kubernetes/1", "subs": "[kubernetes-cadvisor]"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "kubernetes/2", "subs": "[kube-state-metrics]"}
2021-10-27T11:34:43.680Z        info    builder/receivers_builder.go:75 Receiver started.       {"kind": "receiver", "name": "prometheus"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:213        Discoverer channel closed       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "static/0"}
2021-10-27T11:34:43.680Z        info    healthcheck/handler.go:129      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2021-10-27T11:34:43.680Z        info    service/collector.go:182        Everything is ready. Begin running and processing data.
2021-10-27T11:34:50.493Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 170}
2021-10-27T11:34:50.493Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 170}
2021-10-27T11:34:50.708Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 70}
2021-10-27T11:34:51.930Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 46}
2021-10-27T11:34:52.944Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 70}

5. I am not sure whether the OpenTelemetry Collector configuration is correct: [root@master131 ~]# vi ./otel-collector-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
  namespace: default
data:
  otel-collector-config: |
    #1. Data input (receivers)
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 5s
            evaluation_interval: 5s
          scrape_configs:
            #Collect jvm
            - job_name: 'jvm-node-exporter'
              static_configs:
                - targets: ['192.168.1.131:9110']
            #Collect k8s
            - job_name: 'kubernetes-cadvisor'
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
              - role: node
              relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)
              - source_labels: []       # relabel the cluster name 
                target_label: cluster
                replacement: k8s-131
              - target_label: __address__
                replacement: kubernetes.default.svc:443
              - source_labels: [__meta_kubernetes_node_name]
                regex: (.+)
                target_label: __metrics_path__
                replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
              - source_labels: [instance]   # relabel the node name 
                separator: ;
                regex: (.+)
                target_label: node
                replacement: $$1
                action: replace
            - job_name: kube-state-metrics
              kubernetes_sd_configs:
              - role: endpoints
              relabel_configs:
              - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                regex: kube-state-metrics
                replacement: $$1
                action: keep
              - action: labelmap
                regex: __meta_kubernetes_service_label_(.+)
              - source_labels: []  # relabel the cluster name 
                target_label: cluster
                replacement: k8s-131
    #2. Processors: preprocessing applied before the data is exported
    processors:
      batch:
    #Self-health check
    extensions:
      health_check: {}
      zpages: {}
    #3. Data output (exporters)
    exporters:
      opencensus:
        endpoint: "192.168.1.214:11800"
        insecure: true
      logging:
        logLevel: info
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [opencensus,logging]

---

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: default
spec:
  type: NodePort
  ports:
  - name: metrics 
    port: 8888
    targetPort: 8888
    nodePort: 58888
  selector:
    component: otel-collector

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: default
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      serviceAccountName: prometheus
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-collector-config.yaml"
          - "--log-level=info"
          - "--mem-ballast-size-mib=683"
        image: otel/opentelemetry-collector:0.29.0
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # ZPages endpoint
        - containerPort: 55680 # ZPages endpoint
        - containerPort: 4317  # OpenTelemetry receiver
        - containerPort: 8888  # querying metrics
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol
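Since the cluster and node panels work but the service panel does not, one part of the pipeline the issue does not show is the OAP side: the service panels are fed by the k8s-service OTel rule, which must be listed in the enabled rules on the OAP server. A hedged sketch, assuming the environment variable names from the SkyWalking 8.7 config/application.yml (verify them against your version before relying on this):

```shell
# Hypothetical OAP configuration check; the env var names are assumptions
# taken from config/application.yml defaults, not from this issue:
export SW_OTEL_RECEIVER=default
export SW_OTEL_RECEIVER_ENABLED_OC_RULES="k8s-cluster,k8s-node,k8s-service"
# The service panels stay empty unless k8s-service is in the enabled rule list.
case "$SW_OTEL_RECEIVER_ENABLED_OC_RULES" in
  *k8s-service*) echo "k8s-service rule enabled" ;;
  *)             echo "k8s-service rule missing" ;;
esac
```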

What you expected to happen

It may be a problem with the OpenTelemetry Collector configuration, but I don't know where the problem is. I'm asking for help.

How to reproduce

The OpenTelemetry Collector configuration file is described above.

Anything else

No response

Are you willing to submit PR?

Code of Conduct

wu-sheng commented 2 years ago

Are you willing to submit PR? Yes I am willing to submit a PR!

Are you sure? If so, this issue will be assigned to yourself. We will wait for pull request only.