apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0

Dashboard: the cluster/node sub-modules under the k8s module have data, but the service sub-module shows no data #8025

Closed 844700118 closed 2 years ago

844700118 commented 2 years ago

Search before asking

Apache SkyWalking Component

OAP server (apache/skywalking)

What happened

1. The cluster/node sub-modules under the dashboard's k8s module have data, but the service sub-module shows no data.

(dashboard screenshots attached to the original issue)

2. OAP server error log: [root@k8s-master ~/apache-skywalking-apm-bin-es7]#tail -f logs/skywalking-oap-server.log

......
2021-10-27 19:00:32,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Pod-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Pod#Start listing and watching...
2021-10-27 19:00:32,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Service-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Service#Start listing and watching...
2021-10-27 19:00:33,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Pod-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Pod#Start listing and watching...
2021-10-27 19:00:33,988 - io.kubernetes.client.informer.cache.ReflectorRunnable - 79 [controller-reflector-io.kubernetes.client.openapi.models.V1Service-1] INFO  [] - class io.kubernetes.client.openapi.models.V1Service#Start listing and watching...
2021-10-27 19:00:34,463 - org.apache.skywalking.oap.meter.analyzer.dsl.Expression - 88 [grpcServerPool-1-thread-17] ERROR [] - failed to run "(100 - ((node_memory_SwapFree_bytes * 100) / node_memory_SwapTotal_bytes)).tag({tags -> tags.node_identifier_host_name = 'vm::' + tags.node_identifier_host_name}).service(['node_identifier_host_name'])"
java.lang.IllegalArgumentException: null
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128) ~[guava-28.1-jre.jar:?]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.build(SampleFamily.java:78) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.newValue(SampleFamily.java:487) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily.div(SampleFamily.java:193) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.dsl.SampleFamily$div$9.call(Unknown Source) ~[?:?]
        at Script1.run(Script1.groovy:1) ~[?:?]
        at org.apache.skywalking.oap.meter.analyzer.dsl.Expression.run(Expression.java:77) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.Analyzer.analyse(Analyzer.java:115) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.MetricConvert.toMeter(MetricConvert.java:73) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.meter.analyzer.prometheus.PrometheusMetricConverter.toMeter(PrometheusMetricConverter.java:84) ~[meter-analyzer-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.lambda$onNext$6(OCMetricHandler.java:79) ~[otel-receiver-plugin-8.7.0.jar:8.7.0]
        at java.util.ArrayList.forEach(ArrayList.java:1259) [?:1.8.0_262]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.onNext(OCMetricHandler.java:79) [otel-receiver-plugin-8.7.0.jar:8.7.0]
        at org.apache.skywalking.oap.server.receiver.otel.oc.OCMetricHandler$1.onNext(OCMetricHandler.java:61) [otel-receiver-plugin-8.7.0.jar:8.7.0]
        at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:249) [grpc-stub-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.32.1.jar:1.32.1]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.32.1.jar:1.32.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
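For what it's worth, the stack trace above points at the swap-usage rule, not the k8s-service rules: the division in `(100 - ((node_memory_SwapFree_bytes * 100) / node_memory_SwapTotal_bytes))` fails inside `SampleFamily.div`. A plausible cause (an assumption on my part, not confirmed from the SkyWalking code) is that `node_memory_SwapTotal_bytes` is 0 or absent, since swap is normally disabled on Kubernetes nodes. A minimal Python sketch of that failure mode:

```python
from typing import Optional

# Hypothetical model of the failing MAL expression
#   100 - (node_memory_SwapFree_bytes * 100) / node_memory_SwapTotal_bytes
# The guards below make explicit the two ways it can fail when swap is
# disabled on a node: the series is absent, or SwapTotal == 0.

def swap_used_percent(swap_free: Optional[float],
                      swap_total: Optional[float]) -> float:
    if swap_free is None or swap_total is None:
        # Absent series: the analyzer ends up with an invalid/empty
        # SampleFamily, tripping Preconditions.checkArgument.
        raise ValueError("swap series absent from the scrape")
    if swap_total == 0:
        # Swap disabled (SwapTotal = 0): the division is undefined.
        raise ValueError("node_memory_SwapTotal_bytes is 0")
    return 100 - (swap_free * 100) / swap_total
```

If that is what is happening, this ERROR is likely noise from the node dashboard rules and unrelated to the empty k8s service panel.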

3. Kubernetes metrics collection is normal: [root@master131 ~]# kubectl logs -f -n kube-system kube-state-metrics-0

I1027 10:01:11.984341       1 main.go:106] Using default resources
I1027 10:01:12.128159       1 main.go:118] Using all namespace
I1027 10:01:12.128166       1 main.go:139] metric allow-denylisting: Excluding the following lists that were on denylist: 
W1027 10:01:12.128948       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1027 10:01:12.212866       1 main.go:241] Testing communication with server
I1027 10:01:12.303482       1 main.go:246] Running with Kubernetes cluster version: v1.20. git version: v1.20.2. git tree state: clean. commit: faecb196815e248d3ecfb03c680a4507229c2a56. platform: linux/amd64
I1027 10:01:12.303518       1 main.go:248] Communication with server successful
I1027 10:01:12.303837       1 main.go:204] Starting metrics server: [::]:8080
I1027 10:01:12.303864       1 metrics_handler.go:102] Autosharding enabled with pod=kube-state-metrics-0 pod_namespace=kube-system
I1027 10:01:12.303886       1 metrics_handler.go:103] Auto detecting sharding settings.
I1027 10:01:12.303881       1 main.go:193] Starting kube-state-metrics self metrics server: [::]:8081
I1027 10:01:12.304116       1 main.go:64] level=info msg="TLS is disabled." http2=false
I1027 10:01:12.304203       1 main.go:64] level=info msg="TLS is disabled." http2=false
I1027 10:01:12.363206       1 builder.go:190] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments

4. OpenTelemetry data collection is normal: [root@master131 ~]# kubectl logs -f otel-collector-7bb5b98564-stvdg

2021-10-27T11:34:43.650Z        info    service/collector.go:262        Starting otelcol...     {"Version": "v0.29.0", "NumCPU": 28}
2021-10-27T11:34:43.657Z        info    service/collector.go:322        Using memory ballast    {"MiBs": 683}
2021-10-27T11:34:43.657Z        info    service/collector.go:170        Setting up own telemetry...
2021-10-27T11:34:43.659Z        info    service/telemetry.go:99 Serving Prometheus metrics      {"address": ":8888", "level": 0, "service.instance.id": "9903e31e-d72f-4222-a2a8-32c94a0836db"}
2021-10-27T11:34:43.659Z        info    service/collector.go:205        Loading configuration...
2021-10-27T11:34:43.662Z        info    service/collector.go:221        Applying configuration...
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:274        Exporter was built.     {"kind": "exporter", "exporter": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:274        Exporter was built.     {"kind": "exporter", "exporter": "logging"}
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:204        Pipeline was built.     {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    builder/receivers_builder.go:230        Receiver was built.     {"kind": "receiver", "name": "prometheus", "datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    service/service.go:137  Starting extensions...
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:53        Extension is starting...        {"kind": "extension", "name": "health_check"}
2021-10-27T11:34:43.662Z        info    healthcheckextension/healthcheckextension.go:41 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Port":0,"TCPAddr":{"Endpoint":"0.0.0.0:13133"}}}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:59        Extension started.      {"kind": "extension", "name": "health_check"}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:53        Extension is starting...        {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    zpagesextension/zpagesextension.go:42   Register Host's zPages  {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    zpagesextension/zpagesextension.go:55   Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"TCPAddr":{"Endpoint":"localhost:55679"}}}
2021-10-27T11:34:43.662Z        info    builder/extensions_builder.go:59        Extension started.      {"kind": "extension", "name": "zpages"}
2021-10-27T11:34:43.662Z        info    service/service.go:182  Starting exporters...
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:92 Exporter is starting... {"kind": "exporter", "name": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:97 Exporter started.       {"kind": "exporter", "name": "opencensus"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:92 Exporter is starting... {"kind": "exporter", "name": "logging"}
2021-10-27T11:34:43.662Z        info    builder/exporters_builder.go:97 Exporter started.       {"kind": "exporter", "name": "logging"}
2021-10-27T11:34:43.662Z        info    service/service.go:187  Starting processors...
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:51 Pipeline is starting... {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    builder/pipelines_builder.go:62 Pipeline is started.    {"pipeline_name": "metrics", "pipeline_datatype": "metrics"}
2021-10-27T11:34:43.662Z        info    service/service.go:192  Starting receivers...
2021-10-27T11:34:43.662Z        info    builder/receivers_builder.go:70 Receiver is starting... {"kind": "receiver", "name": "prometheus"}
2021-10-27T11:34:43.663Z        info    kubernetes/kubernetes.go:282    Using pod service account via in-cluster config {"kind": "receiver", "name": "prometheus", "level": "info", "discovery": "kubernetes"}
2021-10-27T11:34:43.679Z        info    kubernetes/kubernetes.go:282    Using pod service account via in-cluster config {"kind": "receiver", "name": "prometheus", "level": "info", "discovery": "kubernetes"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "static/0", "subs": "[jvm-node-exporter]"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "kubernetes/1", "subs": "[kubernetes-cadvisor]"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:195        Starting provider       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "kubernetes/2", "subs": "[kube-state-metrics]"}
2021-10-27T11:34:43.680Z        info    builder/receivers_builder.go:75 Receiver started.       {"kind": "receiver", "name": "prometheus"}
2021-10-27T11:34:43.680Z        info    discovery/manager.go:213        Discoverer channel closed       {"kind": "receiver", "name": "prometheus", "level": "debug", "provider": "static/0"}
2021-10-27T11:34:43.680Z        info    healthcheck/handler.go:129      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2021-10-27T11:34:43.680Z        info    service/collector.go:182        Everything is ready. Begin running and processing data.
2021-10-27T11:34:50.493Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 170}
2021-10-27T11:34:50.493Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 170}
2021-10-27T11:34:50.708Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 70}
2021-10-27T11:34:51.930Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 46}
2021-10-27T11:34:52.944Z        INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 70}

5. I am not sure whether the OpenTelemetry configuration is wrong: [root@master131 ~]# vi ./otel-collector-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
  namespace: default
data:
  otel-collector-config: |
    #1. Receivers: where the data comes in
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 5s
            evaluation_interval: 5s
          scrape_configs:
            # scrape JVM / node-exporter metrics
            - job_name: 'jvm-node-exporter'
              static_configs:
                - targets: ['192.168.1.131:9110']
            # scrape k8s metrics
            - job_name: 'kubernetes-cadvisor'
              scheme: https
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
              - role: node
              relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)
              - source_labels: []       # relabel the cluster name 
                target_label: cluster
                replacement: k8s-131
              - target_label: __address__
                replacement: kubernetes.default.svc:443
              - source_labels: [__meta_kubernetes_node_name]
                regex: (.+)
                target_label: __metrics_path__
                replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
              - source_labels: [instance]   # relabel the node name 
                separator: ;
                regex: (.+)
                target_label: node
                replacement: $$1
                action: replace
            - job_name: kube-state-metrics
              kubernetes_sd_configs:
              - role: endpoints
              relabel_configs:
              - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                regex: kube-state-metrics
                replacement: $$1
                action: keep
              - action: labelmap
                regex: __meta_kubernetes_service_label_(.+)
              - source_labels: []  # relabel the cluster name 
                target_label: cluster
                replacement: k8s-131
    #2. Processors: pre-processing applied before the data is exported
    processors:
      batch:
    # self health check
    extensions:
      health_check: {}
      zpages: {}
    #3. Exporters: where the data goes out
    exporters:
      opencensus:
        endpoint: "192.168.1.214:11800"
        insecure: true
      logging:
        logLevel: info
    service:
      extensions: [health_check, zpages]
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [opencensus,logging]

---

apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: default
spec:
  type: NodePort
  ports:
  - name: metrics 
    port: 8888
    targetPort: 8888
    nodePort: 58888
  selector:
    component: otel-collector

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
  namespace: default
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      serviceAccountName: prometheus
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-collector-config.yaml"
          - "--log-level=info"
          - "--mem-ballast-size-mib=683"
        image: otel/opentelemetry-collector:0.29.0
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # ZPages endpoint
        - containerPort: 55680 # ZPages endpoint
        - containerPort: 4317  # OpenTelemetry receiver
        - containerPort: 8888  # querying metrics
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol
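Since the empty panel is the k8s service one, the `kube-state-metrics` scrape job is the part of the config worth double-checking: its `keep` rule drops every endpoint whose backing Service does not carry the label `app.kubernetes.io/name: kube-state-metrics`. A small Python sketch of the relabel semantics used above (my own simplified model, not Prometheus source; label names are illustrative):

```python
import re

# Simplified model of Prometheus relabel_configs. Returns the final label
# set, or None if a `keep` rule drops the target from the scrape.

def relabel(labels, configs):
    labels = dict(labels)
    for c in configs:
        action = c.get("action", "replace")
        if action == "labelmap":
            pat = re.compile(c["regex"])
            for name, value in list(labels.items()):
                m = pat.fullmatch(name)
                if m:
                    labels[m.group(1)] = value
            continue
        src = ";".join(labels.get(s, "") for s in c.get("source_labels", []))
        m = re.fullmatch(c.get("regex", "(.*)"), src)
        if action == "keep":
            if m is None:
                return None  # target dropped from the scrape
        elif action == "replace" and m is not None:
            # Prometheus writes $1; Python's Match.expand wants \1.
            repl = c.get("replacement", "$1").replace("$", "\\")
            labels[c["target_label"]] = m.expand(repl)
    return labels

# The kube-state-metrics job from the config above ($$1 unescaped to $1):
cfgs = [
    {"action": "keep",
     "source_labels": ["__meta_kubernetes_service_label_app_kubernetes_io_name"],
     "regex": "kube-state-metrics"},
    {"action": "labelmap", "regex": "__meta_kubernetes_service_label_(.+)"},
    {"target_label": "cluster", "replacement": "k8s-131"},
]
```

If `relabel(...)` returns `None` for your kube-state-metrics endpoints (for example, because the Service carries `app-kubernetes-io/name` under a different label key or value), the job scrapes nothing and the service dashboard stays empty; `kubectl get svc -n kube-system kube-state-metrics --show-labels` would confirm the actual labels.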

What you expected to happen

It may be a problem with the OpenTelemetry Collector configuration, but I don't know where the problem is. Please help.

How to reproduce

The OpenTelemetry Collector configuration file is as described above.

Anything else

No response

Are you willing to submit PR?

Code of Conduct

wu-sheng commented 2 years ago

All discussions on GitHub have to be in English. If you want the community's help, check the details rather than the UI. If something is not there, we don't know what happened unless you provide a clue.