canonical / charmed-spark-rock

This repository contains the packaging metadata for creating a ROCK for Apache Spark

Short running Jobs, Drivers or Executors are not visible in the Grafana Dashboard #71

Closed: Barteus closed this issue 9 months ago

Barteus commented 9 months ago

Reproduce

  1. Deploy MicroK8s using the microk8s charm, together with the Grafana agent
  2. Deploy COS
  3. Deploy the Prometheus Pushgateway and cos-configuration-k8s
  4. Run a job using spark-submit:
    spark-client.spark-submit --deploy-mode cluster \
    --num-executors 5 \
    --master k8s://https://0.0.0.0:16443 \
    --conf  spark.app.name=spark-demo  \
    --conf  spark.eventLog.enabled=true  \
    --conf  spark.driver.memory=4g \
    --conf  spark.eventLog.dir=s3a://history-server/spark-events  \
    --conf  spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider  \
    --conf  spark.hadoop.fs.s3a.connection.ssl.enabled=false  \
    --conf  spark.hadoop.fs.s3a.path.style.access=true  \
    --conf  spark.hadoop.fs.s3a.access.key=xxx  \
    --conf  spark.hadoop.fs.s3a.endpoint=http://10.152.183.175  \
    --conf  spark.hadoop.fs.s3a.secret.key=xxx  \
    --conf  spark.history.fs.logDirectory=s3a://history-server/spark-events/  \
    --conf  spark.metrics.conf.driver.sink.prometheus.pushgateway-address=10.152.183.241:9091  \
    --conf  spark.metrics.conf.driver.sink.prometheus.class=org.apache.spark.banzaicloud.metrics.sink.PrometheusSink  \
    --conf  spark.metrics.conf.driver.sink.prometheus.enable-dropwizard-collector=true  \
    --conf  spark.metrics.conf.driver.sink.prometheus.period=1  \
    --conf  spark.metrics.conf.driver.sink.prometheus.metrics-name-capture-regex='([a-z0-9]*_[a-z0-9]*_[a-z0-9]*_)(.+)'  \
    --conf  spark.metrics.conf.driver.sink.prometheus.metrics-name-replacement=\$2  \
    --conf  spark.metrics.conf.executor.sink.prometheus.pushgateway-address=10.152.183.241:9091  \
    --conf  spark.metrics.conf.executor.sink.prometheus.class=org.apache.spark.banzaicloud.metrics.sink.PrometheusSink  \
    --conf  spark.metrics.conf.executor.sink.prometheus.enable-dropwizard-collector=true  \
    --conf  spark.metrics.conf.executor.sink.prometheus.period=1  \
    --conf  spark.metrics.conf.executor.sink.prometheus.metrics-name-capture-regex='([a-z0-9]*_[a-z0-9]*_[a-z0-9]*_)(.+)'  \
    --conf  spark.metrics.conf.executor.sink.prometheus.metrics-name-replacement=\$2  \
    --conf  spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark:3.4-22.04_edge  \
    --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples_2.12-3.4.2.jar 10

Actual

Spark jobs that run for less than a minute, or that fail within the first minute, are not visible in the Grafana dashboard or in Prometheus.

Expected

All started jobs are visible in the Grafana dashboard.

Versions

Operating system: Ubuntu 22.04.3 LTS

Juju CLI: 3.3.1

Juju agent: 3.3.1

Charm revision:

$ juju status --relations
Model  Controller     Cloud/Region          Version  SLA          Timestamp
spark  aws-eu-west-1  demo-spark/localhost  3.3.1    unsupported  08:45:20Z

SAAS         Status  Store          URL
cos-traefik  active  aws-eu-west-1  admin/cos.traefik

App                       Version  Status   Scale  Charm                     Channel     Rev  Address         Exposed  Message
s3-integrator                      active       1  s3-integrator             edge         14  10.152.183.19   no       
spark-history-server-k8s           waiting      1  spark-history-server-k8s  3.4/stable   15  10.152.183.249  no       waiting for units to settle down

Unit                         Workload  Agent  Address      Ports  Message
s3-integrator/0*             active    idle   10.1.45.200         
spark-history-server-k8s/0*  blocked   idle   10.1.45.255         Missing S3 relation

Integration provider               Requirer                                 Interface            Type     Message
cos-traefik:ingress                spark-history-server-k8s:ingress         ingress              regular  
s3-integrator:s3-credentials       spark-history-server-k8s:s3-credentials  s3                   regular  
s3-integrator:s3-integrator-peers  s3-integrator:s3-integrator-peers        s3-integrator-peers  peer     

microk8s:

$ juju status
Model     Controller     Cloud/Region   Version  SLA          Timestamp
microk8s  aws-eu-west-1  aws/eu-west-1  3.3.1    unsupported  09:18:19Z

SAAS              Status  Store          URL
cos-alertmanager  active  aws-eu-west-1  admin/cos.alertmanager-karma-dashboard
cos-grafana       active  aws-eu-west-1  admin/cos.grafana-dashboards
cos-loki          active  aws-eu-west-1  admin/cos.loki-logging
cos-prometheus    active  aws-eu-west-1  admin/cos.prometheus-receive-remote-write

App                Version  Status  Scale  Charm          Channel      Rev  Exposed  Message
grafana-agent-cos           active      1  grafana-agent  latest/edge   28  no       
microk8s           1.29.1   active      1  microk8s       latest/edge  232  yes      node is ready

Unit                    Workload  Agent  Machine  Public address  Ports      Message
microk8s/0*             active    idle   0        3.252.197.189   16443/tcp  node is ready
  grafana-agent-cos/0*  active    idle            3.252.197.189              

Machine  State    Address        Inst id              Base          AZ          Message
0        started  3.252.197.189  i-014cd20da6c22599a  ubuntu@22.04  eu-west-1b  running

COS:

$ juju status --relations
Model  Controller     Cloud/Region          Version  SLA          Timestamp
cos    aws-eu-west-1  demo-spark/localhost  3.3.1    unsupported  09:43:46Z

App                         Version  Status  Scale  Charm                       Channel  Rev  Address         Exposed  Message
alertmanager                0.25.0   active      1  alertmanager-k8s            stable    96  10.152.183.187  no       
catalogue                            active      1  catalogue-k8s               stable    33  10.152.183.133  no       
cos-configuration-k8s       3.5.0    active      1  cos-configuration-k8s       stable    42  10.152.183.234  no       
grafana                     9.2.1    active      1  grafana-k8s                 stable    93  10.152.183.51   no       
loki                        2.7.4    active      1  loki-k8s                    stable   105  10.152.183.236  no       
prometheus                  2.47.2   active      1  prometheus-k8s              stable   159  10.152.183.199  no       
prometheus-pushgateway-k8s  1.6.2    active      1  prometheus-pushgateway-k8s  edge       7  10.152.183.241  no       
traefik                     2.10.4   active      1  traefik-k8s                 stable   166  150.0.0.1       no       

Unit                           Workload  Agent  Address      Ports  Message
alertmanager/0*                active    idle   10.1.45.248         
catalogue/0*                   active    idle   10.1.45.221         
cos-configuration-k8s/0*       active    idle   10.1.45.218         
grafana/0*                     active    idle   10.1.45.204         
loki/0*                        active    idle   10.1.45.208         
prometheus-pushgateway-k8s/0*  active    idle   10.1.45.195         
prometheus/0*                  active    idle   10.1.45.222         
traefik/0*                     active    idle   10.1.45.215         

Offer                            Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager-karma-dashboard     alertmanager  alertmanager-k8s  96   0/0        karma-dashboard       karma_dashboard          provider
grafana-dashboards               grafana       grafana-k8s       93   1/1        grafana-dashboard     grafana_dashboard        requirer
loki-logging                     loki          loki-k8s          105  1/1        logging               loki_push_api            provider
prometheus-receive-remote-write  prometheus    prometheus-k8s    159  1/1        receive-remote-write  prometheus_remote_write  provider
prometheus-scrape                prometheus    prometheus-k8s    159  0/0        metrics-endpoint      prometheus_scrape        requirer
traefik                          traefik       traefik-k8s       166  1/1        ingress               ingress                  provider

Integration provider                          Requirer                                      Interface                  Type     Message
alertmanager:alerting                         loki:alertmanager                             alertmanager_dispatch      regular  
alertmanager:alerting                         prometheus:alertmanager                       alertmanager_dispatch      regular  
alertmanager:grafana-dashboard                grafana:grafana-dashboard                     grafana_dashboard          regular  
alertmanager:grafana-source                   grafana:grafana-source                        grafana_datasource         regular  
alertmanager:replicas                         alertmanager:replicas                         alertmanager_replica       peer     
alertmanager:self-metrics-endpoint            prometheus:metrics-endpoint                   prometheus_scrape          regular  
catalogue:catalogue                           alertmanager:catalogue                        catalogue                  regular  
catalogue:catalogue                           grafana:catalogue                             catalogue                  regular  
catalogue:catalogue                           prometheus:catalogue                          catalogue                  regular  
catalogue:replicas                            catalogue:replicas                            catalogue_replica          peer     
cos-configuration-k8s:grafana-dashboards      grafana:grafana-dashboard                     grafana_dashboard          regular  
cos-configuration-k8s:replicas                cos-configuration-k8s:replicas                cos_configuration_replica  peer     
grafana:grafana                               grafana:grafana                               grafana_peers              peer     
grafana:metrics-endpoint                      prometheus:metrics-endpoint                   prometheus_scrape          regular  
grafana:replicas                              grafana:replicas                              grafana_replicas           peer     
loki:grafana-dashboard                        grafana:grafana-dashboard                     grafana_dashboard          regular  
loki:grafana-source                           grafana:grafana-source                        grafana_datasource         regular  
loki:metrics-endpoint                         prometheus:metrics-endpoint                   prometheus_scrape          regular  
loki:replicas                                 loki:replicas                                 loki_replica               peer     
prometheus-pushgateway-k8s:metrics-endpoint   prometheus:metrics-endpoint                   prometheus_scrape          regular  
prometheus-pushgateway-k8s:pushgateway-peers  prometheus-pushgateway-k8s:pushgateway-peers  pushgateway_peers          peer     
prometheus:grafana-dashboard                  grafana:grafana-dashboard                     grafana_dashboard          regular  
prometheus:grafana-source                     grafana:grafana-source                        grafana_datasource         regular  
prometheus:prometheus-peers                   prometheus:prometheus-peers                   prometheus_peers           peer     
traefik:ingress                               alertmanager:ingress                          ingress                    regular  
traefik:ingress                               catalogue:ingress                             ingress                    regular  
traefik:ingress-per-unit                      loki:ingress                                  ingress_per_unit           regular  
traefik:ingress-per-unit                      prometheus:ingress                            ingress_per_unit           regular  
traefik:metrics-endpoint                      prometheus:metrics-endpoint                   prometheus_scrape          regular  
traefik:peers                                 traefik:peers                                 traefik_peers              peer     
traefik:traefik-route                         grafana:ingress                               traefik_route              regular  

cos-configuration-k8s config:

$ juju config cos-configuration-k8s
application: cos-configuration-k8s
application-config: 
  juju-application-path: 
    default: /
    description: the relative http path used to access an application
    source: default
    type: string
    value: /
  juju-external-hostname: 
    description: the external hostname of an exposed application
    source: unset
    type: string
  kubernetes-ingress-allow-http: 
    default: false
    description: whether to allow HTTP traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-class: 
    default: nginx
    description: the class of the ingress controller to be used by the ingress resource
    source: default
    type: string
    value: nginx
  kubernetes-ingress-ssl-passthrough: 
    default: false
    description: whether to passthrough SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-ingress-ssl-redirect: 
    default: false
    description: whether to redirect SSL traffic to the ingress controller
    source: default
    type: bool
    value: false
  kubernetes-service-annotations: 
    description: a space separated set of annotations to add to the service
    source: unset
    type: attrs
  kubernetes-service-external-ips: 
    description: list of IP addresses for which nodes in the cluster will also accept
      traffic
    source: unset
    type: string
  kubernetes-service-externalname: 
    description: external reference that kubedns or equivalent will return as a CNAME
      record
    source: unset
    type: string
  kubernetes-service-loadbalancer-ip: 
    description: LoadBalancer will get created with the IP specified in this field
    source: unset
    type: string
  kubernetes-service-loadbalancer-sourceranges: 
    description: traffic through the load-balancer will be restricted to the specified
      client IPs
    source: unset
    type: string
  kubernetes-service-target-port: 
    description: name or number of the port to access on the pods targeted by the
      service
    source: unset
    type: string
  kubernetes-service-type: 
    description: determines how the Service is exposed
    source: unset
    type: string
  trust: 
    default: false
    description: Does this application have access to trusted credentials
    source: default
    type: bool
    value: false
charm: cos-configuration-k8s
settings: 
  git_branch: 
    default: master
    description: The git branch to check out.
    source: user
    type: string
    value: dashboard
  git_depth: 
    default: 1
    description: |
      Cloning depth, to truncate commit history to the specified number of commits. Zero means no truncating.
    source: default
    type: int
    value: 1
  git_repo: 
    description: URL to repo to clone and sync against.
    source: user
    type: string
    value: https://github.com/canonical/charmed-spark-rock
  git_rev: 
    default: HEAD
    description: The git revision (tag or hash) to check out
    source: default
    type: string
    value: HEAD
  git_ssh_key: 
    description: |
      An optional SSH private key to use when cloning the repository.
    source: unset
    type: string
  grafana_dashboards_path: 
    default: grafana_dashboards
    description: Relative path in repo to grafana dashboards.
    source: user
    type: string
    value: dashboards/prod/grafana/
  loki_alert_rules_path: 
    default: loki_alert_rules
    description: Relative path in repo to loki rules.
    source: default
    type: string
    value: loki_alert_rules
  prometheus_alert_rules_path: 
    default: prometheus_alert_rules
    description: Relative path in repo to prometheus rules.
    source: default
    type: string
    value: prometheus_alert_rules

deusebio commented 9 months ago

Metrics reach Grafana in three steps: Spark jobs push metrics to prometheus-pushgateway, Prometheus then scrapes that component, and the scraped metrics are finally exposed in Grafana.

The setting spark.metrics.conf.driver.sink.prometheus.period specifies how frequently the Spark jobs push to the pushgateway. However, how long it takes before those metrics end up in Prometheus/Grafana is controlled by the scrape interval of Prometheus, which is 60 seconds by default. You could use prometheus-scrape-config-k8s to customize this value so that it better matches the frequency at which your Spark jobs push metrics.
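As a rough sketch of that adjustment, the Pushgateway scrape job could be routed through prometheus-scrape-config-k8s in the cos model. This assumes the charm exposes a `scrape_interval` config option and the `configurable-scrape-jobs` and `metrics-endpoint` endpoints; please check the charm's documentation before relying on these names. The application name `scrape-interval-config`, the `15s` value and the `JUJU` override are only illustrative.

```shell
# Sketch only: route the Pushgateway scrape job through prometheus-scrape-config-k8s
# so that Prometheus scrapes it more often than the 60-second default.
# Assumptions: the `scrape_interval` option and the endpoint names used below.
# JUJU can be set to `echo` to print the commands instead of running them.
JUJU="${JUJU:-juju}"

set_pushgateway_scrape_interval() {
    interval="${1:-15s}"   # illustrative value

    # Deploy the scrape-config charm with the desired interval.
    "$JUJU" deploy prometheus-scrape-config-k8s scrape-interval-config \
        --config scrape_interval="$interval"

    # Replace the direct pushgateway -> prometheus relation shown in the status
    # output with pushgateway -> scrape-interval-config -> prometheus.
    "$JUJU" remove-relation prometheus-pushgateway-k8s:metrics-endpoint \
        prometheus:metrics-endpoint
    "$JUJU" integrate prometheus-pushgateway-k8s:metrics-endpoint \
        scrape-interval-config:configurable-scrape-jobs
    "$JUJU" integrate scrape-interval-config:metrics-endpoint \
        prometheus:metrics-endpoint
}

# Usage (in the cos model): set_pushgateway_scrape_interval 15s
```

Running it once with `JUJU=echo` prints the four commands without touching the model, which is a cheap way to review them first.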

In any case, you can check that metrics are indeed pushed to the pushgateway by accessing its endpoint at http://<pushgateway_ip>:9091. I have checked, and even for short jobs (like the one you submitted) the metrics are there. But unless you configure Prometheus otherwise, it can take up to 1 minute before they end up in Grafana. Of course, if the Spark job processes fail before pushing any metrics, there will be no metrics at all, and you can only look at the pod logs.
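A minimal sketch of that check from the command line follows. The helper name `list_pushed_metrics` is made up for illustration, and it assumes the application name (`spark-demo` in the command above) appears somewhere in the labels of the pushed samples, which may not hold for every sink configuration. The Pushgateway serves the pushed samples in the Prometheus text format under `/metrics`.

```shell
# Print the distinct metric names found in Prometheus exposition text on stdin,
# keeping only sample lines that contain the given substring (e.g. an app name).
list_pushed_metrics() {
    needle="$1"
    # Drop HELP/TYPE comments, keep matching samples, strip labels and values.
    grep -v '^#' | grep -F -- "$needle" | sed 's/[{ ].*$//' | sort -u
}

# Usage, with <pushgateway_ip> left as a placeholder:
#   curl -s "http://<pushgateway_ip>:9091/metrics" | list_pushed_metrics spark-demo
```

An empty result right after a job has finished would suggest the job never pushed, rather than a scrape delay on the Prometheus side.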

Hope this helps!