The way metrics end up in Grafana is that Spark jobs push metrics to prometheus-pushgateway; that component is then scraped by Prometheus, and the metrics are finally exposed in Grafana.
With the setting spark.metrics.conf.driver.sink.prometheus.period you specify how frequently your Spark jobs push metrics to the pushgateway. However, when those metrics end up in Prometheus/Grafana is controlled by the scrape interval of Prometheus, which is 60 seconds by default. You can use prometheus-scrape-config-k8s to customize this value so that it better matches the frequency at which your Spark jobs push metrics.
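For illustration, a minimal sketch of both knobs. The submit command shape, the application placeholder, the deployed application names, and the relation endpoint names are assumptions about your deployment, and scrape_interval is the config option I'd expect on the prometheus-scrape-config-k8s charm; adjust to what `juju config` actually shows you:

```bash
# Push metrics from the Spark driver every 5 seconds (value is illustrative)
spark-submit \
  --conf spark.metrics.conf.driver.sink.prometheus.period=5 \
  <your-application>

# Scrape the pushgateway every 10 seconds instead of the default 60:
# deploy prometheus-scrape-config-k8s and insert it between the
# pushgateway and Prometheus (app and endpoint names are assumptions)
juju deploy prometheus-scrape-config-k8s scrape-config --config scrape_interval=10s
juju relate scrape-config:configurable-scrape-jobs pushgateway:metrics-endpoint
juju relate scrape-config:metrics-endpoint prometheus:metrics-endpoint
```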
Anyhow, you can also check that metrics are indeed pushed to the pushgateway by accessing its endpoint at http://<pushgateway_ip>:9091. I have checked, and even for short jobs (like the one you submitted) the metrics are indeed there. But unless you configure Prometheus otherwise, it will take up to one minute before those metrics end up in Grafana. Of course, if the Spark job's process fails before pushing any metrics, you won't have any metrics at all, and you can only look at the pod logs.
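Concretely, something like the following (keeping <pushgateway_ip> as a placeholder for your unit's IP; the pod name and namespace are likewise placeholders):

```bash
# List everything the pushgateway currently holds (9091 is its default port;
# the pushgateway serves its own /metrics page with all pushed metrics)
curl http://<pushgateway_ip>:9091/metrics

# If the job died before pushing anything, inspect the driver pod instead
kubectl logs <driver-pod-name> -n <namespace>
```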
Hope this helps!
Reproduce
Actual
Spark jobs that run for less than a minute, or that fail within the first minute, are not visible in the Grafana dashboard or in Prometheus.
Expected
All started jobs are visible in the Grafana dashboard.
Versions
Operating system: Ubuntu 22.04.3 LTS
Juju CLI: 3.3.1
Juju agent: 3.3.1
Charm revision:
microk8s:
COS:
cos-configuration-k8s config: