canonical/spark-history-server-k8s-operator

This repository contains the Charmed Spark History Server operator, to be deployed with Juju.

Spark history server unit BLOCKED - Missing S3 relation #31

Open Barteus opened 7 months ago

Barteus commented 7 months ago

Steps to reproduce

  1. Deploy MicroK8s using the microk8s charm, together with grafana-agent
  2. Enable the MicroK8s minio addon
  3. juju add-model spark
  4. Deploy spark-history-server-k8s (channel 3.4/stable)
  5. Deploy s3-integrator
  6. Configure the credentials using the action (the bucket does not exist at this point); try incorrect credentials and URL a few times before providing the correct ones (see the command sketch after this list)
  7. Use the COS traefik as a cross-model relation (CMR)
  8. Create the bucket after 5 minutes
  9. Remove the Pod manually so it picks up the new settings
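
For reference, the steps above roughly correspond to the commands below. This is a sketch, not a verbatim record: the charm names, channels, and relation endpoints follow the juju status output further down, while the MinIO endpoint, bucket name, and credentials are placeholders, and sync-s3-credentials is the s3-integrator action assumed for step 6.

$ juju add-model spark
$ juju deploy spark-history-server-k8s --channel 3.4/stable
$ juju deploy s3-integrator --channel edge
$ juju integrate s3-integrator spark-history-server-k8s
$ juju config s3-integrator endpoint=http://<minio-endpoint> bucket=<bucket-name>
$ juju run s3-integrator/leader sync-s3-credentials access-key=<access-key> secret-key=<secret-key>
$ juju consume admin/cos.traefik cos-traefik
$ juju integrate cos-traefik:ingress spark-history-server-k8s:ingress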

Expected behavior

Charm status - active

Actual behavior

The history server works correctly, but the juju unit status is BLOCKED.

Configuration changes do not help, nor does re-running the action to provide the s3-integrator with the same keys.
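
Re-running the action with the same keys looks roughly like this (a sketch; sync-s3-credentials is the s3-integrator action assumed here, and the key values are placeholders):

$ juju run s3-integrator/leader sync-s3-credentials access-key=<same-access-key> secret-key=<same-secret-key>
$ juju status spark-history-server-k8s
# the unit stays blocked with "Missing S3 relation"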

Versions

Operating system: Ubuntu 22.04.3 LTS

Juju CLI: 3.3.1

Juju agent: 3.3.1

Charm revision: spark-history-server-k8s rev 15 (3.4/stable), s3-integrator rev 14 (see juju status below)

$ juju status --relations
Model  Controller     Cloud/Region          Version  SLA          Timestamp
spark  aws-eu-west-1  demo-spark/localhost  3.3.1    unsupported  08:45:20Z

SAAS         Status  Store          URL
cos-traefik  active  aws-eu-west-1  admin/cos.traefik

App                       Version  Status   Scale  Charm                     Channel     Rev  Address         Exposed  Message
s3-integrator                      active       1  s3-integrator             edge         14  10.152.183.19   no       
spark-history-server-k8s           waiting      1  spark-history-server-k8s  3.4/stable   15  10.152.183.249  no       waiting for units to settle down

Unit                         Workload  Agent  Address      Ports  Message
s3-integrator/0*             active    idle   10.1.45.200         
spark-history-server-k8s/0*  blocked   idle   10.1.45.255         Missing S3 relation

Integration provider               Requirer                                 Interface            Type     Message
cos-traefik:ingress                spark-history-server-k8s:ingress         ingress              regular  
s3-integrator:s3-credentials       spark-history-server-k8s:s3-credentials  s3                   regular  
s3-integrator:s3-integrator-peers  s3-integrator:s3-integrator-peers        s3-integrator-peers  peer     

microk8s:

juju status
Model     Controller     Cloud/Region   Version  SLA          Timestamp
microk8s  aws-eu-west-1  aws/eu-west-1  3.3.1    unsupported  09:18:19Z

SAAS              Status  Store          URL
cos-alertmanager  active  aws-eu-west-1  admin/cos.alertmanager-karma-dashboard
cos-grafana       active  aws-eu-west-1  admin/cos.grafana-dashboards
cos-loki          active  aws-eu-west-1  admin/cos.loki-logging
cos-prometheus    active  aws-eu-west-1  admin/cos.prometheus-receive-remote-write

App                Version  Status  Scale  Charm          Channel      Rev  Exposed  Message
grafana-agent-cos           active      1  grafana-agent  latest/edge   28  no       
microk8s           1.29.1   active      1  microk8s       latest/edge  232  yes      node is ready

Unit                    Workload  Agent  Machine  Public address  Ports      Message
microk8s/0*             active    idle   0        3.252.197.189   16443/tcp  node is ready
  grafana-agent-cos/0*  active    idle            3.252.197.189              

Machine  State    Address        Inst id              Base          AZ          Message
0        started  3.252.197.189  i-014cd20da6c22599a  ubuntu@22.04  eu-west-1b  running

Log output

Juju debug log: log.txt

github-actions[bot] commented 7 months ago

https://warthogs.atlassian.net/browse/DPE-3469

deusebio commented 7 months ago

Hi @Barteus !

Yes, the issue here is that the charm keeps staying in blocked status because the bucket was not there when the relation was created. One way to work around this is to remove and recreate the relation after the bucket is created; in general, we indeed advise creating the bucket beforehand.
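
A rough sketch of that workaround, with the relation endpoint names taken from the juju status output above:

$ juju remove-relation s3-integrator:s3-credentials spark-history-server-k8s:s3-credentials
# create the bucket in MinIO first, then:
$ juju integrate s3-integrator:s3-credentials spark-history-server-k8s:s3-credentials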

Although this does not seem to be a very critical or blocking issue, I agree that as a further improvement we could automate the process a bit and add checks that revert the status from blocked to active on update-status events. Given this, I'll leave this issue open. Hopefully, we will have some spare cycles to work on this in the coming weeks.
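
As a side note for anyone verifying such a recovery path once it is implemented: update-status firing can be made more frequent on a test model (a sketch, using the standard Juju model configuration key):

$ juju model-config update-status-hook-interval=1m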