canonical / discourse-k8s-operator

discourse-k8s-operator - charm repository.
Apache License 2.0

The S3 upload_assets routine can fail on pod startup, resulting in pages referencing S3-hosted files that don't exist #208

Closed barryprice closed 6 months ago

barryprice commented 7 months ago

Bug Description

image (1)

It's possible for Discourse to fail its S3 upload_assets routine after a restart, resulting in errors like the one above.

The HTML served from the pods loads fine, but the asset URLs it references (e.g. JavaScript) point to paths that the upload_assets routine was expected to create, and those files were never uploaded to S3.

This results in a broken page showing a spinner; the main content never appears.

To Reproduce

Deploy the application with s3_enabled=True and all associated config set.

Wait for (or force) a restart.

Be unlucky enough to experience this bug (it's unclear how/why it happened).

Environment

prod-discourse-ubuntu-com-k8s@is-bastion-ps6:~$ juju version
3.1.6-ubuntu-amd64
prod-discourse-ubuntu-com-k8s@is-bastion-ps6:~$ juju status
Model                          Controller    Cloud/Region         Version  SLA          Timestamp
prod-discourse-ubuntu-com-k8s  prodstack-is  k8s-prod-is/default  3.1.6    unsupported  14:59:38Z

SAAS        Status  Store         URL
grafana     active  local         admin/prod-cos-k8s-ps6-is-charms.grafana
loki        active  local         admin/prod-cos-k8s-ps6-is-charms.loki
postgresql  active  prodstack-is  admin/prod-discourse-ubuntu-com-db.postgresql
prometheus  active  local         admin/prod-cos-k8s-ps6-is-charms.prometheus

App                       Version  Status  Scale  Charm                     Channel        Rev  Address        Exposed  Message
discourse-k8s             3.2.0    active      2  discourse-k8s             latest/stable   95  10.87.199.71   no       
nginx-ingress-integrator  25.3.0   active      1  nginx-ingress-integrator  latest/stable   81  10.87.154.21   no       
redis-k8s                 7.0.4    active      1  redis-k8s                 latest/edge     26  10.87.238.179  no       

Unit                         Workload  Agent  Address          Ports  Message
discourse-k8s/0              active    idle   192.168.102.255         
discourse-k8s/1*             active    idle   192.168.103.157         
nginx-ingress-integrator/0*  active    idle   192.168.102.249         
redis-k8s/0*                 active    idle   192.168.102.248         
prod-discourse-ubuntu-com-k8s@is-bastion-ps6:~$

This is running on an OpenStack cloud, with S3 integration enabled.

Relevant log output

This behaviour doesn't appear to be logged. A simple solution might be to add retry logic to the `image/scripts/pod_setup` script, if I'm understanding correctly how this works.
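
The retry idea could be sketched roughly as follows. This is a generic Python helper, not code from the charm or from `pod_setup`; the function name `retry` and all of its parameters are hypothetical and only illustrate retrying a failing step with exponential backoff.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    Hypothetical helper for illustration only. The delay between attempts
    starts at `initial_delay` seconds and is multiplied by `backoff` each time.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error reach the caller so the
                # failure is still visible (and logged) rather than swallowed.
                raise
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")
```

One way to use it would be to wrap the upload step, for instance `retry(lambda: subprocess.run(cmd, check=True), retry_on=(subprocess.CalledProcessError,))`, where `cmd` is whatever command performs the upload. Re-raising after the final attempt keeps a persistent failure visible instead of hiding it.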

Additional context

No response

mthaddon commented 7 months ago

It seems like the behaviour must be logged somewhere (and if it isn't, perhaps there's a logging level we can adjust). There's juju debug-log, the workload container logs, and logs in /srv/discourse/app/log in the discourse container as well. If we can get an exact time/instance where this happens and can look at the logs relatively quickly (before they rotate out or the pod is restarted again), we should have enough info to figure out what's happening.

mthaddon commented 7 months ago

Saw this in the logs:

unit-discourse-k8s-1: 2024-04-16 03:59:43 ERROR unit.discourse-k8s/1.juju-log S3 migration failed with code 1.
Traceback (most recent call last):
  File "./src/charm.py", line 606, in _run_s3_migration
    process.wait_output()
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/pebble.py", line 1441, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/srv/discourse/app/bin/bundle', 'exec', 'rake', 's3:upload_assets'], stdout='gem install rrule -v 0.4.4 -i /srv/discourse/app/plugins/discourse-calendar/gems/3.2.2 --no-document --ignore-dependencies
 --no-user-install\nSuccessfully installed rrule-0.4.4\n1 gem installed\ngem install webrick -v 1.7.0 -i /srv/discourse/app/plugins/discourse-prometheus/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully installed webrick-1.7.
0\n1 gem installed\ngem install prometheus_exporter -v 2.0.6 -i /srv/discourse/app/plugins/discourse-prometheus/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nprometheus_exporter will only bind to localhost by default as of v0.5\nSucce
ssfully installed prometheus_exporter-2.0.6\n1 gem installed\ngem install macaddr -v 1.0.0 -i /srv/discourse/app/plugins/discourse-saml/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully installed macaddr-1.0.0\n1 gem installe
d\ngem install uuid -v 2.3.7 -i /srv/discourse/app/plugins/discourse-saml/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully i' [truncated], stderr="Couldn't connect to Redis\nrake aborted!\nRedis::CannotConnectError: Error co
nnecting to Redis on 192.168.101.92:6379 (Redis::TimeoutError)\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:398:in `rescue in establish_connection'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/red
is/client.rb:379:in `establish_connection'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:115:in `block in connect'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:344:in `with_reconnec
t'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:114:in `connect'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:409:in `ensure_connected'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0
/gems/redis-4.8.1/lib/redis/client.rb:269:in `block in process'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:356:in `logging'\n/srv/discourse/app/vendor/bundle" [truncated]
unit-discourse-k8s-1: 2024-04-16 03:59:43 ERROR unit.discourse-k8s/1.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 791, in <module>
    main(DiscourseCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/main.py", line 436, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 194, in _on_config_changed
    self._configure_pod()
  File "./src/charm.py", line 673, in _configure_pod
    self._run_s3_migration()
  File "./src/charm.py", line 606, in _run_s3_migration
    process.wait_output()
  File "/var/lib/juju/agents/unit-discourse-k8s-1/charm/venv/ops/pebble.py", line 1441, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/srv/discourse/app/bin/bundle', 'exec', 'rake', 's3:upload_assets'], stdout='gem install rrule -v 0.4.4 -i /srv/discourse/app/plugins/discourse-calendar/gems/3.2.2 --no-document --ignore-dependencies
 --no-user-install\nSuccessfully installed rrule-0.4.4\n1 gem installed\ngem install webrick -v 1.7.0 -i /srv/discourse/app/plugins/discourse-prometheus/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully installed webrick-1.7.
0\n1 gem installed\ngem install prometheus_exporter -v 2.0.6 -i /srv/discourse/app/plugins/discourse-prometheus/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nprometheus_exporter will only bind to localhost by default as of v0.5\nSucce
ssfully installed prometheus_exporter-2.0.6\n1 gem installed\ngem install macaddr -v 1.0.0 -i /srv/discourse/app/plugins/discourse-saml/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully installed macaddr-1.0.0\n1 gem installe
d\ngem install uuid -v 2.3.7 -i /srv/discourse/app/plugins/discourse-saml/gems/3.2.2 --no-document --ignore-dependencies --no-user-install\nSuccessfully i' [truncated], stderr="Couldn't connect to Redis\nrake aborted!\nRedis::CannotConnectError: Error co
nnecting to Redis on 192.168.101.92:6379 (Redis::TimeoutError)\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:398:in `rescue in establish_connection'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/red
is/client.rb:379:in `establish_connection'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:115:in `block in connect'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:344:in `with_reconnec
t'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:114:in `connect'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:409:in `ensure_connected'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0
/gems/redis-4.8.1/lib/redis/client.rb:269:in `block in process'\n/srv/discourse/app/vendor/bundle/ruby/3.2.0/gems/redis-4.8.1/lib/redis/client.rb:356:in `logging'\n/srv/discourse/app/vendor/bundle" [truncated]
unit-discourse-k8s-1: 2024-04-16 03:59:43 ERROR juju.worker.uniter.operation hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1

javierdelapuente commented 6 months ago

This should be fixed by now: the charm no longer uses the stored state, so if the pod restarts it should upload the assets correctly. Besides, the assets are now precompiled in the rock, so they will not change unless the image changes.
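
The traceback earlier in this thread shows the rake task aborting on a Redis connection timeout. As a complementary guard (not the fix described above, and not something the charm is known to do), a caller could wait for the Redis port to accept TCP connections before starting the upload. The sketch below uses only the Python standard library; `wait_for_tcp` and its parameters are hypothetical names, and a successful TCP connection only shows the port is open, not that Redis is ready to serve requests.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration only. `interval` is used both as the
    per-attempt connect timeout and as the pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (refused, timed out, unreachable)
            # when the connection cannot be established.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller might check `wait_for_tcp(redis_host, 6379)` and defer the upload when it returns False, where `redis_host` is whatever address the deployment provides for Redis.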