canonical / postgresql-operator

A Charmed Operator for running PostgreSQL on machines
https://charmhub.io/postgresql
Apache License 2.0

MAAS/LXD is not supported - cannot assign unit "postgresql/0" to machine 0/lxd/0: adding storage to lxd container not supported #549

Open nobuto-m opened 1 month ago

nobuto-m commented 1 month ago

Using LXD containers is pretty common with the MAAS provider to reduce the total number of machines, instead of dedicating a bare-metal machine to each unit of a micro service.

For example, the official landscape-dense-maas bundle uses that architecture: https://ubuntu.com/landscape/docs/juju-installation#heading--landscape-dense-maas-bundle (The current bundle uses the "legacy" postgresql charm, so it works, but it will eventually move away from the legacy one.)

Steps to reproduce

  1. prepare a MAAS provider for Juju
  2. deploy the postgresql charm to an LXD container on top of a bare-metal machine:

     $ juju deploy postgresql --base ubuntu@22.04 --to lxd

Expected behavior

The deployment of the charm goes green (active/idle).

Actual behavior

It errors out with: cannot assign unit "postgresql/0" to machine 0/lxd/0: adding storage to lxd container not supported

$ juju status
Model                   Controller       Cloud/Region  Version  SLA          Timestamp
maas-juju-lxd-postgres  maas-controller  maas/default  3.5.2    unsupported  13:54:53Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql           error     0/1  postgresql  14/stable  429  no       cannot assign unit "postgresql/0" to machine 0/lxd/0: adding storage to lxd container not supported

Unit          Workload  Agent  Machine  Public address  Ports  Message
postgresql/0  error     lost                                   cannot assign unit "postgresql/0" to machine 0/lxd/0: adding storage to lxd container not supported

Versions

Operating system: 22.04 LTS

Juju CLI: 3.5.2-genericlinux-amd64

Juju agent: 3.5.2

Charm revision: 14/stable 429

LXD: 5.0.3

Log output

Juju debug log:

machine-0: 13:54:35 INFO juju.apiserver.charmdownloader downloading charm "ch:amd64/jammy/postgresql-429"
machine-0: 13:54:35 INFO juju.state new machine "0" has preferred addresses: private "", public ""
machine-0: 13:54:35 INFO juju.state new machine "0/lxd/0" has preferred addresses: private "", public ""
machine-0: 13:54:37 WARNING juju.apiserver.provisioner failed to save published image metadata: missing region: metadata for image  not valid
machine-0: 13:54:37 INFO juju.apiserver.common setting password for "machine-0"
machine-0: 13:54:38 INFO juju.cloudconfig Fetching agent: curl -sSf --connect-timeout 20 --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://192.168.151.101:17070/model/898084f1-ae97-4a6d-801a-b0183893494e/tools/3.5.2-ubuntu-amd64]>
machine-0: 13:54:47 INFO juju.state machine "0" preferred private address changed from "" to "local-cloud:192.168.151.102@space:1"
machine-0: 13:54:47 INFO juju.state machine "0" preferred public address changed from "" to "local-cloud:192.168.151.102@space:1"
machine-0: 13:55:06 INFO juju.worker.leaseexpiry expired 1 leases
unit-controller-0: 13:55:27 DEBUG unit.controller/0.juju-log ops 2.14.0 up and running.
unit-controller-0: 13:55:27 DEBUG unit.controller/0.juju-log Emitting Juju event update_status.
unit-controller-0: 13:55:28 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
machine-0: 13:59:02 INFO juju.apiserver.connection agent login: machine-0 for 898084f1-ae97-4a6d-801a-b0183893494e
machine-0: 13:59:02 INFO juju.apiserver.common setting password for "machine-0"
machine-0: 13:59:02 INFO juju.apiserver.connection agent disconnected: machine-0 for 898084f1-ae97-4a6d-801a-b0183893494e
machine-0: 13:59:02 INFO juju.apiserver.connection agent login: machine-0 for 898084f1-ae97-4a6d-801a-b0183893494e
machine-0: 13:59:02 INFO juju.apiserver.common setting password for "machine-0"
machine-0: 13:59:03 WARNING juju.apiserver.instancemutater unit postgresql/0 has no machine id, start watching when machine id assigned.
machine-0: 13:59:03 INFO juju.apiserver.common.networkingcommon machine "0": adding new device "lo" () with addresses [127.0.0.1/8 ::1/128]
machine-0: 13:59:03 INFO juju.apiserver.common.networkingcommon machine "0": adding new device "enp1s0" (52:54:00:1f:e2:16) with addresses [192.168.151.102/24]
machine-0: 13:59:03 INFO juju.apiserver.common.networkingcommon machine "0": adding new device "enp2s0" (52:54:00:86:fb:26) with addresses []
machine-0: 13:59:18 INFO juju.apiserver.common setting password for "machine-0-lxd-0"
machine-0: 13:59:18 INFO juju.network.containerizer device "enp2s0" has no addresses, ignoring
machine-0: 13:59:28 INFO juju.apiserver.common.networkingcommon machine "0": adding new device "br-enp1s0" (2e:ca:2a:24:d8:38) with addresses [192.168.151.102/24]
machine-0: 13:59:29 INFO juju.network.containerizer device "enp2s0" has no addresses, ignoring
machine-0: 14:00:45 INFO juju.state LogTailer starting oplog tailing: recent id count=10, lastTime=2024-07-25 13:59:30.558052569 +0000 UTC, minOplogTs=2024-07-25 13:58:30.558052569 +0000 UTC
unit-controller-0: 14:01:16 DEBUG unit.controller/0.juju-log ops 2.14.0 up and running.
unit-controller-0: 14:01:16 DEBUG unit.controller/0.juju-log Emitting Juju event update_status.
unit-controller-0: 14:01:16 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
machine-0: 14:02:08 INFO juju.state LogTailer starting oplog tailing: recent id count=1053, lastTime=2024-07-25 14:01:16.689594064 +0000 UTC, minOplogTs=2024-07-25 14:00:16.689594064 +0000 UTC

Additional context

github-actions[bot] commented 1 month ago

https://warthogs.atlassian.net/browse/DPE-4978

dragomirp commented 1 month ago

Hi, @nobuto-m, this is the expected Juju behaviour. See https://github.com/canonical/postgresql-operator/issues/354 and https://bugs.launchpad.net/juju/+bug/2060098 for more details.

nobuto-m commented 1 month ago

I get that Juju doesn't support Juju storage in the MAAS/LXD scenario. The point is that introducing Juju storage was a conscious decision by the charm, and it broke a real-world use case.

jnsgruk commented 1 month ago

In general I'd argue that using Juju Storage is absolutely the right decision here, though I'm a little surprised Juju doesn't support LXD containers. I'll try to dig in a bit with the Juju team and understand.

jnsgruk commented 1 month ago

(One thing you could try in the meantime is using the virt-type constraint to deploy an LXD VM instead? I don't know off the top of my head whether it'll work like that, but it might be worth exploring.)
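For anyone wanting to try that suggestion, the command shape would be roughly the following. This is a sketch only: whether the MAAS provider honours virt-type this way is exactly the open question above, and the `run` helper prints the commands instead of executing them, so nothing is deployed here.

```shell
# Print-only helper; drop the "run" prefix once you have a
# bootstrapped MAAS controller to try this against.
run() { echo "$ $*"; }

# Instead of a LXD container (--to lxd), ask for a LXD virtual machine
# via the virt-type constraint (assumption: the provider supports it):
run juju deploy postgresql --base ubuntu@22.04 \
    --constraints virt-type=virtual-machine
```

If the VM route works, the charm's pgdata storage request should then be satisfiable the same way it is on a bare-metal machine.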

jameinel commented 1 month ago

LXD and storage attributes of a charm should be supported (even when that LXD container is on MAAS), though it is possible that we got something mixed up.

Certainly it does work directly with LXD:

$ juju status
Model       Controller  Cloud/Region         Version  SLA          Timestamp
controller  lxd35       localhost/localhost  3.5.2    unsupported  16:04:27-04:00

App         Version  Status  Scale  Charm            Channel     Rev  Exposed  Message
controller           active      1  juju-controller  3.5/stable  105  no
postgresql  14.11    active      1  postgresql       14/stable   429  no

Unit           Workload  Agent  Machine  Public address  Ports     Message
controller/0*  active    idle   0        10.139.162.41
postgresql/0*  active    idle   1        10.139.162.212  5432/tcp  Primary

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.139.162.41   juju-7f8bd3-0  ubuntu@22.04      Running
1        started  10.139.162.212  juju-7f8bd3-1  ubuntu@22.04      Running

$ juju storage
Unit          Storage ID  Type        Pool    Size    Status    Message
postgresql/0  pgdata/0    filesystem  rootfs  48 GiB  attached

You can see that we are aware of PostgreSQL's storage request, but we are fulfilling that request from the "rootfs" pool, which just means using the root disk rather than mounting additional storage.

What Juju doesn't support is pass-through: provisioning storage as, say, an AWS EBS volume and getting it mounted into the container, or mounting host devices into the container.
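To make the distinction concrete, the two request shapes look roughly like this. A sketch only: the pool names are illustrative, and the helper prints the commands rather than running them.

```shell
# Print-only helper so the commands can be read without a controller.
run() { echo "$ $*"; }

# Supported: satisfy the charm's pgdata request from the root disk,
# which is what Juju does implicitly in the working LXD example above.
run juju deploy postgresql --storage pgdata=rootfs

# Not supported (pass-through): asking for a provider-backed pool such
# as AWS "ebs" and expecting it mounted inside a LXD container.
run juju deploy postgresql --to lxd:0 --storage pgdata=ebs,8G
```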

It is plausible that Juju broke something w.r.t. storage provisioning. I can see the same behavior when trying to reproduce on AWS.

$ juju status
Model    Controller  Cloud/Region   Version  SLA          Timestamp
pg-test  jam-aws     aws/us-east-1  3.5.2    unsupported  16:21:18-04:00

App         Version  Status   Scale  Charm       Channel    Rev  Exposed  Message
pg2                  error      0/1  postgresql  14/stable  429  no       cannot assign unit "pg2/0" to machine 1/lxd/0: adding storage to lxd container not supported

Juju should be fine to use rootfs storage on AWS for an LXD container (and same for MAAS). I'm trying to dig a bit more and see where we might have gone wrong.

jameinel commented 1 month ago
$ juju status
Model    Controller  Cloud/Region   Version  SLA          Timestamp
pg-test  jam-aws     aws/us-east-1  3.5.2    unsupported  16:35:56-04:00

App  Version  Status   Scale  Charm       Channel    Rev  Exposed  Message
pg1           waiting      1  postgresql  14/stable  429  no       waiting to start PostgreSQL

Unit    Workload  Agent      Machine  Public address  Ports  Message
pg1/0*  waiting   executing  0        54.144.107.165         (leader-elected) waiting to start PostgreSQL

Machine  State    Address         Inst id              Base          AZ          Message
0        started  54.144.107.165  i-0934acabe2f83b8ed  ubuntu@22.04  us-east-1c  running

$ juju storage
Unit   Storage ID  Type        Pool    Size     Status    Message
pg1/0  pgdata/0    filesystem  rootfs  7.6 GiB  attached

Which does, indeed, use filesystem storage if I deploy directly to a host and haven't configured storage.

However, it does fail immediately with a container:

$ juju add-unit pg1 --to lxd:0
ERROR acquiring machine to host unit "pg1/1": cannot assign unit "pg1/1" to machine 0/lxd/0: adding storage to lxd container not supported (not supported)

(But it absolutely worked with exactly that storage definition when deploying on the LXD provider.)
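A quick way to inspect what a given model can satisfy storage requests from is the pools listing; on the LXD provider it includes rootfs, which is consistent with the same definition deploying cleanly there. Sketch with the same print-only helper, since these need a live controller to run for real:

```shell
# Print-only helper; remove the "run" prefix on a real controller.
run() { echo "$ $*"; }

# List the storage pools the current model knows about:
run juju storage-pools

# Confirm what was actually attached for the deployed units:
run juju storage
```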

jameinel commented 1 month ago

However, I did go back to a rather old juju (2.9) and it operates exactly the same way:

$ juju status
Model    Controller  Cloud/Region   Version  SLA          Timestamp
default  jam-aws     aws/us-east-1  2.9.50   unsupported  17:27:54-04:00

App  Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
pg1           active    1/2  postgresql  14/stable  429  no

Unit    Workload  Agent       Machine  Public address  Ports     Message
pg1/0*  active    idle        0        44.222.140.206  5432/tcp  Primary
pg1/1   waiting   allocating                                     waiting for machine

Machine  State    Address         Inst id              Series  AZ          Message
0        started  44.222.140.206  i-0cd17af46e0376b0e  jammy   us-east-1a  running
0/lxd/0  pending                  pending              jammy

jameinel@jammy:~
$ juju add-unit --to lxd:0 pg1
ERROR acquiring machine to host unit "pg1/2": cannot assign unit "pg1/2" to machine 0/lxd/1: adding storage to lxd container not supported (not supported)

As mentioned, it should be possible to support rootfs storage for applications deployed to containers within another provider, but it looks like we never implemented that support.

nobuto-m commented 1 month ago

To be clear, I'm not questioning how useful Juju storage is. It's just a known issue that Juju storage support is missing for the MAAS/LXD scenario, so employing Juju storage in a charm is known to be broken for this scenario in the machine-charm world (k8s charms are straightforward, obviously).

My request is either:

I don't have a clear idea of how much effort is required for each action, so I'm leaving this here and waiting for a plan from the engineering teams.

jnsgruk commented 1 month ago

@taurus-forever I'd be interested to see if the storage limit solves the problem here, at least temporarily.

@wallyworld has taken a look at implementing this and it doesn't seem like a huge amount of work, and it could possibly land in 3.6 (even if not in 3.6.0). But let's see if the simple workaround above works before we shuffle Juju's cards around too much.

taurus-forever commented 1 month ago

@jnsgruk , quickly tested --to lxd:

juju status:

ubuntu@juju350:~$ juju status 
Model   Controller  Cloud/Region         Version  SLA          Timestamp
pg2404  lxd         localhost/localhost  3.5.2    unsupported  11:57:18+02:00

App           Version  Status   Scale  Charm       Channel  Rev  Exposed  Message
psql-edge              error      0/1  postgresql  14/edge  444  no       cannot assign unit "psql-edge/0" to machine 4/lxd/0: adding storage to lxd container not supported
psql-limit1            error      0/1  postgresql             1  no       cannot assign unit "psql-limit1/0" to machine 6/lxd/0: adding storage to lxd container not supported
psql-limit01           waiting    0/1  postgresql             0  no       waiting for machine

Unit            Workload  Agent       Machine  Public address  Ports  Message
psql-edge/0     error     lost                                        cannot assign unit "psql-edge/0" to machine 4/lxd/0: adding storage to lxd container not supported
psql-limit1/0   error     lost                                        cannot assign unit "psql-limit1/0" to machine 6/lxd/0: adding storage to lxd container not supported
psql-limit01/0  waiting   allocating  5/lxd/0                         waiting for machine

Machine  State    Address         Inst id              Base          AZ  Message
4        started  10.142.152.170  juju-247c76-4        ubuntu@22.04      Running
4/lxd/0  pending                  juju-247c76-4-lxd-0  ubuntu@22.04      Container started
5        started  10.142.152.79   juju-247c76-5        ubuntu@22.04      Running
5/lxd/0  pending                  juju-247c76-5-lxd-0  ubuntu@22.04      Container started
6        started  10.142.152.156  juju-247c76-6        ubuntu@22.04      Running
6/lxd/0  pending                  juju-247c76-6-lxd-0  ubuntu@22.04      Container started

The limit: 0-1 waiting freeze is constantly reproducible on my side. There is no useful information in debug-log. We are happy to test any other possible workarounds here.

dragomirp commented 1 month ago

IIRC tweaking the storage directives also causes issues with refreshing. If we change the storage definition, we should double-check that upgrades still work.

taurus-forever commented 1 month ago

@dragomirp is referring mainly to https://bugs.launchpad.net/juju/+bug/1995074: we are currently removing the description field to be able to refresh the charm (see https://github.com/canonical/postgresql-k8s-operator/pull/218).

nobuto-m commented 1 month ago

To clarify, this is going to be an important issue eventually, but it is not blocking any field engagement as far as I'm concerned. Other issues can be prioritized if any of them are blocking.

jnsgruk commented 1 month ago

Thanks @nobuto-m - I think @wallyworld has this on the backlog, with a chance it could land in 3.6 so it's ready for the 3.x LTS.

wallyworld commented 1 month ago

Just to chime in: it's a 9-year-old TODO from the initial implementation of storage support and how storage works with containers. We can look at a fix for 3.6.

taurus-forever commented 1 month ago

We will keep this ticket open for a while to re-test the 14/stable charm once Juju supports nested LXD. The Juju work will happen in https://bugs.launchpad.net/juju/+bug/2060098

wallyworld commented 1 month ago

This is the Juju fix: https://github.com/juju/juju/pull/17830. Note: I've raised a new bug, https://bugs.launchpad.net/juju/+bug/2074379, just for this specific fix. The other bug is for cloud-provided storage like EBS volumes, which is a much bigger scope.

taurus-forever commented 1 month ago

@wallyworld what is the easiest way to test https://github.com/juju/juju/pull/17830? Wait for Juju 3.6-beta2?

taurus-forever commented 4 weeks ago

Hi @wallyworld ,

I have tried to confirm the fix on Juju 3.6-beta2 (from 3.6/beta) and 3.6-beta3.1 (from 3.6/edge), but neither is working for me. Was the fix included somewhere?

STR:

juju deploy postgresql --base ubuntu@22.04 --to lxd

Juju status:

ubuntu@juju360:~$ juju status -m test2
Model  Controller  Cloud/Region         Version      SLA          Timestamp
test2  test        localhost/localhost  3.6-beta3.1  unsupported  09:37:40+02:00

App         Version  Status   Scale  Charm       Channel    Rev  Exposed  Message
postgresql           waiting    0/1  postgresql  14/stable  429  no       waiting for machine

Unit          Workload  Agent       Machine  Public address  Ports  Message
postgresql/0  waiting   allocating  0/lxd/0                         waiting for machine

Machine  State    Address         Inst id              Base          AZ  Message
0        started  10.189.210.214  juju-d43457-0        ubuntu@22.04      Running
0/lxd/0  pending                  juju-d43457-0-lxd-0  ubuntu@22.04      Container started
ubuntu@juju360:~$ 

Debug-log:

machine-0: 23:11:14 INFO juju.worker.authenticationworker "machine-0" key updater worker started
machine-0: 23:11:14 INFO juju.worker.machiner "machine-0" started
machine-0: 23:11:17 INFO juju.worker.kvmprovisioner machine-0 does not support kvm container
machine-0: 23:11:17 INFO juju.packaging.manager Running: snap info lxd
machine-0: 23:11:18 INFO juju.container.lxd LXD snap is already installed (channel: 5.0/stable/ubuntu-22.04); skipping package installation
machine-0: 23:11:23 INFO juju.container.lxd Availability zone will be empty for this container manager
machine-0: 23:11:23 INFO juju.worker.lxdprovisioner entering provisioner task loop; using provisioner pool with 4 workers
machine-0: 23:11:23 INFO juju.worker.lxdprovisioner found machine pending provisioning id:0/lxd/0, details:0/lxd/0
machine-0: 23:11:23 WARNING juju.container.broker no name servers supplied by provider, using host's name servers.
machine-0: 23:11:23 WARNING juju.container.broker no search domains supplied by provider, using host's search domains.
machine-0: 23:11:23 WARNING juju.container.broker incomplete DNS config found, discovering host's DNS config
machine-0: 23:17:56 INFO juju.cloudconfig Fetching agent: curl -sSf --connect-timeout 20 --noproxy "*" --insecure -o $bin/tools.tar.gz <[https://10.189.210.169:17070/model/00974701-bde1-4940-8cd4-f94546d43457/tools/3.6-beta3.1-ubuntu-amd64]>
machine-0: 23:17:56 INFO juju.container.lxd starting new container "juju-d43457-0-lxd-0" (image "ubuntu-22.04-server-cloudimg-amd64-lxd.tar.xz")
machine-0: 23:18:00 INFO juju.worker.lxdprovisioner started machine 0/lxd/0 as instance juju-d43457-0-lxd-0 with hardware "arch=amd64", network config [], volumes [], volume attachments map[], subnets to zones [], lxd profiles []
machine-0: 23:18:00 INFO juju.worker.instancemutater.container no changes necessary to machine-0/lxd/0 lxd profiles ([default])
controller-0: 23:25:18 INFO juju.worker.instancepoller machine "0" (instance ID "juju-d43457-0") has new addresses: [local-cloud:10.189.210.214@alpha local-cloud:10.218.61.1@alpha local-cloud:fd42:e252:fc0c:db2d::1@alpha]

Thanks!

nobuto-m commented 4 weeks ago

Model  Controller  Cloud/Region         Version      SLA          Timestamp
test2  test        localhost/localhost  3.6-beta3.1  unsupported  09:37:40+02:00

Is it a MAAS provider actually? It looks like localhost LXD and if that's the case isn't it a different issue?

taurus-forever commented 4 weeks ago

Is it a MAAS provider actually? It looks like localhost LXD and if that's the case isn't it a different issue?

It was a quick test without MAAS; I will repeat it on MAAS. Thanks for pointing that out!