Closed martinmo closed 5 months ago
More context: the cell db pods run a MariaDB Galera cluster (using the bitnami/mariadb-galera:11.0.4-debian-11-r6 image), as recommended by the Large Scale SIG. A quick glance at the logs of one of the pods (`kubectl logs nova-cell0-x86fc-db-0 -n yaook`) surfaces a lot of WARNING messages of the following form, appearing every few seconds:
...
2024-04-09 9:22:08 61002 [Warning] Aborted connection ***** to db: 'unconnected' user: 'unauthenticated' host: '********' (This connection closed normally without authentication)
...
If these logs are also saved to the PV and log rotation doesn't work (or isn't quick enough), this might already be the culprit.
The logs I mentioned in the previous comment do not appear to be the problem and are just noise caused by probes from haproxy (see the upstream issue; also, the warnings say "This connection closed normally without authentication"). Furthermore, I couldn't find any indication that these log messages are filling up the backing PVC for the pods, so this hypothesis appears to be a dead end.
A more interesting finding is the continuously growing binlog (`/bitnami/mariadb/data/mysql-bin.*`). It appears to grow by more than 150 MB per day. I couldn't find anything particularly odd or broken that would explain this. What I checked so far:
- `SHOW STATUS LIKE 'wsrep_%'`: `wsrep_local_state_comment` is `Synced`, `wsrep_evs_state` is `OPERATIONAL`, and `wsrep_cluster_status` is `Primary`
- `yaookctl status infra` shows all ready
- the sizes of the cell0/cell1 dbs: `SELECT table_schema, SUM(data_length + index_length) AS size FROM information_schema.tables GROUP BY table_schema ORDER BY size`

So either a) the binlog is growing on its own, even if the data doesn't change, or b) the db size just doesn't appear to change, for example if rows are inserted and removed immediately. I am not sure if (a) is normal for a MariaDB Galera cluster. I will check (b) with the general query log.
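To put a number on hypothesis (a), the binlog growth rate can be estimated from two size samples taken some hours apart. A minimal sketch (the sample values and timestamps below are illustrative, not from this cluster; sizes could come from `SHOW BINARY LOGS` or from summing the `mysql-bin.*` files on the PV):

```python
from datetime import datetime

def total_binlog_bytes(show_binary_logs_rows):
    """Sum the File_size column of `SHOW BINARY LOGS`-style (name, size) rows."""
    return sum(size for _name, size in show_binary_logs_rows)

def growth_mb_per_day(t0, bytes0, t1, bytes1):
    """Linear growth estimate in MB/day between two samples."""
    days = (t1 - t0).total_seconds() / 86400
    return (bytes1 - bytes0) / (1024 * 1024) / days

# Hypothetical samples 12 hours apart
rows_morning = [("mysql-bin.000041", 200_000_000), ("mysql-bin.000042", 50_000_000)]
rows_evening = [("mysql-bin.000041", 200_000_000), ("mysql-bin.000042", 130_000_000)]
rate = growth_mb_per_day(
    datetime(2024, 4, 9, 8, 0), total_binlog_bytes(rows_morning),
    datetime(2024, 4, 9, 20, 0), total_binlog_bytes(rows_evening),
)
print(round(rate, 1))  # 152.6 (MB/day), in line with the observed >150 MB/day
```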
Independent of the cause of the binlog growth (which I want to investigate further), I noticed that the rotation/automatic purge configuration can certainly be improved to mitigate the volume overflow issue. The following values are too high:
- `binlog_expire_logs_seconds` is 1209600 seconds (14 days)
- `max_binlog_size` is 1073741824 bytes (1 GB)

Furthermore, against the recommendation in the manual, `wsrep_mode` is set to `REPLICATE_MYISAM` (it is experimental). This isn't needed because all tables use InnoDB, which is also the default storage engine. But I don't think that's the cause.
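For illustration, a tightened retention could look like this in the server configuration (the concrete values here are a suggestion, not what was deployed):

```ini
[mysqld]
# keep binlogs for 2 days instead of 14
binlog_expire_logs_seconds = 172800
# rotate earlier so individual binlog files stay small (100 MB instead of 1 GB)
max_binlog_size = 104857600
# and do not set the experimental wsrep_mode = REPLICATE_MYISAM,
# since all tables use InnoDB anyway
```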
In the meantime and as a workaround, we reset the logs and increased the size of the PVCs to 50 GiB to have more time to investigate.
With the general query log enabled (`SET GLOBAL general_log=1`), I see a lot of updates to the `services` table:
$ tail -f /bitnami/mariadb/data/nova-cell1-62vdr-db-0.log | grep -E '(UPDATE|INSERT)'
...
12050 Query UPDATE services SET updated_at='2024-04-10 07:36:53.817869', report_count=17510, last_seen_up='2024-04-10 07:36:53.817216' WHERE services.id = 100
12051 Query UPDATE services SET updated_at='2024-04-10 07:36:53.842179', report_count=17509 WHERE services.id = 100
11091 Query UPDATE services SET updated_at='2024-04-10 07:36:54.018981', report_count=17511, last_seen_up='2024-04-10 07:36:54.018351' WHERE services.id = 88
12135 Query UPDATE services SET updated_at='2024-04-10 07:36:55.126298', report_count=17510, last_seen_up='2024-04-10 07:36:55.125728' WHERE services.id = 100
11737 Query UPDATE services SET updated_at='2024-04-10 07:36:55.174976', report_count=17509 WHERE services.id = 100
...
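To see which service rows dominate, the general log excerpt can be tallied per `services.id`. A small sketch (the sample lines are shortened versions of the log output above):

```python
import re
from collections import Counter

# Count UPDATEs to the services table per row id in a MariaDB
# general query log excerpt.
UPDATE_RE = re.compile(r"UPDATE services SET .* WHERE services\.id = (\d+)")

def updates_per_service(log_lines):
    counts = Counter()
    for line in log_lines:
        m = UPDATE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

excerpt = [
    "12050 Query UPDATE services SET updated_at='2024-04-10 07:36:53.817869', report_count=17510 WHERE services.id = 100",
    "12051 Query UPDATE services SET updated_at='2024-04-10 07:36:53.842179', report_count=17509 WHERE services.id = 100",
    "11091 Query UPDATE services SET updated_at='2024-04-10 07:36:54.018981', report_count=17511 WHERE services.id = 88",
]
print(updates_per_service(excerpt))  # Counter({100: 2, 88: 1})
```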
I am not sure if this is normal, but at least I know this is the cause of the growing binlog. I'll approach one of the Yaook developers in my team about the idea of reducing the `binlog_expire_logs_seconds` and `max_binlog_size` parameters as described above.
UPDATE:
The services referenced by the ids are nova-conductor-2 and nova-scheduler-0. I checked their logs but found nothing really suspicious, except for a warning:
2024-04-08 07:03:21 WARNING nova.scheduler.manager This periodic task should only be enabled on a single scheduler to prevent collisions between multiple schedulers: Host 'cmp-nihe-3136' mapping already exists
I increased the grep context and it appears there are in fact competing updates to the db, which may lead to rollbacks and all these repeated attempts. Example for id=100 (nova-scheduler-0); the numbers 55414 and 56315 are the connection ids:
$ tail -f /bitnami/mariadb/data/nova-cell1-62vdr-db-0.log | grep -C3 -E '(UPDATE|INSERT)'
...
55414 Query SELECT 1
56315 Query UPDATE services SET updated_at='2024-04-10 09:45:49.381453', report_count=18286, last_seen_up='2024-04-10 09:45:49.380873' WHERE services.id = 100
55414 Query SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.uuid AS services_uuid, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason, services.last_seen_up AS services_last_seen_up, services.forced_down AS services_forced_down, services.version AS services_version
FROM services
WHERE services.deleted = 0 AND services.id = 100
LIMIT 1
56315 Query COMMIT
55414 Query UPDATE services SET updated_at='2024-04-10 09:45:49.386484', report_count=18282 WHERE services.id = 100
56315 Query ROLLBACK
55414 Query COMMIT
55414 Query ROLLBACK
...
(However, the ROLLBACK always comes after the COMMIT, which is IMHO pointless, but this could just be the way SQLAlchemy handles it. There is no BEGIN or START TRANSACTION, so I assume everything happens in autocommit mode.)
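The odd `report_count` sequence above (17510, then 17509) looks like a classic read-modify-write race: two reporters read the same stale counter and each writes back its own increment. A minimal single-connection sqlite sketch of that lost-update pattern (nova actually talks to MySQL/Galera via SQLAlchemy, so this only illustrates the race, not the real stack):

```python
import sqlite3

# isolation_level=None puts sqlite3 into autocommit mode, mirroring the
# absence of BEGIN/START TRANSACTION in the general log above.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE services (id INTEGER PRIMARY KEY, report_count INTEGER)")
db.execute("INSERT INTO services VALUES (100, 17509)")

def read_count(conn):
    return conn.execute("SELECT report_count FROM services WHERE id = 100").fetchone()[0]

# Both "workers" read the same stale value ...
seen_a = read_count(db)
seen_b = read_count(db)
# ... then each writes its own increment; the second write clobbers the first.
db.execute("UPDATE services SET report_count = ? WHERE id = 100", (seen_a + 1,))
db.execute("UPDATE services SET report_count = ? WHERE id = 100", (seen_b + 1,))
print(read_count(db))  # 17510, although two reports happened
```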
Status update:

- I set `expire_logs_days = 2` (this is syntactic sugar for `binlog_expire_logs_seconds`, see https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#expire_logs_days). This fixes the symptom of the overflowing PVCs.
- I also looked into why the competing `last_seen_up` updates might happen.
- A periodic report is performed (controlled by `report_interval`) for each service (e.g., nova-conductor-0), as part of a dynamic timer implemented through the `threadgroup.ThreadGroup` class in the `oslo_service` package.
- The number of workers for nova-conductor and the nova-schedulers defaults to the number of CPU cores, so there's a higher chance that multiple threads pick up the same service id for a dynamic timer (which I believe is the culprit).

Further stabilization was implemented by explicitly setting the nova settings `conductor.workers` and `scheduler.workers` to 8 each. I arrived at this value after interviewing other OpenStack experts and researching how projects such as Kolla Ansible (they max out at 5 workers) configure these settings.
This appears to be a "breakthrough" change: now I see the expected frequency of updates to the `services` table when checking with the MariaDB general log. I'll leave this issue open for a few more days to observe the behavior, but from my perspective this should be solved.
FTR, while researching this, I also stumbled over https://github.com/openstack/puppet-openstacklib/blob/master/lib/facter/os_workers.rb (via https://docs.openstack.org/performance-docs/latest/test_plans/openstack_api_workers/plan.html) which has interesting formulas to calculate the worker numbers, also maxing out at a quite low value: 12.
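The formulas in os_workers.rb boil down to "scale with CPU count, but keep a floor and a ceiling". A sketch in that spirit (the divisor and floor here are assumptions for illustration; the cap of 12 is the value mentioned above):

```python
# Capped worker-count heuristic in the spirit of puppet-openstacklib's
# os_workers fact: grow with the CPU count, but never below a floor
# and never above a ceiling.
def capped_workers(cpu_count, floor=2, divisor=4, cap=12):
    return min(max(cpu_count // divisor, floor), cap)

for cpus in (4, 8, 32, 128):
    print(cpus, capped_workers(cpus))  # 4->2, 8->2, 32->8, 128->12
```

The point is the cap: on a 128-core bare-metal host, "one worker per core" yields 128 competing workers, while the capped heuristic stays in the single or low double digits.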
I filed an upstream issue wrt improving the nova default settings: https://bugs.launchpad.net/nova/+bug/2063451.
Everything is quiet now, PVCs are happy – closing.
In #544 we provisioned a productive SCS cluster, compatible with the IaaS v4 certificate, as a PoC for FITKO. We installed this cluster with Yaook on bare metal. After a while, we ran into some stability problems: there is no workload yet, but the PVCs of the cell1 database pods overflow after some time (~1 week) for a yet unknown reason. This puts the cluster into a failure state and we can't create VMs anymore.