[Other] Investigate why nova cell1 database in bare metal FITKO SCS cluster is overflowing

martinmo commented 5 months ago

In #544 we provisioned a productive SCS cluster which is compatible with the IaaS v4 certificate as a PoC for FITKO. We installed this cluster with Yaook on bare metal. After a while, we ran into some stabilization problems: there is no workload yet but the PVCs of the cell1 database pods are overflowing after some time (~ 1 week) for a yet unknown reason. This leads the cluster into a failure state and we can't create VMs anymore.

martinmo commented 5 months ago

More context: the cell db pods run a MariaDB Galera cluster (using the bitnami/mariadb-galera:11.0.4-debian-11-r6 image) as recommended by the Large Scale SIG. A quick glance at the logs of one of the pods (kubectl logs nova-cell0-x86fc-db-0 -n yaook) surfaces a lot of WARNING messages of the form, appearing every few seconds:

...
2024-04-09  9:22:08 61002 [Warning] Aborted connection ***** to db: 'unconnected' user: 'unauthenticated' host: '********' (This connection closed normally without authentication)
...

If these logs are also saved to the PV and log rotation doesn't work (or isn't quick enough), this might already be the culprit.

martinmo commented 5 months ago

The logs I mentioned in the previous comment do not appear to be the problem and are just noise caused by probes from haproxy (see upstream issue – also, the warnings say "This connection closed normally without authentication"). Furthermore I couldn't find any indication that these log messages are filling up the backing PVC for the pods – this hypothesis appears to be a dead end.

A more interesting finding is the continuously growing binlog (/bitnami/mariadb/data/mysql-bin.*). It appears to grow more than 150 MB per day. I couldn't find anything particularly odd or non-working which can explain this. What I checked so far:

Replication status is OK for all nodes (SHOW STATUS LIKE 'wsrep_%)
- wsrep_local_state_comment is Synced
- wsrep_evs_state is OPERATIONAL
- wsrep_cluster_status is Primary
yaookctl status infra shows all ready for the cell0/cell1 dbs
The db sizes are not changing (SELECT table_schema, SUM(data_length + index_length) as size FROM information_schema.tables GROUP BY table_schema ORDER BY size)

So either a) the binlog is growing on its own, even if the data doesn't change, or b) the db size just doesn't appear to change, for example if rows are inserted and removed immediately. I am not sure if (a) is normal for a MariaDB galera cluster. I will check (b) with the general query log.

Independent of the cause of the binlog growth (which I want to investigate further), I noticed that the rotation/automatic purge configuration can certainly be improved to mitigate the volume overflow issue. The following values are too high:

binlog_expire_logs_seconds is 1209600 seconds (14 days)
max_binlog_size is 1073741824 bytes (1 GB)

Furthermore, against recommendation in the manual wsrep_mode is set to REPLICATE_MYISAM (it is experimental). This isn't needed because all tables use InnoDB and this engine is also the default storage engine. But I don't think that's the cause.

In the meantime and as a workaround, we reset the logs and increased the size of the PVCs to 50 GiB to have more time to investigate.

martinmo commented 5 months ago

With the general query log enabled (SET GLOBAL general_log=1), I see a lot of updates to the services table:

$ tail -f /bitnami/mariadb/data/nova-cell1-62vdr-db-0.log | grep -E '(UPDATE|INSERT)'
...
12050 Query    UPDATE services SET updated_at='2024-04-10 07:36:53.817869', report_count=17510, last_seen_up='2024-04-10 07:36:53.817216' WHERE services.id = 100
12051 Query    UPDATE services SET updated_at='2024-04-10 07:36:53.842179', report_count=17509 WHERE services.id = 100
11091 Query    UPDATE services SET updated_at='2024-04-10 07:36:54.018981', report_count=17511, last_seen_up='2024-04-10 07:36:54.018351' WHERE services.id = 88
12135 Query    UPDATE services SET updated_at='2024-04-10 07:36:55.126298', report_count=17510, last_seen_up='2024-04-10 07:36:55.125728' WHERE services.id = 100
11737 Query    UPDATE services SET updated_at='2024-04-10 07:36:55.174976', report_count=17509 WHERE services.id = 100
...

I am not sure if this is normal, but at least I know this is the cause for the growing binlog. I'll approach one of the Yaook developers in my team about the idea of reducing the binlog_expire_logs_seconds and max_binlog_size parameters as described above.

UPDATE:

The services referenced by the ids are nova-conductor-2 and nova-scheduler-0. I checked the logs but found nothing really suspicious, except for a warning

2024-04-08 07:03:21 WARNING nova.scheduler.manager This periodic task should only be enabled on a single scheduler to prevent collisions between multiple schedulers: Host 'cmp-nihe-3136' mapping already exists

I increased the grep context and it appears there are in fact competing updates to the db which may lead to rollbacks and all these repeated attempts. Example for id=100 (nova-scheduler-0), numbers 55414 and 56315 are the connection ids:

$ tail -f /bitnami/mariadb/data/nova-cell1-62vdr-db-0.log | grep -C3 -E '(UPDATE|INSERT)'
...
55414 Query    SELECT 1
56315 Query    UPDATE services SET updated_at='2024-04-10 09:45:49.381453', report_count=18286, last_seen_up='2024-04-10 09:45:49.380873' WHERE services.id = 100
55414 Query    SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.uuid AS services_uuid, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason, services.last_seen_up AS services_last_seen_up, services.forced_down AS services_forced_down, services.version AS services_version 
FROM services 
WHERE services.deleted = 0 AND services.id = 100 
LIMIT 1
56315 Query    COMMIT
55414 Query    UPDATE services SET updated_at='2024-04-10 09:45:49.386484', report_count=18282 WHERE services.id = 100
56315 Query    ROLLBACK
55414 Query    COMMIT
55414 Query    ROLLBACK
...

(However, the ROLLBACK always comes after the COMMIT, which is IMHO pointless, but this could just be the way SQLAlchemy handles it. There is no BEGIN or START TRANSACTION so I assume everything happens in autocommit mode.)

martinmo commented 5 months ago

Status update:

The cluster has been stabilized by setting expire_logs_days = 2 (this is syntactic sugar for binlog_expire_logs_seconds, see https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#expire_logs_days). This fixes the symptom of the overflowing PVCs.
In an effort to find the root cause, I reviewed the OpenStack Nova sources to find hints why all these concurrent (and suspicious) last_seen_up updates might happen.
- The desired behavior is an update every 10s (default report_interval) for each service (e.g., nova-conductor-0), as part of a dynamic timer implemented through threadgroup.ThreadGroup class in the oslo_service package.
- During the hackathon in Arnsberg I also got some hints on how to debug this further, this is still ongoing in a devstack setup.
This doesn't seem to be specific to Yaook setups and the symptoms are "heavier" if the OpenStack control plane has a lot of CPU cores. This has to do with the fact the the number of workers/greenlets in nova-conductor and nova-schedulers defaults to the number of CPU cores; and there's a higher chance that multiple threads pick up the same service id for a dynamic timer (which I believe is the culprit).
Furthermore, in my understanding, such a PVC overflow can also happen with plain MariaDB setups (w/o Galera), because in the binlog can be active as well and might grow if not rotated properly.

martinmo commented 5 months ago

Further stabilization was implemented by explicitly setting the nova settings conductor.workers and scheduler.workers to 8 each. I arrived at this value after interviewing other OpenStack experts and researching how projects such as Kolla Ansible (they max out at 5 workers) configure these settings.

This appears to be a "breakthrough" change: now I see the expected frequency of updates to the services table when checking with the MariaDB general log. I'll leave this issue open for a few more days to observe the behavior, but from my perspective this should be solved.

FTR, while researching this, I also stumbled over https://github.com/openstack/puppet-openstacklib/blob/master/lib/facter/os_workers.rb (via https://docs.openstack.org/performance-docs/latest/test_plans/openstack_api_workers/plan.html) which has interesting formulas to calculate the worker numbers, also maxing out at a quite low value: 12.

martinmo commented 5 months ago

I filed an upstream issue wrt improving the nova default settings: https://bugs.launchpad.net/nova/+bug/2063451.

martinmo commented 5 months ago

I'll leave this issue open for a few more days to observe the behavior, but from my perspective this should be solved.

Everything is quiet now, PVCs are happy – closing.

SovereignCloudStack / standards

[Other] Investigate why nova cell1 database in bare metal FITKO SCS cluster is overflowing #557