Closed: anjastrunk closed this issue 1 month ago
(Note: we initially used #414 as the sole issue to track both the provisioning of the virtual and the bare metal test clusters. For the sake of better documentation, we retroactively created this separate issue for the bare metal setup.)
Current state is that the installation is completed and we have a working bare metal Yaook cluster. Thanks to the contributions of @cah-hbaum this OpenStack cluster is already SCS compliant, i.e., it fulfills the stabilized standards of the IaaS track (details can be found in #415 and in https://github.com/SovereignCloudStack/standards/tree/do-not-merge/scs-compliant-yaook/Informational). Here's a recap of what happened until we got to this point:
Initial preparation for the Yaook bare metal installation started at the beginning of March. This involved a "rehearsal" of the installation procedure on an existing, simpler test cluster, because this was my first time conducting such an install and the hardware for the actual installation had not yet been commissioned.
During the rehearsal we already ran into network setup issues that we needed to work around.
The installation of the bare metal test cluster was conducted between March 11th and March 19th but we again bumped into a lot of technical difficulties. Debugging and fixing these was a bit more time consuming than usual because I am not yet 100% accustomed to the interactions of all the components.
For example, nodes ended up in a "clean failed" state. It took some time to debug, but the Yaook bare metal logs contained the hint ("certificate not yet valid") and we finally figured out this was caused by an extremely out-of-sync hardware clock.
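For reference, a generic sketch for diagnosing and fixing such clock skew on a node (not the exact commands used here; assumes chrony as the NTP client):

```
timedatectl status      # compare system time against NTP
hwclock --show          # read the hardware (RTC) clock
chronyc makestep        # step the system clock immediately (chrony only)
hwclock --systohc       # write the corrected system time back to the RTC
```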
@cah-hbaum Please provide YAML output for SCS compatible IaaS v4, to prove the cluster is SCS compliant.
The next step will involve moving the hardware from our facility to its final location.
Status update:
Status update for multiple working days:
Status update:
------------------------------------------------------------------------
Benchmark Run: Thu May 02 2024 07:01:31 - 07:29:25
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables           49374945.7 lps    (10.0 s, 7 samples)
Double-Precision Whetstone                         7331.2 MWIPS  (8.8 s, 7 samples)
Execl Throughput                                   4367.4 lps    (29.7 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks           1700297.0 KBps   (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks              465706.0 KBps   (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks           4516692.6 KBps   (30.0 s, 2 samples)
Pipe Throughput                                 2643777.7 lps    (10.0 s, 7 samples)
Pipe-based Context Switching                     249035.2 lps    (10.0 s, 7 samples)
Process Creation                                   4239.2 lps    (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                       3404.9 lpm    (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                       7774.8 lpm    (60.0 s, 2 samples)
System Call Overhead                            2399837.6 lps    (10.0 s, 7 samples)

System Benchmarks Index Score                                          1995.8

Benchmark Run: Thu May 02 2024 07:29:25 - 07:57:36
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables          396360992.9 lps    (10.0 s, 7 samples)
Double-Precision Whetstone                        58344.0 MWIPS  (9.8 s, 7 samples)
Execl Throughput                                  20889.7 lps    (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks          12927118.6 KBps   (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks             3677514.2 KBps   (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks          22497528.7 KBps   (30.0 s, 2 samples)
Pipe Throughput                                21037325.6 lps    (10.0 s, 7 samples)
Pipe-based Context Switching                    1958050.1 lps    (10.0 s, 7 samples)
Process Creation                                  44864.0 lps    (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                      65052.2 lpm    (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                       9420.2 lpm    (60.0 s, 2 samples)
System Call Overhead                           19695065.7 lps    (10.0 s, 7 samples)

System Benchmarks Index Score                                         13756.0
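The output above matches the format of UnixBench (byte-unixbench). A typical invocation looks like this (an assumption; the exact command used here is not stated in the thread):

```
# UnixBench by default runs the whole suite once single-threaded
# and once with one copy per CPU, matching the two runs above:
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench
./Run
```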
@shmelkin Can you please share how you benchmarked the VM? I would like to add this to the docs as a sample for a benchmark. We only documented fio at the moment.
Okay, this is interesting. Basically it's a "regression" from the OVS setups we are used to.
In OpenvSwitch/L3-Agent based setups, the NAT rules for ingress (and I suppose also egress) traffic for floating IPs were set up no matter whether the port to which the floating IP was bound was ACTIVE or DOWN.
In OVN, the NAT rules are only set up when the port is up.
That breaks a specific use case, which is the use of VRRP/keepalived in VMs to implement custom load balancers or other HA solutions.
(In particular, this breaks yaook/k8s which we tried to run as a "burn in" test.)
I'll bring this up in next week's IaaS call.
We looked more into the OVN issue and it seems the only viable workaround is using allowed-address
on the non VRRP port. This is somewhat sad, we'll discuss it in the IaaS call tomorrow.
In osism/terraform-base (used by osism/testbed) we do it this way (allowed-address) as well (VRRP is only used inside the virtual network and the managed VIP is only accessed from inside the same virtual network):
We do not reserve the VIPs by creating unassigned Neutron ports because we work with static IP addresses in osism/terraform-base. This is therefore not necessary.
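For reference, the workaround looks roughly like this with the OpenStack CLI (a sketch; the VIP address and port names are placeholders):

```
# Allow the VRRP-managed VIP (e.g. 10.0.0.100) on both instance ports,
# so traffic for it passes the port security filters even under OVN:
openstack port set --allowed-address ip-address=10.0.0.100 port-instance-1
openstack port set --allowed-address ip-address=10.0.0.100 port-instance-2
```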
It also looks as if this has always been the way independent of OVN. At least https://www.codecentric.de/wissens-hub/blog/highly-available-vips-openstack-vms-vrrp comes from a time when IMO there was no OVN in OpenStack (or OVN itself?).
Searched for some more references:
I think @berendt is right: if this worked without allowed_address_pairs, you would have a security issue.
By default, strict filters only allow the configured subnets and associated MACs to pass. Via allowed_address_pairs you get an allowlist to extend this where needed, e.g. for VRRP.
If it worked the other way around, arbitrary L2 or L3 traffic would be allowed to flow, which is of course insecure.
Summary of working day:
Summary of multiple working days:
We will upgrade the deployment to the current release of Yaook and Kubernetes 1.27.10, which should solve the Cinder issue and provide an up-to-date-ish version.
Today, I started by preparing necessary artifacts (WIP)
We also need to back up/snapshot the health monitor; this will be tested later today and done right before the upgrade.
I will document further progress here.
As outlined in the last comment, the deployment was upgraded to the current release of Yaook in an effort to solve the Cinder issues.
Right now, the health monitor VM isn't running anymore and we're in the process of restoring its operation (this means the links and badges in the README currently don't work or show the compliance test as failed).
After a whole lot of trouble, the deployment is ready and functional. We
- updated to 1.27.10 k8s and OpenStack Zed with yaook
How do we deal with this? From SCS's point of view, we require OpenStack 2023.2 as the minimum version, not Zed.
Oh really? Where exactly is that mentioned, I couldn't find that during my quick search.
@berendt in regards to the OpenStack version there is nothing (to my knowledge) currently in the standards. We require the OpenStack powered compute 2022.11 alongside the standards referenced here: https://docs.scs.community/standards/scs-compatible-iaas
Is there any functional reason that Zed (as in this case) would not be sufficient?
So why do we put ourselves under the burden of going through every upgrade in the reference implementation if it is not necessary? I had actually assumed that we want to demand a very up-to-date OpenStack (and also Ceph and Kubernetes). We also require the CSPs to have upgraded within 1 month. So that wouldn't be necessary?
The discussion about the update window of the reference implementation is (imho) not directly connected, since that discussion is about providing support, security updates etc. for the reference implementation. If there is no functional reason, and our conformance tests succeed (as well as OpenStack Powered Compute 2022.11), so the established standards are complied with, I see no reason to require a specific OpenStack version. Especially not if we require that certain APIs are fulfilled, since that basically allows compatibility at the API level (and would even allow an API wrapper, provided it behaves correctly). Am I overlooking something?
Having an up-to-date reference implementation is worth pursuing outside of pure standard conformance, imho.
Update for multiple working days:
Today I noticed, via the health monitor, that the apimon loop couldn't create volumes anymore. apimon logs showed that the volume quota was exceeded. Usually, APImon performs cleanup of old volumes, but it always failed with:
volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots, awaiting a transfer, or be disassociated from snapshots after volume transfer
This was caused by dangling volumes which were still shown as attaching, attached or in-use, although openstack server list didn't show any running APIMonitor instances anymore. These, in turn, seem to appear because instance creation fails with errors like:
Build of instance 68d71a43-e26e-4b34-8d93-37d5b29ab6bb aborted: Unable to update the attachment. (HTTP 500)
which – I think – leaves the Cinder database in an inconsistent state.
I stopped the apimon temporarily. Using cinder reset-state --state available --attach-status detached VOLUME_UUID for the respective volumes, I was able to reset the volume states so they could be deleted. I am now looking into the HTTP 500 "Unable to update the attachment" error.
I can only reproduce this with the load generated by the apimon. Volume operations fail not always, but quite reliably. Based on my findings, there is something going on with the message queue.
In the cinder API logs, I see timeouts waiting for a reply to an operation such as attachment_update. The volume manager says it couldn't send the reply because the queue was gone in the meantime. (Maybe the whole operation is just too slow and some kind of garbage collection for the queue kicks in too early?)
Here is a relevant example snippet from the cinder volume manager logs:
...
2024-06-11 14:12:49 INFO cinder.volume.manager Created volume successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:13:51 WARNING oslo_messaging._drivers.amqpdriver reply_3a06eb20db5d46318534af475f7c46c5 doesn't exist, drop reply to 1ffd3e1fb22d4007bc6b16c5d784f430
2024-06-11 14:13:51 ERROR oslo_messaging._drivers.amqpdriver The reply 1ffd3e1fb22d4007bc6b16c5d784f430 failed to send after 60 seconds due to a missing queue (reply_3a06eb20db5d46318534af475f7c46c5). Abandoning...
2024-06-11 14:13:51 INFO cinder.volume.manager Terminate volume connection completed successfully.
2024-06-11 14:13:51 WARNING oslo_messaging._drivers.amqpdriver reply_3a06eb20db5d46318534af475f7c46c5 doesn't exist, drop reply to 73b1e4390fd74484ab8cbfbb7e376ad2
...
And the matching ERROR with the same message ID 1ffd3e1fb22d4007bc6b16c5d784f430 from the cinder API logs:
...
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 ERROR cinder.api.v3.attachments Unable to update the attachment.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
return self._queues[msg_id].get(block=True, timeout=timeout)
File "/usr/local/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
return waiter.wait()
File "/usr/local/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
return get_hub().switch()
File "/usr/local/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
return self.greenlet.switch()
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/cinder/api/v3/attachments.py", line 250, in update
self.volume_api.attachment_update(context,
File "/usr/local/lib/python3.8/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/site-packages/cinder/coordination.py", line 200, in _synchronized
return f(*a, **k)
File "/usr/local/lib/python3.8/site-packages/cinder/volume/api.py", line 2535, in attachment_update
self.volume_rpcapi.attachment_update(ctxt,
File "/usr/local/lib/python3.8/site-packages/cinder/rpc.py", line 200, in _wrapper
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/cinder/volume/rpcapi.py", line 479, in attachment_update
return cctxt.call(ctxt,
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
result = self.transport._send(
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
return self._driver.send(target, ctxt, message,
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
return self._send(target, ctxt, message, wait_for_reply, timeout,
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send
result = self._waiter.wait(msg_id, timeout,
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait
message = self.waiters.get(msg_id, timeout=timeout)
File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get
raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 1ffd3e1fb22d4007bc6b16c5d784f430
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO cinder.api.openstack.wsgi HTTP exception thrown: Unable to update the attachment.
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO cinder.api.openstack.wsgi https://cinder-api.yaook.svc:8776/v3/2fc014a6dd014c0bb3b53494a5b86fa9/attachments/a788db16-75b0-46e8-b407-e7415b455427 returned with HTTP 500
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO eventlet.wsgi.server 10.2.1.10,127.0.0.1 "PUT /v3/2fc014a6dd014c0bb3b53494a5b86fa9/attachments/a788db16-75b0-46e8-b407-e7415b455427 HTTP/1.1" status: 500 len: 400 time: 60.0660729
...
Summary for multiple working days:
- Checked RabbitMQ cluster health with rabbitmqctl cluster_health and with rabbitmq-check.py from the yaook debugbox -> everything happy.
- Read the oslo_messaging and cinder sources to grasp how queues are created -> culprit found, it's a caching issue. The oslo_messaging library creates reply queues with a random name on first use and caches the reply queue name in the RabbitDriver/AMQPDriverBase object, to avoid creating too many queues (see method _get_reply_q() in amqpdriver.py). So basically, there is a cached reply queue name for every cinder API worker (green)thread.
- The RabbitMQ logs show the matching error: [error] <0.27146.0> operation queue.declare caused a channel exception not_found: queue 'reply_ed3fbf13f2e7488c9cbd19f0ad13588d' in vhost '/' process is stopped by supervisor
- The same presumably affects nova-scheduler, nova-conductor and so on.

Update for the last few days:
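The caching behavior described above can be illustrated with a minimal Python sketch (a simplification for illustration only, not the actual oslo.messaging code):

```python
import uuid

class AMQPDriverSketch:
    """Sketch of how a driver object caches its reply queue name."""

    def __init__(self):
        self._reply_q = None  # no reply queue until the first RPC call

    def _get_reply_q(self):
        # Created once with a random name, then reused for every call
        # made through this driver object.
        if self._reply_q is None:
            self._reply_q = 'reply_' + uuid.uuid4().hex
        return self._reply_q

driver = AMQPDriverSketch()
first = driver._get_reply_q()
# Every subsequent call returns the same cached name -- even if the
# broker-side queue has meanwhile been stopped, replies keep being
# sent to the now-dead queue and time out.
assert driver._get_reply_q() == first
```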
- Used the YAOOK_OP_VERSIONS_OVERRIDE approach to roll out the patched images for nova, nova-compute, cinder and glance without having to wait for the release of new Yaook Operator Helm Charts.
- Set rabbit_quorum_queue=true in the [oslo_messaging_rabbit] section. However, rabbitmqctl list_queues name type state reveals that classic queues are still used for the reply queues and some others. (It appears these are the transient queues.)
- In oslo.messaging, there are some semi-recent changes related to this: rabbit_transient_quorum_queue to enable the use of quorum for transient queues (reply_ and _fanout_).

Update wrt RabbitMQ / volume service issues:
- rabbit_transient_quorum_queue can't be used, this setting is too new :(
- heartbeat_rate -> in Zed, default is 2, recommended is 3 (see https://review.opendev.org/c/openstack/oslo.messaging/+/875615)
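Put together, the settings discussed above would land in the [oslo_messaging_rabbit] section of the affected services' configuration. A fragment as a sketch (the option names are real oslo.messaging options; which config files were actually touched in this deployment is not stated):

```ini
[oslo_messaging_rabbit]
# Use quorum queues for the regular RPC queues:
rabbit_quorum_queue = true
# Would also cover transient (reply_/fanout) queues,
# but is too new for the Zed-based deployment:
# rabbit_transient_quorum_queue = true
# Zed default is 2; 3 is recommended upstream:
heartbeat_rate = 3
```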
Update: today we rolled out a new RabbitMQ version and applied tuned RabbitMQ settings for all four (colocated) RabbitMQ instances (each in turn replicated 3x) to reduce CPU contention on the (OpenStack) control plane nodes.
The last change finally did resolve our RabbitMQ issue. There are no more stopped reply queues left after the rolling restart (i.e., after RabbitMQ failover) and there are also no OpenStack API errors anymore, even during the restart.
Can this be closed?
Actually, yes :tada:
Provide a productive SCS cluster as PoC for FITKO. The cluster MUST be set up with the open source Lifecycle Management Tool for OpenStack and K8S Yaook and must be SCS compliant.
In contrast to #414, the productive SCS cluster is set up on bare metal.