Approach: triggered deployments of svc-bip-api and domain-ee-ep-merge-app to dev, using b995774 (latest on default branch at the time); observed an error in svc-bip-api app logs:
2024-08-14T22:00:09.275Z ERROR 1 --- [20.223.225:5672] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'auto_delete' for queue 'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false', class-id=50, method-id=10)
Restarted the apps (using the Restart option on the respective Deployments in Lens) to see if that might have any effect; however, we continued to see the Shutdown Signal in svc-bip-api logs. (ep-merge logs were quiet, a consequence of no requests being fielded in dev.)
In debugging, we inspected the queues in RabbitMQ. Sample commands executed from the vro-rabbitmq pod, using guest:guest as the RabbitMQ user:password (actual values can be discovered when inspecting the pod):
# get the list of queues, displayed as pretty json.
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp
# look for existence of a specific queue, such as updateClaimContentionsResponseQueue
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp | grep updateClaimContentionsResponseQueue
# look for existence of a specific queue and display surrounding lines
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp | grep -B 10 -A 10 updateClaimContentionsResponseQueue
Sample commands to delete the 7 BIP queues (see also: https://github.com/department-of-veterans-affairs/abd-vro/blob/develop/svc-bip-api/src/main/resources/application.yaml#L35):
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/getClaimDetailsQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/putClaimLifecycleStatusQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/cancelClaimQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/getClaimContentionsQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/createClaimContentionsQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/updateClaimContentionsQueue' --header 'Content-Type: application/json'
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/putTempStationOfJurisdictionQueue' --header 'Content-Type: application/json'
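The seven DELETE calls above can also be written as a loop. This is a minimal sketch, assuming the same management API on localhost:15672 and the guest:guest placeholder credentials. The function name delete_bip_queues and the variables RABBITMQ_API, RABBITMQ_CREDS, and DRY_RUN are made up for this illustration and are not part of any VRO tooling. By default it only prints the commands; set DRY_RUN=0 to actually delete.

```shell
#!/usr/bin/env bash
# Hypothetical helper; all names here are illustrative assumptions.
RABBITMQ_API="${RABBITMQ_API:-http://localhost:15672/api}"
RABBITMQ_CREDS="${RABBITMQ_CREDS:-guest:guest}"   # placeholder; use the pod's real values
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print only, 0 = really delete

# The 7 BIP queues named in the commands above.
BIP_QUEUES="getClaimDetailsQueue putClaimLifecycleStatusQueue cancelClaimQueue getClaimContentionsQueue createClaimContentionsQueue updateClaimContentionsQueue putTempStationOfJurisdictionQueue"

delete_bip_queues() {
  for q in $BIP_QUEUES; do
    # %2f is the URL-encoded default vhost "/"
    url="$RABBITMQ_API/queues/%2f/$q"
    if [ "$DRY_RUN" = "1" ]; then
      echo "curl -X DELETE -u <username:password> '$url'"
    else
      curl -s -X DELETE -u "$RABBITMQ_CREDS" "$url"
    fi
  done
}

delete_bip_queues
```

Keeping dry-run as the default is a deliberate choice here: deleting queues drops any messages still in them, so printing the commands first gives a chance to review the list.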
Sample command to delete the DLQ:
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/vroDeadLetterQueue' --header 'Content-Type: application/json'
After trying a number of things (argh, I didn't document the exact sequence, but at one point it included deleting the BIP and DLQ queues and restarting the apps), we re-inspected the Shutdown Signal (posted near the top of this comment), in particular this part:
inequivalent arg 'auto_delete' for queue 'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false'
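This led to finding an issue in the queue declaration: the app declares vroDeadLetterQueue with auto_delete true, while the queue already on the broker has auto_delete false. Potential fix in #3308.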
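When scanning app logs for this class of error, it can help to pull out just the mismatched argument. In the message, "received" is the value the app sent in its declaration and "current" is the value on the queue that already exists in the broker. The following is a sketch only; parse_inequivalent_arg is a hypothetical helper name, and the pattern is based solely on the message format quoted above.

```shell
# Hypothetical helper: for each "inequivalent arg" message on stdin, prints
#   <arg> <queue> <received> <current>
parse_inequivalent_arg() {
  sed -n -E "s/.*inequivalent arg '([^']+)' for queue '([^']+)' in vhost '[^']*': received '([^']*)' but current is '([^']*)'.*/\1 \2 \3 \4/p"
}

# Example input: the relevant part of the Shutdown Signal quoted above.
echo "PRECONDITION_FAILED - inequivalent arg 'auto_delete' for queue 'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false', class-id=50, method-id=10)" | parse_inequivalent_arg
# prints: auto_delete vroDeadLetterQueue true false
```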
dev: delete the vroDeadLetterQueue from rabbitmq OR redeploy rabbitmq (theory: the latter would have a new volume). The point is to ensure there is not a vroDeadLetterQueue already in place, so that in theory the app code can create the queue without contention.
svc-bip-api to dev; observe app logs for errors.
ep-merge to dev; observe app logs for errors (although we don't expect to see any, due to the app not receiving traffic on the dev environment).
1. verified DLQ is deleted / does not exist:
curl -X DELETE -u <username:password> --location 'http://localhost:15672/api/queues/%2f/vroDeadLetterQueue' --header 'Content-Type: application/json'
2. deployed 1615ee1 (contains #3308) of svc-bip-api and ep-merge
3. verified the DLQ got created:
curl -s -u <username:password> http://localhost:15672/api/queues | json_pp | grep -i name
4. verified app logs of bip and ep-merge did not show fatal errors related to rabbitmq.
ep-merge logs in general will have limited info on dev (due to low activity); but one sign that the app's connection with rabbitmq is ok:
[2024-08-15 17:10:05] INFO event=resumeJobsInProgress status=started total=0
[2024-08-15 17:10:05] INFO event=resumeJobsInProgress status=completed total=0
INFO: 100.103.163.83:52084 - "GET /health HTTP/1.1" 200 OK
To preview https://github.com/department-of-veterans-affairs/abd-vro/pull/3312, applied the policy manually: rabbitmqctl set_policy vro-max-queue-length ".*" '{"max-length":1000}' --apply-to queues. As an experiment, created a new queue and observed the policy was applied (expected, considering the wildcard used in set_policy).
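One way to confirm which policy a queue picked up is to read the "policy" field that the management API reports for the queue. This is a sketch under assumptions: queue_policy and the queue name myTestQueue are hypothetical, credentials are placeholders, and the helper is a plain text match with sed rather than a JSON parser.

```shell
# Hypothetical helper: prints the value of the "policy" field from queue JSON
# read on stdin. Prints nothing if no policy name is present.
queue_policy() {
  sed -n -E 's/.*"policy"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/p'
}

# Usage against the management API (placeholder credentials, hypothetical queue):
#   curl -s -u <username:password> http://localhost:15672/api/queues/%2f/myTestQueue | queue_policy

# Self-contained example with an illustrative JSON fragment.
printf '%s\n' '{"name":"myTestQueue","policy":"vro-max-queue-length"}' | queue_policy
# prints: vro-max-queue-length
```

From inside the pod, rabbitmqctl list_policies shows the policies defined on the vhost.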
qa: re-use steps 1-4 from above (attempt #2 to dev).
sandbox: re-use steps 1-4.
Update on SecRel (item 2 of preceding comment's next steps): the attempt to get signed images by running SecRel on the two specific apps (rather than the full set of VRO apps) failed. I don't know whether this indicates that the specific apps have issues, or whether the workflow runs the Snyk static code analysis on all apps regardless of which app(s) are targeted. For the purposes of this deployment issue, noting that we don't have a SecRel-signed image that can be used for sandbox and above.
the specific SecRel attempts:
2024-08-15T18:34:03.536Z INFO 1 --- [ main] gov.va.vro.bip.config.RabbitMqConfig : Creating dead letter exchange with name=vro.dlx
2024-08-15T18:34:03.539Z INFO 1 --- [ main] gov.va.vro.bip.config.RabbitMqConfig : Creating dead letter queue with name=vroDeadLetterQueue
[...]
2024-08-15T18:34:06.423Z INFO 1 --- [ main] o.s.a.r.c.CachingConnectionFactory : Attempting to connect to: vro-rabbitmq:5672
2024-08-15T18:34:06.640Z INFO 1 --- [ main] o.s.a.r.c.CachingConnectionFactory : Created new connection: connectionFactory#191774d6:0/SimpleConnection@1bbfd42f [delegate=amqp://guest@172.20.229.67:5672/, localPort=55284]
2024-08-15T18:34:06.700Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable or auto-delete Exchange (bipApiExchange) durable:true, auto-delete:true. It will be deleted by the broker if it shuts down, and can be redeclared by closing and reopening the connection.
2024-08-15T18:34:06.703Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (cancelClaimQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (putTempStationOfJurisdictionQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (getClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (createClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (updateClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (getClaimDetailsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.704Z INFO 1 --- [ main] o.s.amqp.rabbit.core.RabbitAdmin : Auto-declaring a non-durable, auto-delete, or exclusive Queue (putClaimLifecycleStatusQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
2024-08-15T18:34:06.937Z INFO 1 --- [ main] gov.va.vro.bip.BipApiApplication : Started BipApiApplication in 16.614 seconds (process running for 19.37)
ep-merge:
[2024-08-15 18:31:20] INFO event=resumeJobsInProgress status=started total=0
[2024-08-15 18:31:20] INFO event=resumeJobsInProgress status=completed total=0
update on SecRel: should be addressed in https://github.com/department-of-veterans-affairs/abd-vro/pull/3313. 🤞
Planning to deploy to dev and qa in the next half-hour.
Deployment of 76583e6 to dev, qa, and sandbox successful. Applied the max-queue-length policy on sandbox.
Next step (Friday?): prod-test and prod
qa
@PaulKBaumann Question: is it expected that the RabbitMQ instance on prod would already have in place an exchange named vro.dlx? That's what I found, running rabbitmqctl list_exchanges. Also found when running that command from prod-test.
I was surprised to see this, as I'd thought that this deployment would cause that exchange to be created, and we have not deployed this to prod. Note, I did not see vroDeadLetterQueue when I queried the queues (rabbitmqctl list_queues). tl;dr: how concerned should we be that vro.dlx is established on the upper environments?
notes:
commands for querying the specific vro.dlx exchange, and for deleting it:
curl -s -u <username:password> http://localhost:15672/api/exchanges/%2f/vro.dlx
curl -X DELETE -u <username:password> http://localhost:15672/api/exchanges/%2f/vro.dlx
to deploy the DLQ feature to an environment:
delete queues that are declared by an app (e.g. delete BIP's queues). Why: the DLX won't be applied to an already established queue. For example, using getClaimDetailsQueue:
curl -X DELETE -u <username:password> http://localhost:15672/api/queues/%2f/getClaimDetailsQueue
restart / deploy the app
verify that the queues are re-created, and that the dead letter exchange is specified in the arguments, e.g.:
$ curl -s -u <username:password> http://localhost:15672/api/queues/%2f/getClaimDetailsQueue | json_pp
{
"arguments" : {
"x-dead-letter-exchange" : "vro.dlx"
},
"auto_delete" : true,
"consumer_capacity" : 1,
[...]
Observe app logs. In particular, look out for Shutdown Signal and PRECONDITION_FAILED.
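The verification step above (checking that a re-created queue carries the dead letter exchange argument) can be scripted. This is a sketch, not a definitive check: has_dlx is a hypothetical helper that does a plain text match on the JSON returned by the management API, and the sample JSON below is trimmed to the fields shown in the output above.

```shell
# Hypothetical helper: succeeds if the queue JSON on stdin names the given
# dead letter exchange in its arguments. Text match only, not a JSON parser;
# the exchange name is used as a regex, so "." matches any character.
has_dlx() {
  grep -Eq "\"x-dead-letter-exchange\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage against the management API (placeholder credentials):
#   curl -s -u <username:password> http://localhost:15672/api/queues/%2f/getClaimDetailsQueue | has_dlx vro.dlx && echo ok

# Self-contained example using JSON shaped like the output above.
sample='{"arguments":{"x-dead-letter-exchange":"vro.dlx"},"auto_delete":true}'
if printf '%s\n' "$sample" | has_dlx vro.dlx; then
  echo "dlx argument present"
else
  echo "dlx argument missing"
fi
```

Because the helper reads stdin, it works on both the compact API response and the json_pp output.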
@nelsestu @dfitchett @brostk : can you double-check me on this?
Note: app deployments to prod-test were executed during a team huddle (~noon ET 8/16). However, we did not do the queue deletions and app restart noted in the preceding comment.
deployment logs:
svc-bip-api: https://github.com/department-of-veterans-affairs/abd-vro-internal/actions/runs/10422973948
ep-merge: https://github.com/department-of-veterans-affairs/abd-vro-internal/actions/runs/10422977246
notes from 8/16 4:15pm ET huddle:
on prod-test: [...] (rabbitmqctl list_queues); and looking at ep-merge app logs, we thought we saw related log messages about creating the queues (almost immediately after we executed the DELETE queue). We attempted to delete the ep-merge pod, to buy time to run the deletions, but still failed to get to a state where the queues had the dlx argument.
Closing, although incomplete. As a team, we've decided to take a different approach: namely, we will not be deploying the DLQ as implemented in #3238. Ref: Slack 8/19 12:00 ET.
Hey @lisac -- there was work put into this so what do you think about pointing this ticket after the fact?
@BerniXiongA6 Good call. I've just now assigned it a 2 in zenhub.
summary
We anticipate potential complexities with deploying the dead letter queue feature (#3238). In particular, it might be necessary to sequence deployments of svc-bip-api and ep-merge in a particular order so that downtime can be minimized, if not avoided; and coordination might also be required with the RabbitMQ service. This ticket is to track the approach and progress, for the purposes of 1) accounting for the work during the sprint; and 2) informing future situations where a change to RabbitMQ behavior needs to be deployed.
VRO planned to deploy this change on 8/14.
tentative plan of action
(source: #benefits-vro-engineering 8/14)