department-of-veterans-affairs / abd-vro

To get Veterans benefits in minutes, VRO software uses health evidence data to help fast track disability claims.

Deploy dead letter queue (DLQ) #3311

Closed lisac closed 2 months ago

lisac commented 3 months ago

summary

We anticipate potential complexities with deploying the dead letter queue feature (#3238). In particular, it might be necessary to deploy svc-bip-api and ep-merge in a particular order so that downtime can be minimized, if not avoided; coordination might also be required with the RabbitMQ service. This ticket is to track the approach and progress, for the purposes of 1) accounting for the work during the sprint; and 2) informing future situations where a change to RabbitMQ behavior needs to be deployed.

VRO planned to deploy this change on 8/14.

tentative plan of action

(source: #benefits-vro-engineering 8/14)

lisac commented 3 months ago

approach and findings from attempt # 1 to dev

Approach: triggered deployments of svc-bip-api and domain-ee-ep-merge-app to dev, using b995774 (latest on default branch at the time); observed an error in svc-bip-api app logs:

2024-08-14T22:00:09.275Z ERROR 1 --- [20.223.225:5672] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'auto_delete' for queue 'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false', class-id=50, method-id=10)

Restarted the apps (using the Restart option on the respective Deployments in Lens) to see if that might have any effect; however, we continued to see the Shutdown Signal in svc-bip-api logs. (ep-merge logs were quiet, a consequence of no requests being fielded in dev.)

In debugging, we inspected the queues in RabbitMQ. Sample commands, executed from the vro-rabbitmq pod, using guest:guest as the RabbitMQ user:password (actual values can be discovered by inspecting the pod):

# get the list of queues, displayed as pretty json.  
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp

# look for existence of a specific queue, such as updateClaimContentionsResponseQueue
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp | grep updateClaimContentionsResponseQueue 

# look for existence of a specific queue and display surrounding lines
$ curl -s -u guest:guest http://localhost:15672/api/queues | json_pp | grep -B 10 -A 10 updateClaimContentionsResponseQueue 
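
The same lookup can be done on the parsed JSON instead of with grep. A minimal sketch, assuming the /api/queues response is a JSON array of objects that each carry a "name" field; find_queues is a hypothetical helper and the sample payload is made up for illustration (real responses carry many more fields):

```python
import json


def find_queues(queues_json: str, needle: str) -> list[dict]:
    """Return the queue objects whose name contains `needle` (case-insensitive)."""
    queues = json.loads(queues_json)
    return [q for q in queues if needle.lower() in q.get("name", "").lower()]


# Illustrative payload only, not captured from a real broker.
sample = json.dumps([
    {"name": "getClaimDetailsQueue", "vhost": "/", "auto_delete": True},
    {"name": "updateClaimContentionsResponseQueue", "vhost": "/", "auto_delete": True},
])

matches = find_queues(sample, "updateClaimContentionsResponseQueue")
print([q["name"] for q in matches])
```

In practice the input would be the output of the curl command above, for example piped in on stdin.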

sample command to delete the BIP queues (7) (see also: https://github.com/department-of-veterans-affairs/abd-vro/blob/develop/svc-bip-api/src/main/resources/application.yaml#L35) :

curl -X DELETE -u guest:guest --location  'http://localhost:15672/api/queues/%2f/getClaimDetailsQueue' --header 'Content-Type: application/json'  
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/putClaimLifecycleStatusQueue' --header 'Content-Type: application/json' 
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/cancelClaimQueue' --header 'Content-Type: application/json' 
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/getClaimContentionsQueue' --header 'Content-Type: application/json' 
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/createClaimContentionsQueue' --header 'Content-Type: application/json' 
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/updateClaimContentionsQueue' --header 'Content-Type: application/json' 
curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/putTempStationOfJurisdictionQueue' --header 'Content-Type: application/json' 

sample code to delete the DLQ:

curl -X DELETE -u guest:guest --location 'http://localhost:15672/api/queues/%2f/vroDeadLetterQueue' --header 'Content-Type: application/json' 
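
The delete commands above all follow one URL pattern: /api/queues/<vhost>/<queue>, with the default vhost "/" percent-encoded. A minimal sketch that builds those URLs, assuming the same localhost:15672 management endpoint; queue_url is a hypothetical helper, and the script only prints the targets rather than sending any request:

```python
from urllib.parse import quote

MGMT_BASE = "http://localhost:15672/api"  # same endpoint as in the curl commands

# The seven BIP queues listed in the delete commands above.
BIP_QUEUES = [
    "getClaimDetailsQueue",
    "putClaimLifecycleStatusQueue",
    "cancelClaimQueue",
    "getClaimContentionsQueue",
    "createClaimContentionsQueue",
    "updateClaimContentionsQueue",
    "putTempStationOfJurisdictionQueue",
]


def queue_url(queue: str, vhost: str = "/") -> str:
    """Build the management API URL for a queue; vhost and name are path segments."""
    # safe="" forces "/" to be encoded (as %2F, equivalent to the %2f used above).
    return f"{MGMT_BASE}/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"


for name in BIP_QUEUES + ["vroDeadLetterQueue"]:
    print("DELETE", queue_url(name))
```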

After trying a number of things (argh, I didn't document the exact sequence, but at one point it included deleting the BIP and DLQ queues and restarting the apps), we re-inspected the Shutdown Signal (posted near the top of this comment), in particular this part: inequivalent arg 'auto_delete' for queue 'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false'. That is, the app declared the queue with auto_delete=true while the queue already on the broker had auto_delete=false. This led to finding an issue in the queue declaration; potential fix in #3308.
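
For future debugging, the mismatch can be pulled out of the log line mechanically. A minimal sketch, assuming the reply-text keeps the wording shown in the Shutdown Signal above; parse_mismatch is a hypothetical helper, not part of VRO:

```python
import re

# Matches the "inequivalent arg" portion of a PRECONDITION_FAILED reply-text.
PRECONDITION_RE = re.compile(
    r"inequivalent arg '(?P<arg>[^']+)' for queue '(?P<queue>[^']+)' "
    r"in vhost '(?P<vhost>[^']+)': received '(?P<received>[^']*)' "
    r"but current is '(?P<current>[^']*)'"
)


def parse_mismatch(log_line: str):
    """Return the mismatched declaration details as a dict, or None if absent."""
    m = PRECONDITION_RE.search(log_line)
    return m.groupdict() if m else None


line = (
    "reply-text=PRECONDITION_FAILED - inequivalent arg 'auto_delete' for queue "
    "'vroDeadLetterQueue' in vhost '/': received 'true' but current is 'false'"
)
info = parse_mismatch(line)
# "received" is what the client tried to declare; "current" is what the broker has.
print(info)
```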

Plan for next steps

  1. merge in #3308
  2. on dev: delete the vroDeadLetterQueue from rabbitmq OR redeploy rabbitmq (theory: the latter would have a new volume) - the point is to ensure there is not a vroDeadLetterQueue already in-place, and in theory the app code can create the queue without contention
  3. deploy svc-bip-api to dev; observe app logs for errors
  4. deploy ep-merge to dev; observe app logs for errors (although we don't expect to see any, due to the app not receiving traffic on the dev environment).

lisac commented 3 months ago

attempt # 2 to dev

  1. verified DLQ is deleted / does not exist

    curl -X DELETE -u <username:password> --location 'http://localhost:15672/api/queues/%2f/vroDeadLetterQueue' --header 'Content-Type: application/json'
  2. deployed 1615ee1 (contains #3308) of svc-bip-api and ep-merge

  3. verified the DLQ got created:

    curl -s -u  <username:password>   http://localhost:15672/api/queues | json_pp | grep -i name
  4. verified app logs of bip and ep-merge did not show fatal errors related to rabbitmq. ep-merge logs in general will have limited info on dev (due to low activity); but one sign that the app's connection with rabbitmq is ok:

    [2024-08-15 17:10:05] INFO     event=resumeJobsInProgress status=started total=0
    [2024-08-15 17:10:05] INFO     event=resumeJobsInProgress status=completed total=0
    INFO:     100.103.163.83:52084 - "GET /health HTTP/1.1" 200 OK
  5. to preview https://github.com/department-of-veterans-affairs/abd-vro/pull/3312, applied the policy manually: rabbitmqctl set_policy vro-max-queue-length ".*" '{"max-length":1000}' --apply-to queues. As an experiment, created a new queue and observed the policy was applied (expected, considering the wildcard used in set_policy)
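
As a rough illustration of why the new queue in step 5 picked up the policy: RabbitMQ policy patterns are regular expressions matched against queue names, so ".*" matches every name. The sketch below approximates that with Python's re module (regex dialects can differ in details); policy_applies and the queue name someNewQueue are hypothetical:

```python
import re


def policy_applies(pattern: str, queue_name: str) -> bool:
    """Approximate whether a policy pattern would match a queue name."""
    return re.search(pattern, queue_name) is not None


# ".*" matches any queue, including ones created after the policy was set.
print(policy_applies(".*", "someNewQueue"))
# A narrower, hypothetical pattern would only cover queues with a given prefix.
print(policy_applies("^vro", "vroDeadLetterQueue"))
print(policy_applies("^vro", "getClaimDetailsQueue"))
```

If the limit should only cover VRO-owned queues, a narrower pattern than ".*" would be one option to consider.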

Plan for next steps

  1. deploy to qa. re-use steps 1-4 from above (attempt # 2 to dev)
  2. get a secRel-Approved image for bip and ep-merge
    • address SecRel issues in order to generate an image approved for sandbox and higher environments; OR
    • run SecRel on just bip and ep-merge (if SecRel failures are specific to domain-xamples)
  3. deploy the SecRel-signed image to sandbox, re-using steps 1-4

lisac commented 3 months ago

update on SecRel (item 2 of the preceding comment's next steps): the attempt to get signed images by running SecRel on the two specific apps (rather than the full set of VRO apps) failed. I don't know whether this indicates that the specific apps have issues, or whether the workflow runs the Snyk static code analysis on all apps regardless of which app(s) are targeted. For the purposes of this deployment issue, noting that we don't have a SecRel-signed image that can be used for sandbox and above.

the specific SecRel attempts:

lisac commented 3 months ago

attempt # 1 to qa

  1. verified DLQ does not exist
  2. deployed 1615ee1 of svc-bip-api and ep-merge
  3. verified the DLQ got created
  4. verified app logs of bip and ep-merge did not show fatal errors related to rabbitmq. bip:
    2024-08-15T18:34:03.536Z  INFO 1 --- [           main] gov.va.vro.bip.config.RabbitMqConfig     : Creating dead letter exchange with name=vro.dlx
    2024-08-15T18:34:03.539Z  INFO 1 --- [           main] gov.va.vro.bip.config.RabbitMqConfig     : Creating dead letter queue with name=vroDeadLetterQueue
    [...]
    2024-08-15T18:34:06.423Z  INFO 1 --- [           main] o.s.a.r.c.CachingConnectionFactory       : Attempting to connect to: vro-rabbitmq:5672
    2024-08-15T18:34:06.640Z  INFO 1 --- [           main] o.s.a.r.c.CachingConnectionFactory       : Created new connection: connectionFactory#191774d6:0/SimpleConnection@1bbfd42f [delegate=amqp://guest@172.20.229.67:5672/, localPort=55284]
    2024-08-15T18:34:06.700Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable or auto-delete Exchange (bipApiExchange) durable:true, auto-delete:true. It will be deleted by the broker if it shuts down, and can be redeclared by closing and reopening the connection.
    2024-08-15T18:34:06.703Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (cancelClaimQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (putTempStationOfJurisdictionQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (getClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (createClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (updateClaimContentionsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (getClaimDetailsQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.704Z  INFO 1 --- [           main] o.s.amqp.rabbit.core.RabbitAdmin         : Auto-declaring a non-durable, auto-delete, or exclusive Queue (putClaimLifecycleStatusQueue) durable:true, auto-delete:true, exclusive:false. It will be redeclared if the broker stops and is restarted while the connection factory is alive, but all messages will be lost.
    2024-08-15T18:34:06.937Z  INFO 1 --- [           main] gov.va.vro.bip.BipApiApplication         : Started BipApiApplication in 16.614 seconds (process running for 19.37)

ep-merge:

[2024-08-15 18:31:20] INFO     event=resumeJobsInProgress status=started total=0
[2024-08-15 18:31:20] INFO     event=resumeJobsInProgress status=completed total=0

lisac commented 3 months ago

update on SecRel: should be addressed in https://github.com/department-of-veterans-affairs/abd-vro/pull/3313. 🤞
planning to deploy to dev and qa in the next half-hour.

lisac commented 3 months ago

deployment of 76583e6 to dev, qa, and sandbox successful. Applied the max-queue-length policy on sandbox.

Next step (Friday?):

Question: is it expected that the RabbitMQ instance on prod would already have an exchange named vro.dlx in place? That's what I found when running rabbitmqctl list_exchanges on prod; I also found it when running that command on prod-test. I was surprised to see this, as I'd thought that this deployment would cause that exchange to be created, and we have not deployed this to prod. Note: I did not see vroDeadLetterQueue when I queried the queues (rabbitmqctl list_queues). tl;dr: how concerned should we be that vro.dlx is established on the upper environments?

lisac commented 3 months ago

notes: commands for querying the specific vro.dlx exchange; and deleting it

curl -s -u <username:password> http://localhost:15672/api/exchanges/%2f/vro.dlx

curl -X DELETE -u <username:password> http://localhost:15672/api/exchanges/%2f/vro.dlx

lisac commented 3 months ago

tentative approach

to deploy the DLQ feature to an environment:

  1. delete queues that are declared by an app (eg delete BIP's queues) - why: the DLX won't be applied to an already established queue. for example, using getClaimDetailsQueue:

    curl -X DELETE -u <username:password> http://localhost:15672/api/queues/%2f/getClaimDetailsQueue
  2. restart / deploy the app

  3. verify that the queues are re-created, and that the dead letter exchange is specified in the arguments, eg:

    $ curl -s -u  <username:password> http://localhost:15672/api/queues/%2f/getClaimDetailsQueue | json_pp
    {
    "arguments" : {
      "x-dead-letter-exchange" : "vro.dlx"
    },
    "auto_delete" : true,
    "consumer_capacity" : 1,
    [...]
  4. Observe app logs. in particular, look out for Shutdown Signal and PRECONDITION_FAILED
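
The check in step 3 can be scripted against the queue's JSON. A minimal sketch, assuming the response has the shape shown above (an "arguments" object containing "x-dead-letter-exchange"); has_dead_letter_exchange is a hypothetical helper and the sample mirrors the truncated output from step 3:

```python
import json


def has_dead_letter_exchange(queue_info: dict, expected: str = "vro.dlx") -> bool:
    """True if the queue was declared with the expected dead letter exchange."""
    return queue_info.get("arguments", {}).get("x-dead-letter-exchange") == expected


# Trimmed sample based on the getClaimDetailsQueue output in step 3.
sample = json.loads(
    '{"arguments": {"x-dead-letter-exchange": "vro.dlx"},'
    ' "auto_delete": true, "consumer_capacity": 1}'
)
print(has_dead_letter_exchange(sample))

# A queue created before the DLQ change would have no such argument.
stale = {"arguments": {}, "auto_delete": True}
print(has_dead_letter_exchange(stale))
```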

next steps

@nelsestu @dfitchett @brostk : can you double-check me on this?

lisac commented 3 months ago

note: app deployments to prod-test were executed during a team huddle (~noon ET 8/16). however, we did not do the queue deletions and app restart noted in the preceding comment.

deployment logs:
svc-bip-api: https://github.com/department-of-veterans-affairs/abd-vro-internal/actions/runs/10422973948
ep-merge: https://github.com/department-of-veterans-affairs/abd-vro-internal/actions/runs/10422977246

lisac commented 3 months ago

notes from 8/16 4:15pm ET huddle:

on prod-test:

lisac commented 2 months ago

closing, although incomplete. as a team, we've decided to take a different approach. namely, we will not be deploying the DLQ as implemented in #3238. ref: slack 8/19 12:00 ET.

BerniXiongA6 commented 2 months ago

Hey @lisac -- there was work put into this so what do you think about pointing this ticket after the fact?

lisac commented 2 months ago

@BerniXiongA6 Good call. I've just now assigned it a 2 in zenhub.