department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Session key rotation deploy plan #27723

Open LindseySaari opened 3 years ago

LindseySaari commented 3 years ago

Wait for EKS.

Issue Description

To guard against brute-force decryption and other malicious attacks, session keys will be rotated monthly for vets-api. The rolling deploy poses a problem: for a brief moment, old and new instances may be up at the same time. For example, if a user creates a session that is encrypted with the new key, but their subsequent request is routed to an old server that is being torn down, the user could hit an invalid session/decryption error.

The session key rotation changes need to be deployed during the time of lowest traffic to avoid issues with the rolling deploy. After speaking with the analytics team, the 3-4am ET window consistently has the lowest traffic. In order to make this less of a burden on the BE Tools and/or Operations team, an automated deployment plan should be determined.

See the Current Rotation Documentation for additional info.


Tasks

Acceptance Criteria

omgitsbillryan commented 3 years ago

Some thoughts:

  1. A zero-downtime approach is preferable IMO. We used to (and sort of still do) have some custom code that decrypts/encrypts session cookies in order to populate sessions into redis for load testing. I think it could be repurposed into an "if the new key doesn't decrypt, then try the old key" check for a zero-downtime solution.
  2. I'm wary of having an automated "special" deployment go out at 3am with no one around to troubleshoot if things go wrong. I'd prefer a deployment during low-traffic hours, but at a time when platform team members can still reasonably troubleshoot. Such a deploy could/should also be preceded by some kind of "Monthly maintenance - you will be logged out at 10pm EST" banner on the site.
  3. You mention

    if a user creates a session and it's encrypted with the new key, but their subsequent request gets routed to an old server that's in the process of being torn down, this could result in an invalid session/decryption error for the user.

    The vets-api ELB uses connection draining. This means that once the new instances are InService, new connections will always be routed to the new instances and not the old ones. Does the same problem exist if a user comes from an old instance and goes to the new one?
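To make the "try the new key, then fall back to the old key" idea from point 1 concrete, here is a minimal sketch. It is not vets-api code: it is written in Python rather than Ruby, it uses HMAC signing from the standard library as a stand-in for cookie encryption, and `sign`, `verify`, `OLD_KEY`, and `NEW_KEY` are hypothetical names made up for illustration. It shows only the key-fallback pattern.

```python
import base64
import hashlib
import hmac
from typing import Optional, Sequence


def sign(payload: bytes, key: bytes) -> str:
    """Return 'payload.signature', both parts URL-safe base64 (hypothetical format)."""
    mac = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(mac).decode())


def verify(cookie: str, keys: Sequence[bytes]) -> Optional[bytes]:
    """Try each key in order (newest first); return the payload, or None if no key matches."""
    try:
        payload_b64, mac_b64 = cookie.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        mac = base64.urlsafe_b64decode(mac_b64)
    except ValueError:
        # Malformed cookie: wrong number of parts or invalid base64.
        return None
    for key in keys:
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        if hmac.compare_digest(mac, expected):
            return payload
    return None


# Placeholder keys, for illustration only.
OLD_KEY = b"old-key-hypothetical"
NEW_KEY = b"new-key-hypothetical"

old_cookie = sign(b"user=123", OLD_KEY)
new_cookie = sign(b"user=123", NEW_KEY)

# A new instance that knows both keys accepts sessions issued before the rotation.
print(verify(old_cookie, [NEW_KEY, OLD_KEY]))
# An old instance that knows only the old key rejects a session issued with the new key.
print(verify(new_cookie, [OLD_KEY]))
```

In this sketch the fallback covers the old-instance-to-new-instance direction asked about in point 3. The new-to-old direction from the issue description still fails unless the old instances also know the new key before it is used to issue sessions. Whether that matters in practice depends on how connection draining actually routes requests during the deploy.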