department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
281 stars 197 forks source link

Feature Toggle Review/Refactoring #45395

Open LindseySaari opened 2 years ago

LindseySaari commented 2 years ago

Description

We should revisit the current feature toggle implementation to see if there is any way to improve. Feature toggles currently utilize Redis (Elasticache) and see ~30,000 get commands per second during business days/hours. This is due to the high number of calls to the FeatureToggles#index endpoint, which currently returns ALL feature toggles for a given request.

Background Information

We should review this endpoint to see if there is any room for improvement. Currently the vets-website does not have any uniformity when it comes to feature toggle key names (some are snake_case and some are camelCase). Additionally, the index action returns ALL toggles due to an IE11 character limit (See comments and PR). Do we even support IE11 anymore?

Other notes:

  1. We currently have ~200 feature toggles. These should be reviewed for any "dead" toggles
  2. The prod redis instance sees 30x more get commands than any of the other envs (due to feature toggle calls)
  3. While reviewing vets-api latency issues, it doesn't appear that Redis calls are a factor. Elasticache supports up to 65k concurrent connections.
  4. See the FeatureToggles#index APM board
  5. Does 1 query = 1 connection? If so, feature toggles hold about half of the connections for a given second if Elasticache can support 65k concurrent connections?

Tasks

Acceptance Criteria

How to configure this issue

RachalCassity commented 2 years ago

Currently making a list of unused Feature Toggles

RachalCassity commented 2 years ago

Feature Toggle Discovery

bah-manish commented 2 years ago

Story created for comparison too: https://app.zenhub.com/workspaces/vft-59c95ae5fda7577a9b3184f8/issues/department-of-veterans-affairs/va.gov-team/46156

joeniquette commented 1 year ago

completed from the perspective of what can be done without engineer interaction.

@little-oddball @pjhill Thoughts on messaging teams that still have these around and are "still being called" but likely just need to be merged in as real code instead of feature toggle code? Probably should spin a new ticket for that.

pjhill commented 1 year ago

Based on Rachal's document, it looks like there are around 57 feature toggles that have valid routes. Are you discussing the idea of nudging teams to merge and remove those feature toggles? We would only want to nudge teams with particularly old feature toggles. I'm thinking this is a quality issue related to improper application of the SDLC that VA.gov proposes.

It should be quite doable to use the VFS Product Directory to reach out to the PMs responsible for each corresponding product and ask them to take action.

We may want to start by updating the Feature toggles documentation, to indicate feature toggles should have a limited lifespan and maybe even propose a number for that limited lifespan like... 60 days.

joeniquette commented 1 year ago

Yea thats exactly what I was thinking Peter. Set a timeline of how long a toggle should live for, with the expectation that the code is now officially ready to be "permanent" and get rid of the flipper part of it. It should always only be seen as a temporary integration solution. 60 days seems fair to me. Anyone who currently has feature toggles over this time period I think we should message them and get them to modify their code to no longer depend on the feature toggle component of it.

joeniquette commented 1 year ago

@pjhill any updates on the "standard flipper timeline" work you planned to do? And any team reach out completed yet?

pjhill commented 1 year ago

@pjhill any updates on the "standard flipper timeline" work you planned to do? And any team reach out completed yet?

npeterson54 commented 1 year ago

Removing tech team 3 label

dcloud commented 1 year ago

IDK if this Datadog link will work, but it looks like the multiple calls to redis cache for a single V0::FeatureTogglesController#index can be slow. This 1.37s trace shows a number of rails.cache GET, redis.command GET calls at ~100ms apiece adding up. 10 of them turn into 1s of time, and the other hundreds of other calls make up most of the rest. Seems like some sort of "get many" or "get all" call, e.g. redis> KEYS flipper/*, might help.