artsy / README

:wave: - The documentation for being an Artsy Engineer
Creative Commons Attribution 4.0 International
1.1k stars 120 forks source link

RFC: Consolidate Eigen feature flags #536

Closed MounirDhahri closed 5 months ago

MounirDhahri commented 6 months ago

This RFC proposes the consolidation of feature flags within Eigen into a single platform: Unleash.

Reasoning

Misalignment of Feature Flags Across Surfaces

Across various Artsy surfaces, feature flags are managed using Unleash. Engineers are accustomed to adding flags there, using supporting tools (https://tools.artsy.net/feature-flags) and directly through https://unleash.artsy.net. However, Eigen continues to utilize Echo for feature toggles, leading to inconsistency and potential operational challenges.

Complexity in Running Experiments

Conducting A/B tests in Eigen poses challenges, necessitating the use of two flags: - An Echo flag to toggle when development is complete - An Unleash flag to define the split This dual-flag approach complicates experimentation and requires additional coordination.

Limitations in Rollout Strategies

Echo flags lack advanced rollout capabilities present in Unleash. Gradual rollouts and user ID-based rollouts, critical for controlled feature releases, are not easily achievable. Migrating to Unleash would provide these capabilities out of the box, enhancing our deployment strategies.

Management of Legacy Flags

Unlike our experiments infrastructure, Eigen lacks a fallback mechanism for missing flags from Echo. Consequently, outdated flags are retained in echo.json to prevent unintended feature toggling for legacy clients and reusing the same key. This approach is "unsustainable" and could benefit from a safer approach around archiving old flags.

Exceptions

No exceptions are warranted. No more using feature flags from Unleash.

Additional Context

Discussions on this matter have occurred previously, notably here and here.

Resolution of this RFC

  1. Agreement on retiring Echo and transitioning all feature flag management to Unleash.
  2. Formulation of a technical plan, documented and added to Sapphire's board for further action.
  3. Implementation of the proposed technical plan, following up on the suggested approach.
araujobarret commented 6 months ago

1000% in favor of that, thanks for taking the initiative!

Small question, are we keeping the same concept of readyForRelease when moving to Unleash or are we exploring a different approach, maybe by versions? Perhaps this question is more for the tech plan rather than here.

gkartalis commented 6 months ago

I am also in favor of that 100% there are some things though that we need to take into account before we start doing that:

lidimayra commented 6 months ago

I don't work closely with Eigen, and the technical implications are unclear to me. But if this is feasible, it would be great!

In the past, the former NX team took 1 month to realize a feature wasn't released due to our lack of understanding of feature flags handling in Eigen (ref). There are, of course, several other steps that we missed and should have prevented something like that. We all learned from it as a team. Still, if a unified process for feature flags is possible, it can make a huge difference eng-wide to avoid issues like that in the future.

joeyAghion commented 6 months ago

What happens with the already existing echo feature flags that we do have out there?

Since Echo deploys to a S3 bucket, we could potentially leave the static JSON file there indefinitely for legacy clients.

brainbicycle commented 6 months ago

mostly in favor of consolidating same question as @gkartalis here:

Echo also has some extra capabilities like killswitch for older versions should we keep it around just for this or should we explore other options? We could just deprecate feature flags from Echo and keep the rest of the stuff.

also I think we need some form of readyForRelease or a way of not shipping in progress work for this to replace Echo feature flags. Maybe warrants some thought since that continues to be a problem but with that existing I think there is still always a chance someone forgets to release something no matter if the flag comes from echo or unleash.

MounirDhahri commented 6 months ago

@araujobarret

Small question, are we keeping the same concept of readyForRelease when moving to Unleash or are we exploring a different approach, maybe by versions? Perhaps this question is more for the tech plan rather than here.

Imo, we should keep the same logic we have currently, the reason behind that is that we kind of get out of the box release all upcoming versions with it without much effort

MounirDhahri commented 6 months ago

@gkartalis

What happens with the already existing echo feature flags that we do have out there?

In order not to break any current behavior and to make this change seamless, I would suggest migrating the existing flags to unleash.

Since mobile releases are a bit different and older versions stay active for some users for a longer period what happens if somebody accidentally archive a feature flag that was created in unleash?

If the flag is archived, it's no longer going to be available to client SDKs. This triggers Eigen's logic of returning the fallback variant. This is IMO fine, it's actually better than now since we don't handle that. Ideally, we should not archive flags unless we see they're not getting any traffic.

We need an explicit naming conventions for the feature flags and the ab test flags.

We actually have spoke about this before here https://github.com/artsy/README/issues/507. Adding a feature flag from https://tools.artsy.net makes that quite easy, it's only not trivial when adding through https://unleash.artsy.net. Our current feature flags use AREnableXFeature pattern. This came from when we were only iOS and it was the preferred pattern, I am fine with moving over that.

Echo also has some extra capabilities like killswitch for older versions should we keep it around just for this or should we explore other options?

I believe it makes sense to move those to unleash as well, but initially, I would do @joeyAghion 's suggestion and keep them in json in s3. So far, in the past 3 years. I barely recall one time that we changed them, so I would be fine with migrating them only at a later stage

MounirDhahri commented 6 months ago

@lidimayra

In the past, the former NX team took 1 month to realize a feature wasn't released due to our lack of understanding of feature flags handling in Eigen (ref). There are, of course, several other steps that we missed and should have prevented something like that. We all learned from it as a team. Still, if a unified process for feature flags is possible, it can make a huge difference eng-wide to avoid issues like that in the future.

Same thing happened to us in Onyx and it was the initial thing that motivate me to look into this further, I notes some issues in Notion and shared them with the team at #ab-testing over slack. We came up with a checkbox to make sure that the best practices are clear. You can check it here https://github.com/artsy/eigen/blob/main/docs/a_b_testing_best_practices.md. I think a similar doc would be great for other surfaces as well

brainbicycle commented 6 months ago

we do change the killswitch semi-regularly: https://www.notion.so/artsy/0b3cda79f9884a77a4950798410dc753?v=6f87f04d4a644bc787ed861637238efc it is seldom but also a pretty critical functionality to be sure we can retire legacy features. some of this stuff is not really equivalent to a feature flag so not sure it belongs in unleash: https://github.com/artsy/echo/blob/5da4cf7efea38fecede02efe40bbda83a0339d32/Echo.json5#L178

( a fair bit probably can be retired though)

damassi commented 6 months ago

Been dreaming about this one for a long time! Thank you. The multiple FF layers are so confusing 👍

After wiring in Unleash into Folio the other day it also got me wondering why this layer is so complex in Eigen, and if it needs to be. As far as flag environments go, we've got the beginnings of convos in Folio, production flag is unchecked till its ready and things are hidden from users. When its ready, check the prod flag in our tools FF admin. If something goes wrong during the eventual release, we uncheck prod in tools and folks are none the wiser 🤷 And when the feature is stable, we remove the flag. Everything remains simple and its no big deal. There's a process, and guardrails.

If we (needed) another layer like ready for release (I don't think we do!) maybe we could spin up another env in Unleash. Right now we have staging / prod, and it (seems) like we could add one more. (I hope we don't do this though.)

Also, things like kill-switches and the like are pretty much unnecessary now thanks to CodePush.

damassi commented 6 months ago

Regarding A/B testing and enforced naming, because we have the ability to manage Feature Flags in our tools admin, this is trivial. We use the drop down to append the team name now:

Screenshot 2024-03-15 at 8 42 29 AM Screenshot 2024-03-15 at 8 44 37 AM

And if we needed to update this with another input that controls the naming via the experiment checkbox, we could also do that trivially:

Screenshot 2024-03-15 at 8 42 39 AM
brainbicycle commented 6 months ago

Also, things like kill-switches and the like are pretty much unnecessary now thanks to CodePush.

disagree on this one, codepush does not eliminate old versions existing in the wild, and codepush updates can only target compatible versions so old versions will be unaffected. Plus there is the possibility of bugs in native code or libraries not addressable by codepush.

brainbicycle commented 6 months ago

Been dreaming about this one for a long time! Thank you. The multiple FF layers are so confusing 👍

After wiring in Unleash into Folio the other day it also got me wondering why this layer is so complex in Eigen, and if it needs to be. As far as flag environments go, we've got the beginnings of convos in Folio, production flag is unchecked till its ready and things are hidden from users. When its ready, check the prod flag in our tools FF admin. If something goes wrong during the eventual release, we uncheck prod in tools and folks are none the wiser 🤷 And when the feature is stable, we remove the flag. Everything remains simple and its no big deal. There's a process, and guardrails.

If we (needed) another layer like ready for release (I don't think we do!) maybe we could spin up another env in Unleash. Right now we have staging / prod, and it (seems) like we could add one more. (I hope we don't do this though.)

I am confused about how this works with a flag on the server side. The scenario I am thinking about is a long running feature in dev behind a flag. While that flag is in various stages of dev multiple versions of the app get released, say versions 1, 2, 3. Since the flag is disabled great, no one sees it but the actual code in the app is incomplete until version 3. We release version 3 enable the feature flag and then users still on 1 and 2 see broken code. How do we get around this without a check in the client or a version check on the server?

damassi commented 6 months ago

We lean into app store releases and auto-updates, and if we need extra security around this we can wire up an intermediate layer. Additionally, we have CodePush! the app will reboot with latest, and there's nothing that a user can do about it. I do see your point tho 👍

Re kill-switch, we can have this extra security, but ideally we lean into this being a RN app. What's truly necessary? If we're not constantly pushing kill-switch-necessary native code updates, we shouldn't architect a system around something that's a rare occurrence.

brainbicycle commented 6 months ago

killswitch is something you ideally should need seldomly but are very happy to have when we do need it. basically every app I have worked on ended up building some functionality like this over time otherwise you get stuck maintaining old apis and functionality indefinitely. For example over the last few years we have used it to force upgrade users on old versions that were using old apis with security concerns. This targeted force upgrading is just not possible with codepush, it serves a different purpose in my mind. I guess my question would be why get rid of it if we already have it, just leave it in echo as peace of mind for when we need it. Why not just restrict this RFC to just deprecating feature flags from echo since that seems to be the main pain point?

separately, shipping in progress/broken code in my opinion is not acceptable and something we need to actively have a plan to prevent. It is just a reality of mobile dev that multiple versions of the app are going to exist in the wild, codepush or not. I think this is a very solvable problem with feature flags in unleash, just think it needs to be solved.

joeyAghion commented 6 months ago

Re. kill switches, I wonder if we (in the long term) can rely on Unleash for those too? Basically we'd just need a feature defined that specifies a minimum compatible app version. There seems to be some built-in support for this, for strategy constraints to govern whether a feature is enabled via semantic version comparisons.

MounirDhahri commented 5 months ago

Why not just restrict this RFC to just deprecating feature flags from echo since that seems to be the main pain point

To be honest, my primary objective with this RFC is to streamline the management of feature flags by consolidating them within Unleash rather than Echo. While I understand the importance of discussing other values within Echo, I believe it's best to address that separately. This isn't to dismiss their significance, as they do intersect with our current focus, but rather to ensure clarity and focus on resolving the confusion surrounding feature toggles and experiments. I will follow-up with these broader discussions separately at the practice

araujobarret commented 5 months ago

That's what I understood from this RFC, all the details would be discussed/agreed upon at the tech plan level considering the idea here and what we want to achieve is feasible.

brainbicycle commented 5 months ago

Alright but I am just reacting to: No exceptions are warranted. Echo's retirement is recommended. which makes it sound like we are killing echo so was a little concerned. Can we update this to reflect that we are only talking about feature flags?

MounirDhahri commented 5 months ago

Good point, will update @brainbicycle

MounirDhahri commented 5 months ago

@brainbicycle Updated the description as well as the RFC title. Thank you.

MounirDhahri commented 5 months ago

Resolution

I Spoke further about this with folks separately and we are good to go 🚀 . This work is now part of Sapphire OKRs for the next quarter. I will lead this effort in Eigen along with @mc-jones

Level of Support

2: Positive feedback.

Next Steps

Start with the tech plan

Exceptions

Only feature flags will be migrated from Echo, everything else will remain as it is.