Summary

This document describes a feature management system that answers requests from the wallets with the correct feature flag value for their identifier.

Motivation

We need this mechanism to safely roll out the new Wallet Service to the wallets, enabling it for a controlled percentage of users so that we can limit exposure.

The idea here is to have a simpler version of Optimizely.

Guide-level explanation

The initial implementation needs the following features:

Select a percentage of the user base for each feature flag
Use the user wallet's first_address to uniquely identify users
Return a random boolean that respects the rollout percentage
Return the same value every time the same user (identified by the first_address) requests the same feature flag
Support an environment for each feature flag, e.g. production, development

The API to request a feature flag value is as follows:

GET /features/<environment>/<feature_flag>/<identifier>

Response

StatusCode: 200 Body:

{
  "value": true
}

Response on not found

StatusCode: 404 Body: empty
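
As an illustration of the request and response shapes above, the routing could look roughly like this. The `handle_get` function and the injected `lookup` callable are hypothetical names, and returning 404 for malformed paths as well as for unknown flags is an assumption of this sketch.

```python
import json
from typing import Callable, Optional, Tuple

PREFIX = "/features/"


def handle_get(path: str,
               lookup: Callable[[str, str, str], Optional[bool]]) -> Tuple[int, str]:
    """Route GET /features/<environment>/<feature_flag>/<identifier>."""
    if not path.startswith(PREFIX):
        return 404, ""
    parts = path[len(PREFIX):].split("/")
    # Exactly three non-empty path segments are expected.
    if len(parts) != 3 or not all(parts):
        return 404, ""
    environment, feature_flag, identifier = parts
    value = lookup(environment, feature_flag, identifier)
    if value is None:
        # Assumption: the lookup returns None when the flag is not found.
        return 404, ""
    return 200, json.dumps({"value": value})
```

Passing the lookup in as a parameter keeps the routing independent of the storage layer, so it can be exercised without a running store.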

Decisions:

We will not have an audience table. Instead, the context (audience) is encoded in the FeatureFlag identifier string, e.g. mobile-wallet_service_rollout

We will have a version column on both the UserFeatureFlag and the FeatureFlag tables. The FeatureFlag version increases every time the FeatureFlag is updated, and each UserFeatureFlag records the FeatureFlag version it was calculated against. This lets us re-calculate the stored value for a user whenever the FeatureFlag percentage changes.
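
The version check could work roughly as below. This is only a sketch: the `resolve` function and the plain-dict record shapes are assumptions chosen for brevity, not the stored format.

```python
import random
from typing import Dict, Optional


def resolve(flag: Dict, stored: Optional[Dict], rng=random) -> Dict:
    """Return the UserFeatureFlag record to serve, re-drawing it if stale."""
    if stored is not None and stored["version"] == flag["version"]:
        # The stored value was calculated against the current FeatureFlag.
        return stored
    # Missing or outdated record: draw again with the current percentage
    # and stamp the record with the current FeatureFlag version.
    return {
        "value": rng.random() * 100 < flag["percentage"],
        "version": flag["version"],
    }
```

Note that under this sketch a plain re-draw can flip a user who was previously enabled, even when the percentage is increased.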

Reference-level explanation

Database design

`UserFeatureFlag`

identifier - The user identifier, e.g. first_address
feature_flag - The feature flag identifier, e.g. mobile-wallet_service_rollout
environment - The environment for this stored feature flag, e.g. production or staging
value - The stored value of this feature flag for this user. It is randomized on the first request and then stored, so we always respond with the same value for the same user
version - The FeatureFlag version this UserFeatureFlag was calculated against

`FeatureFlag`

identifier - The feature flag identifier, as a string, e.g. mobile-wallet_service_rollout
percentage - The rollout percentage used when deciding whether the feature is enabled for the user
version - Incremented every time the row is updated, so that on each request we know when to invalidate the UserFeatureFlag of each user
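
For reference, the two records could be modelled as below. The field names come from the tables above; the key layout returned by `key()` and the JSON payload returned by `to_json()` are hypothetical choices for a key-value store and are not part of the design.

```python
import json
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    identifier: str    # e.g. "mobile-wallet_service_rollout"
    percentage: float  # rollout percentage, 0 to 100
    version: int       # incremented on every update


@dataclass
class UserFeatureFlag:
    identifier: str    # user identifier, e.g. the wallet's first_address
    feature_flag: str  # FeatureFlag identifier
    environment: str   # e.g. "production"
    value: bool        # stored result for this user
    version: int       # FeatureFlag version this value was drawn against

    def key(self) -> str:
        # Hypothetical key layout: one entry per (environment, flag, user).
        return f"user_feature_flag:{self.environment}:{self.feature_flag}:{self.identifier}"

    def to_json(self) -> str:
        # Only value and version are stored; the rest is already in the key.
        return json.dumps({"value": self.value, "version": self.version})
```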

Architecture design

Lambda

Since this service will be hit every time a user opens a Hathor Wallet, it needs to be scalable enough to handle usage spikes.

One potential drawback of using Lambda is the risk of costs growing out of control, but that can be mitigated by always setting hard limits in the function timeout configuration.

We can also rate limit the APIs, as the wallets will fall back to the old facade if they cannot reach our feature management APIs.
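
The fallback behaviour on the wallet side could be sketched like this. It is written in Python purely for illustration (the wallets themselves are not Python), and `fetch_flag`, `should_use_wallet_service`, and the 2-second timeout are assumptions of the example.

```python
import json
import urllib.request
from typing import Callable


def fetch_flag(url: str, timeout: float = 2.0) -> bool:
    """Fetch the flag value; raises on network errors and non-2xx responses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return bool(json.loads(response.read())["value"])


def should_use_wallet_service(url: str,
                              fetch: Callable[[str], bool] = fetch_flag) -> bool:
    """Return the flag value, or False (old facade) on any failure."""
    try:
        return fetch(url)
    except Exception:
        # Unreachable, rate limited, timed out, or not found:
        # fall back to the old facade.
        return False
```

Treating every failure as "disabled" is what makes rate limiting safe here: a rejected request degrades to the existing behaviour instead of blocking the wallet.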

Redis

The data we store is small enough to fit in memory, even with a large userbase, so I think Redis is the best option to deliver fast response times.

Redis offers different persistence options (from: https://redis.io/topics/persistence):

RDB (Redis Database): The RDB persistence performs point-in-time snapshots of your dataset at specified intervals.
AOF (Append Only File): The AOF persistence logs every write operation received by the server, that will be played again at server startup, reconstructing the original dataset. Commands are logged using the same format as the Redis protocol itself, in an append-only fashion. Redis is able to rewrite the log in the background when it gets too big.
No persistence: If you wish, you can disable persistence completely, if you want your data to just exist as long as the server is running.
RDB + AOF: It is possible to combine both AOF and RDB in the same instance. Notice that, in this case, when Redis restarts the AOF file will be used to reconstruct the original dataset since it is guaranteed to be the most complete.

I think RDB persistence is fine, since losing UserFeatureFlag records (on a forced restart, for instance) is not a catastrophic failure: the user will just get a newly randomized value from the server on the next request.

Rationale and alternatives

We have previously discussed alternatives in https://github.com/HathorNetwork/internal-issues/issues/11

The conclusion was that we would use Optimizely, as we already had a PoC working, but we found an incompatibility between our react-native version and their library, so we decided to take a step back and design this simplified solution to gather more information before deciding.

I think this initial implementation is simple enough to be developed and ready in a single sprint, which fulfills our current need: the initial rollout of the wallet service.

[Design] Initial rollout backend implementation #128