Feature Suggestion: Multi-region support for ELS

Is your feature request related to a problem? Please describe.

We are trying to scale out FeatBit across multiple regions, in particular the evaluation server (ELS) but have found it is dependent on the same Redis instance that the API server uses, rather than being capable of using different ones.

Describe the solution you'd like

We would like to be able to use a separate Redis instance for the ELS so apart from the first call / startup of the ELS it should be essentially self-contained and not require talking to the main API / mongo / redis instance.

Describe alternatives you've considered

We have also considered using the FeatBit Agent but the manual sync requirement is a dealbreaker for us, and it also seems somewhat unfinished with a lack of a dockerfile/container/Helm chart.

Looking through the code of the ELS we can see it uses a hybrid store which prefers Redis but can fall back to MongoDB if Redis is unavailable for some reason. This is closer to what we're looking for, but 1. it really doesn't like you removing the Redis connection string right now (we're using Pro model so Kafka for messaging) and 2. it doesn't cache the data it loads from MongoDB into memory or Redis.

We have also considered, and are still considering writing our own program to make this work, but ideally we want to work with you on a good solution that benefits everyone.

Additional context

Let me describe our rough infrastructure setup so you can see what we're talking about:

So we have a "global" region for resources which must sit in one location, and this is where FeatBit is sat currently. Then we have a bunch of other regions around the global, for example EU or US that contain apps that need to read feature flags using the FeatBit client. We want to reduce latency and improve resiliency against our "global" region potentially going down by not having anything in our other various regions directly relying on it if possible.

This is what we were hoping the evaluation server could do, act as essentially a middle-man between our apps and the "global" featbit API / mongo / etc and use it's own Redis instance to cache data locally to that particular region. What we found though is that the featbit API updates Redis directly so it doesn't get updated by the ELS when it receives the update message over Kafka, that appears to only be for updating live connected clients as far as we can tell.

As mentioned we explored the FeatBit agent which seems to be closer to the design we want, but seems unfinished and requires manual syncing at this time. We're happy for the individual instances of the ELS in each region to reach out to the API server or mongo etc to grab the initial feature flag data, but from then on it should be able to serve them from it's own Redis cache without touching that "global" region. And of course any updates should come through the Kafka queue and also get saved in it's own Redis cache too.

Is this feature something you're interested in working on?

Yes, we are interested / happy to work with you on implementing this if we can agree on a design that everyone is happy with, so this issue is us opening the conversation to see what you think about this.

If we can't agree on a design then we will likely build our own app for this use case, but hopefully as mentioned we can contribute these improvements back for everyone's benefit 🙂

Also this is all based off a light reading of the FeatBit source code so if we have any assumptions wrong here please correct us!

Thank you!

Hi @MattJeanes , thanks for your input ❤️

I can see two options for us:

Option 1

We can let ELS handle updating the Redis cache. That way, each ELS instance can have its own cache, and they can manage it on their own. To implement that, we can add a new working mode to ELS, maybe we call it "agent" mode. In agent mode, the ELS first populate the Redis cache through the Global MongoDB, then consumes message from Kafka to update it's own cache(Redis or In-Memory).

If we still use Redis cache, a little problem would be if we have multiple ELS instances in one region, they'll probably share one Redis instance, so when there's an update to the data, they'll both try to update the Redis cache because they all need to consume the data update message (so they can push the data update to their downstream SDKs).

Option 2

Add real-time update support (Polling or Long Polling or WebSocket) to the FeatBit Agent. Polling/Long Polling is easy to implement, while if we want to use WebSocket, more work needs to be done.

The problem is the lack of container/helmchart support, but I think it shouldn't be too difficult to add.

Personally, I prefer the second option because I think it's a more general solution. What do you think?

Hey @deleteLater / team!

That's a great point about the multiple ELS instances in one region. It would break the services design of being scalable really doing that, so we probably shouldn't go for that option.

As for upgrading the FeatBit Agent, we think this is indeed better but there is some additional concerns I wanted to run by you:

The FeatBit agent currently only caches to a sqlite database, which in a containerised world might mean it gets wiped every time the container is re-created. It would be fantastic if we could make this sync to a Redis instance in the same way that the API does, which leads on to...
It seems that the FeatBit agent isn't really designed to be horizontally scaled which presents a challenge compared to using the ELS as we would like to have at minimum 2 containers for reliability, and for scale as we'll be connecting these to potentially hundreds of containers using feature flags.
The FeatBit agent does not currently appear to support feeding back analytics to the data analytics stuff e.g. clickhouse

Something that could be quite interesting to solve these issues is if the FeatBit agent cached to Redis in the same format that the API does then we could use the ELS as we originally planned. That would allow us to easily horizontally scale, fix feeding back analytics and basically solve all the problems as best as I can tell.

So in effect, the FeatBit Agent essentially becomes a broker to sync the flags into separate Redis instances, which the ELS can then use. And if Redis is unavailable for some reason the ELS can fall back to the central MongoDB store as it currently would.

Also, I'm a DevOps / Platform Engineer so I can definitely get the containerisation / Helm chart side of things done for you. We also have another backend / C# dev who I might be able to bring on to help out with the actual C# code side of things.

How do you feel about this?

Sorry for the late response. We haven't had much time to work on this project lately.

I think using the FeatBit Agent as "a broker to sync the flags into separate Redis instance" is a good solution.

To implement that we need to add a new working mode called "broker" to FeatBit Agent. Currently we've defined two working modes: Agent[not implemented yet] and Offline. When working in "broker" mode, the FeatBit Agent will pull data updates from the API server and update the data in Redis, which ELS can then use.

Do you think that's ok?

Hey, no worries about the late response although we have started to work on our own solution for this in the meantime.

Our own solution we're exploring at the moment is pretty similar to how the agent is described in the docs and uses a single connection to the ELS to pull all the latest flag information which it'll also cache in its own Redis store and then downstream clients can connect to it in place of the ELS to provide that resiliency and improved latency.

We're not done yet, and as such that design is not final, but if this works we're happy to contribute this back to be integrated into the proper FeatBit Agent if that's something you're interested in.

Note that currently it's a standalone .NET 8 agent app we've built rather than modifying the existing agent for now, but they could be merged together in the future.

The difference between our and your ideas seems to be that your version still uses the ELS downstream while ours uses it upstream and takes it's place for end clients.

One problem your version does solve though is that scaling out ours will also scale out saving the data to Redis multiple times which isn't a big deal but maybe a little hacky, and it's from what I understand the reason that the API service saves to Redis instead of the ELS.

We wanted to write a solution that worked without needing to modify any FeatBit code which is why we are using the ELS upstream to grab all the flag data from but your idea may work better as a proper long term solution.

Ultimately the problem we're dealing with is that the API only updates the one Redis instance when flags are changed which was the main reason we couldn't just drop the ELS into multiple regions with their own Redis connection strings.

We do note that the ELS currently has a mongo connection string, but if I remember correctly from reading the code it's only used as a backup if it can't retrieve the flags from Redis for some reason. This is fine but we need to make sure this is optional and the app can start up and work with only a Redis connection string set.

In other words we want to make sure that even if our region containing the primary FeatBit infrastructure goes offline the individual regions used by our product can at least still start up and serve the latest feature flags from their local Redis caches.

I think the main issue we have with the current FeatBit Agent is that the storing/loading from Redis is not yet implemented - only in memory - so if the agent has to restart it won't work if it can't reach the API server, and that the sync is manual and not real time. If we could fix those two things I wonder what the difference would be between the broker and agent mode, so I'm curious if it would be perhaps better to solve those problems instead of a new mode?

it's from what I understand the reason that the API service saves to Redis instead of the ELS

I don't remember why, but it was certainly one of the reasons we did it.

it's only used as a backup if it can't retrieve the flags from Redis for some reason. This is fine but we need to make sure this is optional and the app can start up and work with only a Redis connection string set.

Yes, mongodb is only used as a fallback but currently the mongodb connection string must be set. To make it optional, we need to

update HealthCheckBuilderExtensions.cs
we should make the MongoDbClient optional or initialize it lazily

If we could fix those two things I wonder what the difference would be between the broker and agent mode, so I'm curious if it would be perhaps better to solve those problems instead of a new mode?

Yeah, I was mistaken that. We just need to implement agent mode.

Currently implemented offline mode

Use sqlite as persistent store
The API server can "bootstrap" the agent, passing feature flags and segments data in and the agent stores them in both memory and sqlite, so the next time when the agent starts, it populates the memory cache by reading the sqlite.
Store insight data (via '/track' api) and SyncHistory in sqlite

If we want to finish the agent mode, I think there's a lot of work needs to be done.

How about we just write a data-synchronizer program to synchronize data from the global MongoDb to regional Redis?

We were trying to use the evaluation server as an agent because the agent does not currently have the agent mode implemented. and if we could run the evaluation server in that way then we would not need a separate agent running in agent mode.

As @MattJeanes said, we're currently in the process of implementing our own agent that runs in agent mode so once it is ready we would be happy to contribute that back into the FeatBit agent, if agent mode is still needed.

It really depends on how much effort it would be to write a data synchroniser vs agent mode, and whether it would be better long term to run the evaluation server in this way or have a separate agent with agent mode.

featbit / featbit

Feature Suggestion: Multi-region support for ELS #665

Option 1

Option 2