Flagsmith / flagsmith

Open Source Feature Flagging and Remote Config Service. Host on-prem or use our hosted version at https://flagsmith.com/
BSD 3-Clause "New" or "Revised" License

Improve search and display capabilities for identities in the Flagsmith UI #4016

Closed: matthewelwell closed this issue 1 month ago

matthewelwell commented 5 months ago

Currently, due to the large quantities of data involved in identity storage, and the way in which that data is stored in our SaaS platform to support the Edge API, searching and displaying additional data about identities can be very difficult.

Some of the main problems are:

  1. It is not possible to search on anything other than the identifier. This is particularly problematic when introducing non-engineering users to Flagsmith, since the identifier is often a unique key such as a UUID, which most users will not have access to.
  2. Similarly, it is not possible to tell at a glance which identity is which in the list of identities, because we only show the identifier.
  3. We do not show the total number of identities (only applicable to SaaS).

Note that this issue combines both #444 and #290.

matthewelwell commented 5 months ago

The key issue described above is (1). There are a few options that we can investigate here for a solution:

1. Add an alias function to our SDKs which will add a new, indexed parameter to our identities that can then be searched across.

We would implement something like:

flagsmith.alias(identifier="<uuid>", alias="matthew.elwell")

This could get stored against the identity and displayed alongside the identifier in the list, and the search could search across both the identifier and the alias.

Pros:

Cons:

Note that, as a temporary measure here, we could allow users to add an alias via the admin API. Customers could then either do this from the dashboard (so that once an identity has been found via its identifier, it can be aliased and found more easily next time), or iterate over their identities via the management API and alias them programmatically.
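
As a rough illustration of the second option, the iteration could look something like the sketch below. The actual admin/management API calls are injected as callables here because the real endpoint shapes aren't defined in this issue; `derive_alias` stands in for whatever customer-specific logic maps an identity to a human-readable alias.

```python
def alias_identities(list_identities, set_alias, derive_alias):
    """Bulk-alias identities via the management API (hypothetical sketch).

    list_identities: callable returning an iterable of identity dicts
    set_alias: callable(identifier, alias) performing the API update
    derive_alias: callable(identity) -> alias string, or None to skip
    """
    updated = []
    for identity in list_identities():
        alias = derive_alias(identity)
        if alias:
            set_alias(identity["identifier"], alias)
            updated.append(identity["identifier"])
    return updated
```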

2. Create a search index (where?) based on the traits for each identity

We could create a search index that looks something like:

trait_key_1:trait_value_1;trait_key_2:trait_value_2...

Then, in the search field (or a separate search input), we could add an option to choose a trait to search by and build the search query as a full text search across this field. The query would look something like trait_key:trait_value, to avoid matching multiple trait keys that happen to have similar values.

Note that we may want to have people define the traits that they want to be able to search on, rather than building the search index for all traits for all identities which might get unmanageably large.
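
A minimal sketch of the idea, assuming traits arrive as a plain key/value dict and that an optional allow-list restricts which traits get indexed (the function and parameter names here are illustrative, not from the codebase):

```python
def build_trait_index(traits, searchable_keys=None):
    """Flatten an identity's traits into a single indexable string.

    searchable_keys: optional allow-list, since indexing every trait for
    every identity could become unmanageably large.
    """
    items = sorted(traits.items())
    if searchable_keys is not None:
        items = [(k, v) for k, v in items if k in searchable_keys]
    return ";".join(f"{k}:{v}" for k, v in items)


def matches(index, trait_key, trait_value):
    # Match the exact "key:value" pair so a value shared by several
    # trait keys does not produce false positives.
    return f"{trait_key}:{trait_value}" in index.split(";")
```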

Pros:

Cons:

matthewelwell commented 5 months ago

I think for SaaS (more specifically the Edge API), we'd want to look into using DynamoDB streams to trigger a lambda which will update a new model in Django which we can use to search across to get the results, before hitting dynamodb.

This would be a significant undertaking, however: probably a few weeks of work and testing. We would also need to work out how to migrate the existing data into the postgres models in the first place.

For self hosted, we could probably add this functionality quite easily by just directly searching across the traits as the data for a self hosted install would not be as large as for our SaaS environment.
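
For self-hosted, the direct search reduces to something like the following (shown here over plain dicts; a real implementation would be an ORM queryset filter, e.g. a case-insensitive `icontains` lookup on the trait value, which is an assumption on my part):

```python
def search_identities_by_trait(identities, trait_key, query):
    """Case-insensitive substring search across one trait per identity.

    identities: iterable of dicts like {"identifier": ..., "traits": {...}}
    """
    query = query.lower()
    return [
        identity
        for identity in identities
        if query in str(identity.get("traits", {}).get(trait_key, "")).lower()
    ]
```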

novakzaballa commented 4 months ago

I think for SaaS (more specifically the Edge API), we'd want to look into using DynamoDB streams to trigger a lambda which will update a new model in Django which we can use to search across to get the results, before hitting dynamodb.

I love this idea for self-hosted; I remember it was also suggested by @dabeeeenster to handle identity overrides in local evaluation. For SaaS, I recommend using a cheaper and more efficient solution for large data sets. This type of use case is ideal for a Data-Lake/Data-Warehouse solution. As I have suggested several times, we could:

That will allow the customers to make queries like:

Another advantage is that in the future, we could offer data analysis ourselves if we want.

This would allow us to store all/any historical and non-operational data here and access it by any criteria. We could create materialized views for the most used access patterns, so our customers could access and analyze their information in any way they like, but that is out of scope for this particular issue.

matthewelwell commented 2 months ago

I've begun investigating this a little further. I have made a start on some PoC code for option 2 in my comment above (here). See the WIP PR here.

Some important notes:

  1. When using dynamo streams and global replication, it is sufficient to connect to a stream in a single region, all replicated writes will also trigger the stream.
  2. I've done some basic maths on the AWS pricing (although don't quote me on it) and it doesn't look like it will be expensive. See additional calculations / notes here.

Questions to answer:

  1. What service will actually consume the DDB stream? Probably Lambda? But then how do we eventually get the data into postgres? RDS proxy? An endpoint in the core API to queue a task?
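
If Lambda does end up consuming the stream, the transformation step might look roughly like this. The record shape below is the standard DynamoDB stream event format, but the attribute names ("identifier", "dashboard_alias") and the idea of returning a payload for the core API to queue are assumptions for illustration:

```python
def extract_identity_updates(event):
    """Pull identifier/alias pairs out of a DynamoDB stream event."""
    updates = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"].get("NewImage", {})
        identifier = image.get("identifier", {}).get("S")
        alias = image.get("dashboard_alias", {}).get("S")
        if identifier:
            updates.append({"identifier": identifier, "dashboard_alias": alias})
    return updates


def handler(event, context):
    # A real handler would push these updates to postgres, e.g. via an
    # RDS proxy or a core API endpoint that queues a task (both still
    # open questions above); here we just return the extracted payloads.
    return extract_identity_updates(event)
```
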

matthewelwell commented 1 month ago

@kyle-ssg In this PR #4569 I have added a new field to the edge identities called "dashboard_alias".

From a FE perspective we need to:

  1. Display it on the detail view of an identity
  2. Allow an option to update it via the detail view of an identity
  3. Add functionality to search by dashboard_alias (by simply searching for dashboard_alias:<alias>)
  4. Maybe tidy up my bad implementation of the dashboard alias in the list view?
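
The search syntax in point 3 amounts to splitting the query on a known prefix before deciding which field to search. A minimal sketch of that routing (the function name and the fallback-to-identifier behaviour are assumptions, not the FE implementation):

```python
def parse_identity_search(query):
    """Route a search box query to a field.

    "dashboard_alias:matthew" searches the alias field; any other
    query falls back to the existing identifier search.
    """
    prefix = "dashboard_alias:"
    if query.startswith(prefix):
        return "dashboard_alias", query[len(prefix):]
    return "identifier", query
```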