tung2744 opened 1 year ago
I would like to add a config to authgear.yaml to configure the search implementation:

```yaml
search:
  implementation: "elasticsearch" # or "database"
```
And a secret config for the search database:

```yaml
secrets:
- data:
    database_schema: public
    database_url: postgres://postgres:postgres@127.0.0.1:5432/postgres?sslmode=disable
  key: search.db
```
Practically, maybe it is the same database as auth.
@louischan-oursky Do you have any opinion?
What does the search SQL query look like? Do we support all the previous capabilities? For example, searching gmail.com should find all users with any email address ending in @gmail.com. How well does PostgreSQL handle documents in languages other than English, like Chinese?
I found there are performance issues during implementation, and I am still tuning the query & index. Let me update here when I have a final query.
> How well does PostgreSQL handle documents in languages other than English, like Chinese?

Given the text 香港長沙灣長順街11號長城工廠大廈, run this SQL:

```sql
SELECT * FROM jsonb_to_tsvector('english', '{"address": "香港長沙灣長順街11號長城工廠大廈"}', '["string", "numeric"]'::jsonb);
```

You will see this result:

```
'香港長沙灣長順街11號長城工廠大廈':1
```

And given the text "香港,長沙灣,長順街,11號 長城工廠大廈", you will see this result:

```
'11號':4 '長城工廠大廈':5 '長沙灣':2 '長順街':3 '香港':1
```
We can observe that any run of consecutive Chinese characters is treated as a single word, but if there is punctuation or whitespace in between, the text is split into multiple words.
'english' is the text search configuration. Running \dF with psql lists the available configurations. On my local instance they are:
```
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
```
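The configuration mainly controls stemming and stop-word removal. For illustration (a quick example, not from the test above):

```sql
-- 'english' applies Snowball stemming; 'simple' only lowercases tokens.
SELECT to_tsvector('english', 'Running quickly');  -- => 'quick':2 'run':1
SELECT to_tsvector('simple', 'Running quickly');   -- => 'quickly':2 'running':1
```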
Elasticsearch offers icu_analyzer, which includes icu_tokenizer. The ICU tokenizer uses the Unicode text segmentation algorithm to split CJK text into words. We may need to replicate that logic in Golang, e.g. with https://github.com/rivo/uniseg.
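As a rough sketch of that approach: if the application pre-segments CJK text and joins the words with spaces before building the document, the built-in parser can then split it, as the punctuated example above already shows:

```sql
-- Hypothetical: assume the Go side has already segmented the address
-- into words (e.g. with rivo/uniseg) and joined them with spaces.
SELECT jsonb_to_tsvector('english',
    '{"address": "香港 長沙灣 長順街 11號 長城工廠大廈"}',
    '["string", "numeric"]'::jsonb);
-- => '11號':4 '長城工廠大廈':5 '長沙灣':2 '長順街':3 '香港':1
```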
```sql
CREATE TABLE _search_user
(
    id                            text PRIMARY KEY,
    app_id                        text NOT NULL,
    app_ids                       text[]
        GENERATED ALWAYS AS (ARRAY[app_id]) STORED,
    created_at                    timestamp without time zone NOT NULL,
    updated_at                    timestamp without time zone NOT NULL,
    last_login_at                 timestamp without time zone,
    is_disabled                   boolean NOT NULL,
    emails                        jsonb NOT NULL DEFAULT '[]'::jsonb,
    email_local_parts             jsonb NOT NULL DEFAULT '[]'::jsonb,
    email_domains                 jsonb NOT NULL DEFAULT '[]'::jsonb,
    preferred_usernames           jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_numbers                 jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_number_country_codes    jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_number_national_numbers jsonb NOT NULL DEFAULT '[]'::jsonb,
    oauth_subject_ids             jsonb NOT NULL DEFAULT '[]'::jsonb,
    details                       jsonb NOT NULL DEFAULT '{}'::jsonb,
    details_tsvector              tsvector
        GENERATED ALWAYS AS (jsonb_to_tsvector('english', details, '["string", "numeric"]')) STORED
);
```
```sql
CREATE INDEX _search_user_app_id ON _search_user (app_id);
CREATE INDEX _search_user_app_id_created_at ON _search_user (app_id, created_at);
CREATE INDEX _search_user_app_id_last_login_at ON _search_user (app_id, last_login_at);
CREATE INDEX _search_user_gin ON _search_user USING GIN (
    app_ids,
    emails,
    email_domains,
    email_local_parts,
    preferred_usernames,
    phone_numbers,
    phone_number_country_codes,
    phone_number_national_numbers,
    oauth_subject_ids,
    details_tsvector);
```
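For illustration, the generated details_tsvector column holds the lowercased, stemmed tokens of every string and numeric value in details. For one of the sample documents used in the test data below:

```sql
SELECT jsonb_to_tsvector('english',
    '{"name": "Chan Tai Man", "address": "Sha Tin, NT"}'::jsonb,
    '["string", "numeric"]'::jsonb);
-- Produces tokens like 'chan', 'tai', 'man', 'sha', 'tin', 'nt'.
```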
Data generation script:

```sql
-- {{NUMBER_OF_ROWS}} is a placeholder, e.g. 10000000 for the 10M-row test.
WITH id AS (
    SELECT generate_series(1, {{NUMBER_OF_ROWS}}) AS id
),
drows AS (
    SELECT
        id.id AS id,
        'app'::text || (id.id % 10)::text AS app_id,
        NOW() + make_interval(mins => id.id) AS created_at,
        NOW() + make_interval(mins => id.id) AS updated_at,
        CASE WHEN (id.id % 10 = 0) THEN NULL
             ELSE NOW() + make_interval(mins => id.id)
        END AS last_login_at,
        CASE WHEN (id.id % 10 = 1) THEN TRUE
             ELSE FALSE
        END AS is_disabled,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["' || ('user' || id.id::text || '@example.local') || '"]'
        END::jsonb AS emails,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["' || ('user' || id.id::text) || '"]'
        END::jsonb AS email_local_parts,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["example.local"]'
        END::jsonb AS email_domains,
        CASE WHEN (id.id % 10 = 3) THEN '[]'
             ELSE '["' || ('user' || id.id::text) || '"]'
        END::jsonb AS preferred_usernames,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["+852' || (40000000 + id.id) || '"]'
        END::jsonb AS phone_numbers,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["+852"]'
        END::jsonb AS phone_number_country_codes,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["' || (40000000 + id.id) || '"]'
        END::jsonb AS phone_number_national_numbers,
        CASE WHEN (id.id % 10 = 5) THEN '[]'
             ELSE '["oauth-' || (1000000000000 + id.id) || '"]'
        END::jsonb AS oauth_subject_ids,
        CASE WHEN id.id % 3 = 0 THEN '{"name": "Chan Tai Man", "address": "Sha Tin, NT"}'
             WHEN id.id % 3 = 1 THEN '{"name": "Wong Tai Sin", "address": "Tsim Sha Tsui, Kowloon"}'
             ELSE '{"name": "Chi Sin", "address": "Causeway Bay, Hong Kong Island"}'
        END::jsonb AS details
    FROM id
)
INSERT INTO _search_user (
    id, app_id, created_at, updated_at, last_login_at,
    is_disabled, emails, email_local_parts, email_domains,
    preferred_usernames, phone_numbers,
    phone_number_country_codes, phone_number_national_numbers,
    oauth_subject_ids, details
) SELECT * FROM drows;
```
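A hypothetical sanity check on the generated data (not part of the original script):

```sql
-- Each of app0 - app9 should hold an equal share of the rows.
SELECT app_id, COUNT(*) FROM _search_user GROUP BY app_id ORDER BY app_id;
```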
Two queries were used for testing:

```sql
-- Q1: This query matches only a very small number of rows.
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND (
        su.app_ids @> ARRAY['app0'] AND (
            su.details_tsvector @@ websearch_to_tsquery('english', 'user100050') OR
            su.emails ? 'user100050' OR
            su.email_domains ? 'user100050' OR
            su.email_local_parts ? 'user100050'
        )
    )
ORDER BY su.created_at DESC
LIMIT 100;
```

```sql
-- Q2: This query matches a lot of rows.
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND (
        su.app_ids @> ARRAY['app0'] AND (
            su.details_tsvector @@ websearch_to_tsquery('english', 'example.local') OR
            su.emails ? 'example.local' OR
            su.email_domains ? 'example.local' OR
            su.email_local_parts ? 'example.local'
        )
    )
ORDER BY su.created_at DESC
LIMIT 100;
```
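Relating this back to the earlier question about domain search: under this schema, searching gmail.com would match users whose email address ends with @gmail.com through the email_domains column. A hypothetical query in the same shape as Q1/Q2:

```sql
-- Hypothetical: find users whose email address ends with @gmail.com,
-- served by the GIN index on the email_domains jsonb array.
SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND
    su.app_ids @> ARRAY['app0'] AND
    su.email_domains ? 'gmail.com'
ORDER BY su.created_at DESC
LIMIT 100;
```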
Test with 10M rows, 1M rows in each app (app0 - app9): "example.local" matches a lot of rows; the result set is still large, therefore the query becomes very slow.

Test with 1M rows, 100K rows in each app (app0 - app9): the query is fast as long as LIMIT is small and OFFSET is not large.

@louischan-oursky @fungc-io @chpapa
I've tried to do a performance test on searching with PostgreSQL, and found that there will be performance issues when the query result is large enough. We might need to decide on one of the approaches I listed in "Possible actions". Or maybe see if you have other suggestions?
In Elasticsearch, we used the UAX URL Email Tokenizer for any searchable word. It is a tokenizer based on the standard tokenizer, which uses the Unicode Text Segmentation algorithm.
This is similar to the N-Gram Tokenizer of Elasticsearch. I didn't use trigrams to implement the search because our original search does not use an n-gram tokenizer. It might be useful if we want to search and sort users by similarity score.
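A minimal sketch of what trigram-based search could look like, assuming the pg_trgm extension (not part of the current implementation):

```sql
-- Hypothetical: rank users by trigram similarity to the search keyword.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT id, similarity(emails::text, 'user10005') AS score
FROM _search_user
WHERE emails::text % 'user10005'  -- % is the pg_trgm similarity operator
ORDER BY score DESC
LIMIT 100;
```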
As it is very time-consuming to generate test data in Elasticsearch, in the end I only made a test with 1M rows sharing the same app_id. This test was performed with a single-node Elasticsearch running locally.

When testing with a query which matches about ~80k rows: results can only be paginated up to a limit (the from parameter), which is by default 10000, and the reported total hit count is capped at the same default. "track_total_hits": true can force it to return an accurate count; the impact on performance is not observable in this test.
Problem

To reduce the deployment requirements, consider removing Elasticsearch as a dependency of Authgear.
Currently Elasticsearch is used by the search function of the Admin API. See if PostgreSQL can implement the same features.
Appetite
2 weeks
Todo
See: https://www.postgresql.org/docs/current/datatype-textsearch.html
- Add a _search_user table:
  - id: text
  - app_id: text
  - document: tsvector. This column should be the concatenation of all the columns that the current Elasticsearch query searches.
- Implement Service.ReindexUser to upsert / delete rows in _search_user.
- Implement Service.QueryUser to convert the search keyword into a tsquery.
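For the keyword-to-tsquery conversion, the built-in websearch_to_tsquery is one candidate building block (a sketch; what Service.QueryUser will actually do is still to be decided):

```sql
-- websearch_to_tsquery accepts free-form user input and never raises
-- a syntax error, unlike to_tsquery.
SELECT websearch_to_tsquery('english', 'chan tai man');
-- => 'chan' & 'tai' & 'man'
```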