authgear / authgear-server

Open source alternative to Auth0 / Firebase Auth
https://www.authgear.com
Apache License 2.0

Remove elasticsearch from authgear #3395

Open tung2744 opened 1 year ago

tung2744 commented 1 year ago

Problem

To reduce the deployment requirements, consider removing Elasticsearch as a dependency of Authgear.

Currently, Elasticsearch is used by the search function in the Admin API. See if PostgreSQL can implement the same features.

Appetite

2 weeks

Todo

See: https://www.postgresql.org/docs/current/datatype-textsearch.html
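
For reference, a minimal sketch of the tsvector/tsquery primitive described in the docs above (an illustrative query, not Authgear code):

SELECT to_tsvector('english', 'The quick brown fox jumps') @@ to_tsquery('english', 'fox');
-- returns true: 'fox' is one of the lexemes of the document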

fungc-io commented 1 year ago

Older issue: https://github.com/authgear/authgear-server/issues/1978

tung2744 commented 11 months ago

I would like to add a config to authgear.yaml to configure the search implementation:

search:
  implementation: "elasticsearch" # or "database"

And a secret config for the search database

secrets:
- data:
    database_schema: public
    database_url: postgres://postgres:postgres@127.0.0.1:5432/postgres?sslmode=disable
  key: search.db

In practice, it may be the same database as the auth database.

@louischan-oursky Any opinions?

louischan-oursky commented 11 months ago

What does the search SQL query look like? Do we support all the previous capabilities? For example, searching gmail.com should find all users whose email address ends with @gmail.com. How well does PostgreSQL handle documents in languages other than English, such as Chinese?

tung2744 commented 11 months ago

I found some performance issues during implementation, and I am still tuning the query and indexes. I will update here when I have a final query.

tung2744 commented 11 months ago

How well does PostgreSQL handle documents in languages other than English, such as Chinese?

Given the text 香港長沙灣長順街11號長城工廠大廈, run this SQL:

SELECT * FROM jsonb_to_tsvector('english', '{"address": "香港長沙灣長順街11號長城工廠大廈"}', '["string", "numeric"]'::jsonb);

You will see this result

'香港長沙灣長順街11號長城工廠大廈':1

And given the text "香港,長沙灣,長順街,11號 長城工廠大廈"
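
Run the same SQL:

SELECT * FROM jsonb_to_tsvector('english', '{"address": "香港,長沙灣,長順街,11號 長城工廠大廈"}', '["string", "numeric"]'::jsonb);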

You will see this result

'11號':4 '長城工廠大廈':5 '長沙灣':2 '長順街':3 '香港':1

We can observe that any run of consecutive Chinese characters is treated as a single word, but if there is punctuation in between, the text is split into multiple words.
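
A quick way to see the consequence (a hypothetical check, not from the implementation): searching for one "word" inside an unsegmented run cannot match, because the whole run is stored as a single lexeme.

SELECT to_tsvector('english', '香港長沙灣長順街11號長城工廠大廈')
       @@ websearch_to_tsquery('english', '長沙灣');
-- returns false: the tsvector contains only the entire run as one lexeme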

louischan-oursky commented 11 months ago

Elasticsearch offers icu_analyzer, which includes icu_tokenizer. The ICU tokenizer uses the Unicode text segmentation algorithm to split CJK text into words. We may need to replicate that logic in Golang: https://github.com/rivo/uniseg

tung2744 commented 11 months ago

Performance Test

Schema

CREATE TABLE _search_user
(
    id                            text PRIMARY KEY,
    app_id                        text                        NOT NULL,
    app_ids                       text[]
      GENERATED ALWAYS AS (ARRAY[app_id]) STORED,
    created_at                    timestamp without time zone NOT NULL,
    updated_at                    timestamp without time zone NOT NULL,
    last_login_at                 timestamp without time zone,
    is_disabled                   boolean                     NOT NULL,
    emails                        jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    email_local_parts             jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    email_domains                 jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    preferred_usernames           jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    phone_numbers                 jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    phone_number_country_codes    jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    phone_number_national_numbers jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    oauth_subject_ids             jsonb                       NOT NULL DEFAULT '[]'::jsonb,
    details                       jsonb                       NOT NULL DEFAULT '{}'::jsonb,
    details_tsvector              tsvector
      GENERATED ALWAYS AS (jsonb_to_tsvector('english', details, '["string", "numeric"]')) STORED
);
CREATE INDEX _search_user_app_id ON _search_user (app_id);
CREATE INDEX _search_user_app_id_created_at ON _search_user (app_id, created_at);
CREATE INDEX _search_user_app_id_last_login_at ON _search_user (app_id, last_login_at);
CREATE INDEX _search_user_gin ON _search_user USING GIN (
  app_ids,
  emails,
  email_domains,
  email_local_parts,
  preferred_usernames,
  phone_numbers,
  phone_number_country_codes,
  phone_number_national_numbers,
  oauth_subject_ids,
  details_tsvector);
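
To sanity-check the generated columns, here is a hypothetical example row (the id and details values are illustrative only):

INSERT INTO _search_user (id, app_id, created_at, updated_at, is_disabled, details)
VALUES ('user-1', 'app0', NOW(), NOW(), FALSE,
        '{"name": "Chan Tai Man", "address": "Sha Tin, NT"}');

SELECT app_ids, details_tsvector FROM _search_user WHERE id = 'user-1';
-- app_ids is {app0}; details_tsvector contains lexemes derived from the
-- details values, e.g. 'chan', 'tai', 'man', 'sha', 'tin'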

Data

Data generation script:

WITH id AS (
  SELECT generate_series(1,{{NUMBER_OF_ROWS}}) AS id
),
drows AS (
  SELECT
    id.id AS id,
    'app'::text || (id % 10)::text AS app_id,
    NOW() + make_interval(mins => id.id) AS created_at,
    NOW() + make_interval(mins => id.id) AS updated_at,
    CASE WHEN (id.id % 10 = 0) THEN NULL
      ELSE NOW() + make_interval(mins => id.id)
    END AS last_login_at,
    CASE WHEN (id.id % 10 = 1) THEN TRUE
      ELSE FALSE
    END AS is_disabled,
    CASE WHEN (id.id % 10 = 2) THEN '[]'
      ELSE '["' || ('user' || id.id::text || '@example.local') || '"]'
    END::jsonb AS emails,
    CASE WHEN (id.id % 10 = 2) THEN '[]'
      ELSE '["' || ('user' || id.id::text) || '"]'
    END::jsonb AS email_local_parts,
    CASE WHEN (id.id % 10 = 2) THEN '[]'
      ELSE '["example.local"]'
    END::jsonb AS email_domains,
    CASE WHEN (id.id % 10 = 3) THEN '[]'
      ELSE '["' || ('user' || id.id::text) || '"]'
    END::jsonb AS preferred_usernames,
    CASE WHEN (id.id % 10 = 4) THEN '[]'
      ELSE '["+852' || (40000000 + id.id) || '"]'
    END::jsonb AS phone_numbers,
    CASE WHEN (id.id % 10 = 4) THEN '[]'
      ELSE '["+852"]'
    END::jsonb AS phone_number_country_codes,
    CASE WHEN (id.id % 10 = 4) THEN '[]'
      ELSE '["' || (40000000 + id.id) || '"]'
    END::jsonb AS phone_number_national_numbers,
    CASE WHEN (id.id % 10 = 5) THEN '[]'
      ELSE '["oauth-' || (1000000000000 + id.id) || '"]'
    END::jsonb AS oauth_subject_ids,
    CASE WHEN id.id % 3 = 0 THEN format('{"name": "Chan Tai Man", "address": "Sha Tin, NT"}')
      WHEN id.id % 3 = 1 THEN format('{"name": "Wong Tai Sin", "address": "Tsim Sha Tsui, Kowloon"}')
      ELSE format('{"name": "Chi Sin", "address": "Causeway Bay, Hong Kong Island"}')
    END::jsonb AS details
  FROM id
)
INSERT INTO _search_user (
  id, app_id, created_at, updated_at, last_login_at,
  is_disabled, emails, email_local_parts, email_domains,
  preferred_usernames, phone_numbers,
  phone_number_country_codes, phone_number_national_numbers,
  oauth_subject_ids, details
) SELECT * FROM drows;
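
One extra step worth adding after the bulk load (my suggestion, not in the original script): refresh planner statistics so the EXPLAIN ANALYZE runs below see realistic plans.

ANALYZE _search_user;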

Query

Two queries were used in the test:

-- Q1: This query matches only a very small number of rows
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
  WHERE
    su.app_id = 'app0' AND (
      su.app_ids @> ARRAY['app0'] AND (
        su.details_tsvector @@ websearch_to_tsquery('english', 'user100050') OR
        su.emails ? 'user100050' OR
        su.email_domains ? 'user100050' OR
        su.email_local_parts ? 'user100050'
      )
    )
  ORDER BY su.created_at DESC
  LIMIT 100;

-- Q2: This query matches a lot of rows
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
  WHERE
    su.app_id = 'app0' AND (
      su.app_ids @> ARRAY['app0'] AND (
        su.details_tsvector @@ websearch_to_tsquery('english', 'example.local') OR
        su.emails ? 'example.local' OR
        su.email_domains ? 'example.local' OR
        su.email_local_parts ? 'example.local'
      )
    )
  ORDER BY su.created_at DESC
  LIMIT 100;

Results

  1. Test with 10M rows, 1M rows in each app (app0 - app9)

    • Q1: Very fast. The GIN index was used, and only a few rows matched.
    • Q2: Very slow. The GIN index was used, but since "example.local" matches a lot of rows, the result set is still large, so the query becomes very slow.
  2. Test with 1M rows, 100K rows in each app (app0 - app9)

    • Q1: Very fast. The GIN index was used, and only a few rows matched.
    • Q2: Locally, it returns in ~500ms, which seems acceptable for a search query.

Findings

Possible actions

  1. Keep the search database small (~1M rows). Consider using a separate database if an app has a lot of users (maybe > 10K).
  2. Ensure the GIN index matches only a small number of rows. This can be done by:
    • Removing the ability to search by email domains or phone number country codes, since many rows share the same value for these fields.
    • Requiring a minimum number of characters for free-text search, so that common words ("a", "the", "I", ...) are excluded.
  3. Partitioning _search_user by app_id (needs further study, but I think it should be possible; see the sketch below).
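
A minimal sketch of what option 3 could look like with declarative list partitioning (hypothetical and untested; note that on a partitioned table the primary key must include the partition key, so the schema above would need adjusting):

CREATE TABLE _search_user
(
    id     text NOT NULL,
    app_id text NOT NULL,
    -- ... remaining columns as in the schema above ...
    PRIMARY KEY (app_id, id)
) PARTITION BY LIST (app_id);

-- One partition per app, created when the app is provisioned:
CREATE TABLE _search_user_app0 PARTITION OF _search_user
    FOR VALUES IN ('app0');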

@louischan-oursky @fungc-io @chpapa

I've tried to do a performance test on searching with PostgreSQL, and found that there will be performance issues when the query result is large enough. We might need to decide on one of the approaches I listed in "Possible actions". Or maybe you have other suggestions?

tung2744 commented 11 months ago

Comparison with Elasticsearch

About Tokenizer

In Elasticsearch, we used the UAX URL Email Tokenizer for all searchable words. It is based on the standard tokenizer, which uses the Unicode Text Segmentation algorithm.

Unicode Text Segmentation

Trigram (pg_trgm)

This is similar to the N-Gram Tokenizer of Elasticsearch.

I didn't use trigrams to implement the search because our original search does not use an n-gram tokenizer. They might be useful if we want to search and sort users by similarity score.
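
For illustration, a hypothetical pg_trgm query (not part of the implementation; the index definition and the cast of the jsonb column to text are assumptions of this sketch):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram index makes the % similarity operator below indexable:
CREATE INDEX _search_user_emails_trgm ON _search_user
  USING GIN ((emails::text) gin_trgm_ops);

-- Rank users by trigram similarity to the search term:
SELECT id, similarity(emails::text, 'user100050@example.local') AS score
FROM _search_user
WHERE emails::text % 'user100050@example.local'
ORDER BY score DESC
LIMIT 100;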

Performance

As it is very time-consuming to generate test data in Elasticsearch, I finally only ran a test with 1M rows sharing the same app_id.

This test was performed with a single-node Elasticsearch running locally.

Findings

When testing with a query that matches about ~80k rows:
