tung2744 opened 1 year ago
I would like to add a config to authgear.yaml to configure the search implementation:

```yaml
search:
  implementation: "elasticsearch" # or "database"
```
And a secret config for the search database:

```yaml
secrets:
- data:
    database_schema: public
    database_url: postgres://postgres:postgres@127.0.0.1:5432/postgres?sslmode=disable
  key: search.db
```
Practically, maybe it is the same database as auth.
@louischan-oursky Do you have any opinion?
What does the search SQL query look like? Do we support all the previous capabilities? For example, searching gmail.com should find all users with any email address ending in @gmail.com. How well does PostgreSQL handle documents in languages other than English, like Chinese?
I found there are performance issues during implementation, and I am still tuning the query & index. Let me update here when I have a final query.
> How well does PostgreSQL handle documents in languages other than English, like Chinese?

Given the text 香港長沙灣長順街11號長城工廠大廈, run this SQL:

```sql
SELECT * FROM jsonb_to_tsvector('english', '{"address": "香港長沙灣長順街11號長城工廠大廈"}', '["string", "numeric"]'::jsonb);
```

You will see this result:

```
'香港長沙灣長順街11號長城工廠大廈':1
```

And given the text "香港,長沙灣,長順街,11號 長城工廠大廈", you will see this result:

```
'11號':4 '長城工廠大廈':5 '長沙灣':2 '長順街':3 '香港':1
```
We can observe that any run of consecutive Chinese characters is treated as a single word, but if there is punctuation or whitespace in between, the text is split into multiple words.
'english' is the text search configuration. Running \dF with psql lists the available configurations. On my local instance they are:
```
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
```
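The configuration mainly controls stemming and stop-word removal. For illustration (a quick example, not from the test above):

```sql
-- 'english' applies Snowball stemming; 'simple' only lowercases tokens.
SELECT to_tsvector('english', 'Running quickly');  -- => 'quick':2 'run':1
SELECT to_tsvector('simple', 'Running quickly');   -- => 'quickly':2 'running':1
```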
Elasticsearch offers icu_analyzer, which includes icu_tokenizer. The ICU tokenizer uses the Unicode text segmentation algorithm to split CJK text into words. We may need to replicate that logic in Golang, e.g. with https://github.com/rivo/uniseg.
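As a rough sketch of that approach: if the application pre-segments CJK text and joins the words with spaces before building the document, the built-in parser can then split it, as the punctuated example above already shows:

```sql
-- Hypothetical: assume the Go side has already segmented the address
-- into words (e.g. with rivo/uniseg) and joined them with spaces.
SELECT jsonb_to_tsvector('english',
    '{"address": "香港 長沙灣 長順街 11號 長城工廠大廈"}',
    '["string", "numeric"]'::jsonb);
-- => '11號':4 '長城工廠大廈':5 '長沙灣':2 '長順街':3 '香港':1
```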
```sql
CREATE TABLE _search_user
(
    id                            text PRIMARY KEY,
    app_id                        text NOT NULL,
    app_ids                       text[]
        GENERATED ALWAYS AS (ARRAY[app_id]) STORED,
    created_at                    timestamp without time zone NOT NULL,
    updated_at                    timestamp without time zone NOT NULL,
    last_login_at                 timestamp without time zone,
    is_disabled                   boolean NOT NULL,
    emails                        jsonb NOT NULL DEFAULT '[]'::jsonb,
    email_local_parts             jsonb NOT NULL DEFAULT '[]'::jsonb,
    email_domains                 jsonb NOT NULL DEFAULT '[]'::jsonb,
    preferred_usernames           jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_numbers                 jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_number_country_codes    jsonb NOT NULL DEFAULT '[]'::jsonb,
    phone_number_national_numbers jsonb NOT NULL DEFAULT '[]'::jsonb,
    oauth_subject_ids             jsonb NOT NULL DEFAULT '[]'::jsonb,
    details                       jsonb NOT NULL DEFAULT '{}'::jsonb,
    details_tsvector              tsvector
        GENERATED ALWAYS AS (jsonb_to_tsvector('english', details, '["string", "numeric"]')) STORED
);
```
```sql
CREATE INDEX _search_user_app_id ON _search_user (app_id);
CREATE INDEX _search_user_app_id_created_at ON _search_user (app_id, created_at);
CREATE INDEX _search_user_app_id_last_login_at ON _search_user (app_id, last_login_at);
CREATE INDEX _search_user_gin ON _search_user USING GIN (
    app_ids,
    emails,
    email_domains,
    email_local_parts,
    preferred_usernames,
    phone_numbers,
    phone_number_country_codes,
    phone_number_national_numbers,
    oauth_subject_ids,
    details_tsvector);
```
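For illustration, the generated details_tsvector column holds the lowercased, stemmed tokens of every string and numeric value in details. For one of the sample documents used in the test data below:

```sql
SELECT jsonb_to_tsvector('english',
    '{"name": "Chan Tai Man", "address": "Sha Tin, NT"}'::jsonb,
    '["string", "numeric"]'::jsonb);
-- Produces tokens like 'chan', 'tai', 'man', 'sha', 'tin', 'nt'.
```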
Data generation script:

```sql
-- {{NUMBER_OF_ROWS}} is a placeholder, e.g. 10000000 for the 10M-row test.
WITH id AS (
    SELECT generate_series(1, {{NUMBER_OF_ROWS}}) AS id
),
drows AS (
    SELECT
        id.id AS id,
        'app'::text || (id.id % 10)::text AS app_id,
        NOW() + make_interval(mins => id.id) AS created_at,
        NOW() + make_interval(mins => id.id) AS updated_at,
        CASE WHEN (id.id % 10 = 0) THEN NULL
             ELSE NOW() + make_interval(mins => id.id)
        END AS last_login_at,
        CASE WHEN (id.id % 10 = 1) THEN TRUE
             ELSE FALSE
        END AS is_disabled,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["' || ('user' || id.id::text || '@example.local') || '"]'
        END::jsonb AS emails,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["' || ('user' || id.id::text) || '"]'
        END::jsonb AS email_local_parts,
        CASE WHEN (id.id % 10 = 2) THEN '[]'
             ELSE '["example.local"]'
        END::jsonb AS email_domains,
        CASE WHEN (id.id % 10 = 3) THEN '[]'
             ELSE '["' || ('user' || id.id::text) || '"]'
        END::jsonb AS preferred_usernames,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["+852' || (40000000 + id.id) || '"]'
        END::jsonb AS phone_numbers,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["+852"]'
        END::jsonb AS phone_number_country_codes,
        CASE WHEN (id.id % 10 = 4) THEN '[]'
             ELSE '["' || (40000000 + id.id) || '"]'
        END::jsonb AS phone_number_national_numbers,
        CASE WHEN (id.id % 10 = 5) THEN '[]'
             ELSE '["oauth-' || (1000000000000 + id.id) || '"]'
        END::jsonb AS oauth_subject_ids,
        CASE WHEN id.id % 3 = 0 THEN '{"name": "Chan Tai Man", "address": "Sha Tin, NT"}'
             WHEN id.id % 3 = 1 THEN '{"name": "Wong Tai Sin", "address": "Tsim Sha Tsui, Kowloon"}'
             ELSE '{"name": "Chi Sin", "address": "Causeway Bay, Hong Kong Island"}'
        END::jsonb AS details
    FROM id
)
INSERT INTO _search_user (
    id, app_id, created_at, updated_at, last_login_at,
    is_disabled, emails, email_local_parts, email_domains,
    preferred_usernames, phone_numbers,
    phone_number_country_codes, phone_number_national_numbers,
    oauth_subject_ids, details
) SELECT * FROM drows;
```
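A hypothetical sanity check on the generated data (not part of the original script):

```sql
-- Each of app0 - app9 should hold an equal share of the rows.
SELECT app_id, COUNT(*) FROM _search_user GROUP BY app_id ORDER BY app_id;
```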
Two queries were used for testing:

```sql
-- Q1: This query matches only a very small number of rows.
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND (
        su.app_ids @> ARRAY['app0'] AND (
            su.details_tsvector @@ websearch_to_tsquery('english', 'user100050') OR
            su.emails ? 'user100050' OR
            su.email_domains ? 'user100050' OR
            su.email_local_parts ? 'user100050'
        )
    )
ORDER BY su.created_at DESC
LIMIT 100;
```

```sql
-- Q2: This query matches a lot of rows.
EXPLAIN ANALYZE SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND (
        su.app_ids @> ARRAY['app0'] AND (
            su.details_tsvector @@ websearch_to_tsquery('english', 'example.local') OR
            su.emails ? 'example.local' OR
            su.email_domains ? 'example.local' OR
            su.email_local_parts ? 'example.local'
        )
    )
ORDER BY su.created_at DESC
LIMIT 100;
```
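Relating this back to the earlier question about domain search: under this schema, searching gmail.com would match users whose email address ends with @gmail.com through the email_domains column. A hypothetical query in the same shape as Q1/Q2:

```sql
-- Hypothetical: find users whose email address ends with @gmail.com,
-- served by the GIN index on the email_domains jsonb array.
SELECT * FROM _search_user AS su
WHERE
    su.app_id = 'app0' AND
    su.app_ids @> ARRAY['app0'] AND
    su.email_domains ? 'gmail.com'
ORDER BY su.created_at DESC
LIMIT 100;
```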
Test with 10M rows, 1M rows in each app (app0 - app9): "example.local" matches a lot of rows; the result set is still large, therefore the query becomes very slow.

Test with 1M rows, 100K rows in each app (app0 - app9): the query is fast as long as LIMIT is small and OFFSET is not large.

@louischan-oursky @fungc-io @chpapa
I've tried to do a performance test on searching with PostgreSQL, and found that there will be performance issues when the query result is large enough. We might need to decide on one of the approaches I listed in "Possible actions". Or maybe see if you have other suggestions?
In Elasticsearch, we used the UAX URL Email Tokenizer for any searchable word. It is a tokenizer based on the standard tokenizer, which uses the Unicode Text Segmentation algorithm.
This is similar to the N-Gram Tokenizer of Elasticsearch. I didn't use trigrams to implement the search because our original search does not use an n-gram tokenizer. It might be useful if we want to search and sort users by similarity score.
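A minimal sketch of what trigram-based search could look like, assuming the pg_trgm extension (not part of the current implementation):

```sql
-- Hypothetical: rank users by trigram similarity to the search keyword.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT id, similarity(emails::text, 'user10005') AS score
FROM _search_user
WHERE emails::text % 'user10005'  -- % is the pg_trgm similarity operator
ORDER BY score DESC
LIMIT 100;
```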
As it is very time-consuming to generate test data in Elasticsearch, in the end I only made a test with 1M rows sharing the same app_id. This test was performed with a single-node Elasticsearch running locally.

When testing with a query which matches about ~80k rows: results can only be paginated up to a limit (the from parameter), which is by default 10000, and the reported total hit count is capped at the same default. "track_total_hits": true can force it to return an accurate count; the impact on performance is not observable in this test.
Problem

To reduce the deployment requirements, consider removing Elasticsearch as a dependency of Authgear.
Currently Elasticsearch is used by the search function of the Admin API. See if PostgreSQL can implement the same features.
Appetite
2 weeks
Todo
See: https://www.postgresql.org/docs/current/datatype-textsearch.html
- Add a _search_user table:
  - id: text
  - app_id: text
  - document: tsvector. This column should be the concatenation of all the columns that the current Elasticsearch query searches.
- Implement Service.ReindexUser to upsert / delete rows in _search_user.
- Implement Service.QueryUser to convert the search keyword into a tsquery.
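For the keyword-to-tsquery conversion, the built-in websearch_to_tsquery is one candidate building block (a sketch; what Service.QueryUser will actually do is still to be decided):

```sql
-- websearch_to_tsquery accepts free-form user input and never raises
-- a syntax error, unlike to_tsquery.
SELECT websearch_to_tsquery('english', 'chan tai man');
-- => 'chan' & 'tai' & 'man'
```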