Closed kkartaltepe closed 7 years ago
3 Characters or more sounds good with me
What if you want to search for episode numbers? Also how is the limit determined? utf-8 bytes? unicode codepoints?
unicode codepoints is my understanding.
As for the episode numbers there is no need to limit every search term to 3 or more characters, just to prevent searches of only less than 3 characters.
You can play around with the explain plans yourself to see how things break down just to double check.
-- edit although if you want to support 'or' queries you may want to restrict each 'or' clause to containing at least one term of 3 characters.
Due to how search works searching for "some-anime 04" will also trigger the performance problem with trigrams
(SELECT * FROM torrents WHERE torrent_name LIKE '%some-anime%' AND torrent_name LIKE '%04%'
)
You know I learn something every day. Unfortunately it appears postgres makes some wrong assumptions with respect to GIN indices and will try to merge all predicates into one type of scan (index or table) rather than considering separating the predicates into different nodes. (Should be noted that the sequential scan is in fact much faster than the index only scan on the combined predicate so at least is partially right!)
You can force an index scan followed by a filter which is what I expected it to chose with something like
with q as (select * from torrents where torrent_name like '%some-anime%')
select * from q where torrent_name like '%04%' and torrent_name ...
or apparently
select * from torrents
where torrent_name like '%some-anime%' and (torrent_name||'') like '%04%' and (torrent_name||') ...
Pretty ugly, but it appears to be the only way to make up for the query planner. Maybe someone can be woo'ed to improve the query planner for pg11.
Is it still a thing? Since we use ES now
If you are no longer using postgres indices for accelerating these searches then this is probably better off closed.
You should probably disable searches for 1-2 characters as it cannot utilize the trigram index meaning its a vector for abuse and arguably useless.
Only consideration I can see is CJK users doing searchs for 1-2 characters but hopefully its not too restrictive for them.