eulerto / pg_similarity

set of functions and operators for executing similarity queries
BSD 3-Clause "New" or "Revised" License
365 stars 40 forks source link

Querying a table with jaccard similarity with 1.6 million records take 12 seconds #32

Open crbsram opened 3 years ago

crbsram commented 3 years ago

Hi Team

We have encountered a problem in our testing environment. I have a scenario where I am running a similarity match for an address I have created a table with following number of records 1603423 And I have created GIN index (since i am using pg_similarity) library jaccard similarity method when I run the Explain analyze EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad 201017' order by qsim DESC; QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual time=12101.194..12101.197 rows=6 loops=1) Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)) DESC Sort Method: quicksort Memory: 25kB -> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93) (actual time=3516.233..12101.172 rows=6 loops=1) Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text) Rows Removed by Index Recheck: 1039186 Heap Blocks: exact=58018 -> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02 rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1) Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text) Planning Time: 0.141 ms Execution Time: 12101.245 ms (11 rows) it took 12 seconds following are my pgconf file settings shared buffer as 4GB work_mem 256 MB maintenence_work_mem 512MB autovacuum_work_mem 20MB My index definition is this "address_complete_address_idx1" gin (complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I really don't understand.

Please help.

Thanks C.R.Bala

eulerto commented 3 years ago

Your WHERE clause has low selectivity (1039192 of 1603423 rows), hence, Bitmap Heap Scan has to read and discard more than 1 million rows (actual time=3516.233..12101.172).

If you add BUFFERS as an EXPLAIN option, it shows how many rows pages it is reading.

crbsram commented 3 years ago

Hi Eulerto

Thanks for your reply.

Please find the stats as given below: dedupechecksample=# EXPLAIN (ANALYZE, BUFFERS) select complete_address, jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad 201017' order by qsim DESC; QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual time=12023.801..12023.803 rows=0 loops=1) Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)) DESC Sort Method: quicksort Memory: 25kB Buffers: shared hit=58306 -> Bitmap Heap Scan on address20 (cost=172.43..5770.80 rows=1603 width=93) (actual time=12023.793..12023.794 rows=0 loops=1) Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text) Rows Removed by Index Recheck: 1039192 Heap Blocks: exact=58018 Buffers: shared hit=58306 -> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02 rows=1603 width=0) (actual time=260.443..260.443 rows=1039192 loops=1) Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text) Buffers: shared hit=288 Planning: Buffers: shared hit=1 Planning Time: 0.129 ms Execution Time: 12025.412 ms (16 rows)

Am I missing anything here? If I need to improve the performance any other settings has to be done. Please help here. If this can't be improved they are thinking of alternatives to use faiss library.

Thanks C.R.Bala