eulerto / pg_similarity

set of functions and operators for executing similarity queries
BSD 3-Clause "New" or "Revised" License
363 stars 39 forks source link

cosine and jaro on perfect matches #22

Open jleblay opened 5 years ago

jleblay commented 5 years ago

Hi, I am starting to use this library and came across those potential glitches. I would expect cosine, jaro and jarowinkler to return exactly 1.0 on equal strings, but I get the following (jarowinkler omitted, but shows the same behaviour as jaro on this):

server_encoding                         | UTF8 
client_encoding                         | UTF8
pg_similarity.cosine_is_normalized      | on
pg_similarity.jaro_is_normalized        | on

cl_generated_1000=# select version();
                                                   version                                                    
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.9 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
# set pg_similarity.cosine_tokenizer = 'alnum';
SET
# set pg_similarity.jaro_tokenizer = 'alnum';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'word';
SET
# set pg_similarity.jaro_tokenizer = 'word';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'gram';
SET
# set pg_similarity.jaro_tokenizer = 'gram';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 f        | f        | t        | t
(1 row)

# set pg_similarity.cosine_tokenizer = 'camelcase';
SET
# set pg_similarity.jaro_tokenizer = 'camelcase';
SET
# select cosine('Michael C', 'Michael C') < 1.,
         cosine('Brésil', 'Brésil') < 1.,
         jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
         jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | f        | t        | t
(1 row)
eulerto commented 5 years ago

All pg_similarity functions return double precision. double precision is inexact [1]. Use a cast like:

select cast(cosine('Michael C', 'Michael C') as numeric(3,2)) < 1.0;