Hi,
I am starting to use this library and came across those potential glitches.
I would expect cosine, jaro and jarowinkler to return exactly 1.0 on equal strings, but I get the following (jarowinkler omitted, but shows the same behaviour as jaro on this):
server_encoding | UTF8
client_encoding | UTF8
pg_similarity.cosine_is_normalized | on
pg_similarity.jaro_is_normalized | on
cl_generated_1000=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.9 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
# set pg_similarity.cosine_tokenizer = 'alnum';
SET
# set pg_similarity.jaro_tokenizer = 'alnum';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | t | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'word';
SET
# set pg_similarity.jaro_tokenizer = 'word';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | f | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'gram';
SET
# set pg_similarity.jaro_tokenizer = 'gram';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
f | f | t | t
(1 row)
# set pg_similarity.cosine_tokenizer = 'camelcase';
SET
# set pg_similarity.jaro_tokenizer = 'camelcase';
SET
# select cosine('Michael C', 'Michael C') < 1.,
cosine('Brésil', 'Brésil') < 1.,
jaro('Stratégie Internationale', 'Stratégie Internationale') < 1.,
jaro('http://example.org/Annecy', 'http://example.org/Annecy') < 1.;
?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
t | f | t | t
(1 row)
Hi, I am starting to use this library and came across those potential glitches. I would expect cosine, jaro and jarowinkler to return exactly 1.0 on equal strings, but I get the following (jarowinkler omitted, but shows the same behaviour as jaro on this):