Closed by j-svensmark 3 months ago
It might be possible to further optimize this function and generalize it to arrays of any type, rather than just strings, by writing it in SQL:
create temp function `Jaccard`(a ANY TYPE, b ANY TYPE)
returns FLOAT64 as (
  -- size of the intersection: distinct elements present in both arrays
  (select count(distinct agrp) from unnest(a) as agrp inner join unnest(b) as bgrp on agrp = bgrp) /
  -- size of the union: distinct elements present in either array
  (select count(1) from (select * from unnest(a) union distinct select * from unnest(b)))
);

with datas as (
  select
    ['1', '2'] as a,
    ['1', '3', '2'] as b,
    [1, 2] as a_int,
    [1, 3, 2] as b_int
)
select
  Jaccard(a, b),
  Jaccard(a_int, b_int)
from datas
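
As a cross-check outside BigQuery, the same computation can be sketched in Python. This is only a reference sketch, not part of the UDF: the function name `jaccard` is made up here, and it assumes the elements are hashable and that duplicates are ignored, as the `count(distinct ...)` and `union distinct` in the SQL above do. Like the SQL version, it fails with a division-by-zero error when both inputs are empty.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct elements of a and b."""
    set_a, set_b = set(a), set(b)
    # Intersection size divided by union size; raises ZeroDivisionError
    # if both inputs are empty, since the union then has size 0.
    return len(set_a & set_b) / len(set_a | set_b)


# Same sample arrays as in the query above.
print(jaccard(['1', '2'], ['1', '3', '2']))  # 0.6666666666666666
print(jaccard([1, 2], [1, 3, 2]))            # 0.6666666666666666
```

Both calls share 2 elements out of 3 distinct elements overall, so a correct implementation should return 2/3 for either element type.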
I found a bug in the Jaccard distance function https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/jaccard.sqlx (the function below is a copy of the Jaccard UDF, rewritten as a TEMP FUNCTION).
This should give 2/3 ≈ 0.667 (since the intersection has size 2 and the union has size 3), but instead gives 0.25.
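
For reference, the expected value follows from the usual definition of the Jaccard index (similarity). The sets A and B below are illustrative assumptions, chosen only to match the stated intersection and union sizes. They are not necessarily the inputs used in the report.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
A = \{1, 2\},\; B = \{1, 2, 3\} \;\Rightarrow\;
J(A, B) = \frac{|\{1, 2\}|}{|\{1, 2, 3\}|} = \frac{2}{3} \approx 0.667
$$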
I believe the bug can be fixed by replacing
with
With the current version of these lengths, the array indices become out of bounds.