GoogleCloudPlatform / bigquery-utils

Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.
https://cloud.google.com/bigquery/
Apache License 2.0
1.11k stars 277 forks source link

Bug: Jaccard UDF giving incorrect results in some cases #383

Closed j-svensmark closed 3 months ago

j-svensmark commented 1 year ago

I found a bug in the the Jaccard distance function https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/jaccard.sqlx (the function below is a copy of the Jaccard UDF, cast as a TEMP FUNCTION)

CREATE TEMP FUNCTION Jaccard(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
  let intersectSize = 0;

  if(a && b) {
    // Split word into characters, and make characters unique
    const sa = [...new Set(String(a).split(''))]
    const sb = [...new Set(String(b).split(''))]

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

    for(let i = 0; i < la; i++) {
      for(let j = 0; j < lb; j++) {
        intersectSize += (sa[i] === sb[j]) ? 1 : 0
      }
    }

    // Compute Jaccard distance
    const union = (sa.length + sb.length) - intersectSize
    return parseFloat(((intersectSize / union) * 100 + Number.EPSILON) / 100).toFixed(2)
  } else {
    return 0.0
  }
""";

with datas as (
select 
'12' as a,
'132' as b
)

select
 Jaccard(a, b)
from datas

This should give 2/3 = 0.667 (since here there the intersection is size 2, and the union is size 3), but instead gives 0.25.

I believe the bug can be fixed by replacing

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

with

    const la = sa.length
    const lb = sb.length

With the current version of these lengths, the array indices become out of bounds.

j-svensmark commented 1 year ago

It might be possible to further optimize and generalize this function to arrays of any types, rather than just strings by casting it in SQL

create temp function `Jaccard`(a ANY TYPE, b ANY TYPE)
returns FLOAT64 as (
  (select count(distinct agrp) from unnest(a) as agrp inner join unnest(b) as bgrp on agrp = bgrp)/
  (select count(1) from (select * from unnest(a) union distinct select * from unnest(b)))
);

with datas as (
select 
['1', '2'] as a,
['1', '3', '2'] as b,
[1, 2] as a_int,
[1, 3, 2] as b_int
)

select
 Jaccard(a, b),
 Jaccard(a_int, b_int)
from datas

(See https://stackoverflow.com/a/36705483)