GoogleCloudPlatform / bigquery-utils

Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.
https://cloud.google.com/bigquery/
Apache License 2.0

add support for datasketches js udafs using wasm libraries #423

Closed nikunjbhartia closed 2 weeks ago

nikunjbhartia commented 2 weeks ago

Adding support for 3 DataSketches sketch types: 1) Theta sketch, 2) KLL sketch, 3) Tuple sketch.

Total functions:

Solution approach: The DataSketches C++ libraries are compiled to WASM using the Emscripten toolchain and loaded in BigQuery JS UDAFs. The build file uses the docker.io emscripten image for the compilation step.
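For readers unfamiliar with the UDAF shape this approach plugs into, here is a minimal sketch of the four functions a BigQuery JS UDAF body defines (`initialState`, `aggregate`, `merge`, `finalize`). It is not the code from this PR: the distinct-value array is a hypothetical stand-in for a WASM-backed sketch object, and `runUdaf` is an illustrative local driver that only imitates how partial states get merged. In an actual UDAF body the four functions would be exported rather than called directly.

```javascript
// Hypothetical stand-in for a sketch: a plain array of distinct values.
// The state is kept as plain data (no Set) so it stays trivially serializable.
function initialState() {
  return { values: [] };
}

// Fold one input row into the running state, ignoring NULLs.
function aggregate(state, x) {
  if (x !== null && !state.values.includes(x)) {
    state.values.push(x);
  }
}

// Combine a partial state from another worker into this state.
function merge(state, partialState) {
  for (const v of partialState.values) {
    aggregate(state, v);
  }
}

// Produce the final result from the fully merged state.
function finalize(state) {
  return state.values.length;
}

// Illustrative driver (an assumption, not a BigQuery API): aggregates each
// partition separately, then merges the partial states and finalizes.
function runUdaf(partitions) {
  const merged = initialState();
  for (const rows of partitions) {
    const partial = initialState();
    for (const row of rows) {
      aggregate(partial, row);
    }
    merge(merged, partial);
  }
  return finalize(merged);
}

const distinctCount = runUdaf([[1, 2, 2], [2, 3, null], []]);
console.log(distinctCount);
```

In the WASM-based version, the state would presumably hold a serialized sketch instead of a value array, with `aggregate` and `merge` delegating to the compiled library.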

Note: Unit tests are currently executed using dataform test, which does not support UDAFs or constant arguments. Hence, I added test cases for the 7 scalar functions only.

There is also a datasketches-cpp directory, which could be added as a submodule. I have committed all the files for now; let's discuss during review.

danieldeleo commented 2 weeks ago

I've submitted https://github.com/GoogleCloudPlatform/bigquery-utils/pull/425 to enable unit testing for UDAFs. Once it's merged, please pull the changes and add the remaining 9 test cases for the UDAFs.

nikunjbhartia commented 2 weeks ago

Thanks for adding support for UDAF unit testing. The current framework uses positional args, which fails for functions that have const args. We need to add support for passing constant args, which are defined as "NOT AGGREGATE" in the UDAF function signature.

I have added a unit test for the 1 UDAF that does not have any const arg in its signature. The remaining 8 UDAFs have const args, so their tests can't be added yet.

danieldeleo commented 2 weeks ago

@nikunjbhartia you should be good to go for the remaining 8 UDAFs now that https://github.com/GoogleCloudPlatform/bigquery-utils/pull/426 is merged.

nikunjbhartia commented 2 weeks ago

Thanks @danieldeleo for the quick fix. Made the following updates:

danieldeleo commented 2 weeks ago

/gcbrun