ADBond / splinkclickhouse

Allows Clickhouse to be used as the execution engine for Splink
MIT License

Bug - count star workaround + Feature pandas frames for ClickhouseAPI #18

Closed: ADBond closed this 2 months ago

ADBond commented 2 months ago

This does two separate things, both of which are needed to pass a new test (graph metrics). Apologies for not disentangling them.

Count(*)

Due to this issue, count(*) with a filter is not parsed correctly by Clickhouse. Splink uses such an expression in its graph metric calculations, which led to an error. As a workaround, we intercept and rewrite the SQL before it is executed.
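To illustrate the kind of rewrite involved, here is a minimal sketch. It is not the code from this PR: it assumes the filter is written as count(*) FILTER (WHERE ...) and swaps in Clickhouse's countIf aggregate, and the function name rewrite_count_star_filter is hypothetical. It also does not account for parentheses inside string literals.

```python
import re

# Start of `count(*) FILTER (WHERE`, case-insensitive, with flexible whitespace.
_COUNT_STAR_FILTER = re.compile(
    r"count\s*\(\s*\*\s*\)\s*filter\s*\(\s*where\b", re.IGNORECASE
)


def rewrite_count_star_filter(sql: str) -> str:
    """Rewrite `count(*) FILTER (WHERE cond)` to `countIf(cond)`.

    Hypothetical helper for illustration only. Parentheses inside
    string literals are not handled.
    """
    out = []
    pos = 0
    while True:
        match = _COUNT_STAR_FILTER.search(sql, pos)
        if match is None:
            out.append(sql[pos:])
            break
        out.append(sql[pos:match.start()])
        # Walk forward to the parenthesis that closes `FILTER (`.
        depth = 1
        i = match.end()
        while i < len(sql) and depth > 0:
            if sql[i] == "(":
                depth += 1
            elif sql[i] == ")":
                depth -= 1
            i += 1
        if depth != 0:
            # Unbalanced parentheses: leave the remainder untouched.
            out.append(sql[match.start():])
            break
        condition = sql[match.end():i - 1].strip()
        out.append(f"countIf({condition})")
        pos = i
    return "".join(out)


if __name__ == "__main__":
    print(rewrite_count_star_filter(
        "SELECT count(*) FILTER (WHERE a = 1 AND (b > 2)) AS n FROM t"
    ))
```

A plain count(*) without a filter passes through unchanged, so a rewrite like this can be applied to every statement before it is sent to the backend.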

ClickhouseAPI pandas ingestion

The other issue with graph metrics (for ClickhouseAPI but not ChDBAPI) is that we directly register a pandas dataframe with the API, which was previously disallowed.

Now we support this by means of a helper function which creates a table based on the pandas typing information. It currently accepts only string and integer column types. This will probably not expand much; more complex typing situations are left to the user, who can either pre-register the table or use another table input format.
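As a rough sketch of what such a helper could look like (the names, the type mapping, and the table engine below are assumptions for illustration, not the code in this PR), the column dtypes can be mapped to Clickhouse types and assembled into a CREATE TABLE statement. To keep the example dependency-free it takes dtype names as strings, which for a frame df could be obtained with df.dtypes.astype(str).to_dict(). It also assumes that an object column holds strings.

```python
def clickhouse_column_type(dtype_name: str) -> str:
    """Map a pandas dtype name to a Clickhouse column type.

    Hypothetical mapping: only string and integer types are supported.
    """
    name = dtype_name.lower()
    if name.startswith("uint"):
        return "Nullable(UInt64)"
    if name.startswith("int"):
        return "Nullable(Int64)"
    # Assumption: `object` columns contain strings.
    if name == "object" or name == "str" or name.startswith("string"):
        return "Nullable(String)"
    raise TypeError(
        f"Unsupported dtype '{dtype_name}': register the table yourself "
        "or use another input format"
    )


def create_table_sql(table_name: str, dtypes: dict) -> str:
    """Build a CREATE TABLE statement from a column -> dtype-name mapping."""
    columns = ", ".join(
        f"`{column}` {clickhouse_column_type(dtype_name)}"
        for column, dtype_name in dtypes.items()
    )
    # The engine and ordering key are placeholder choices for the sketch.
    return (
        f"CREATE TABLE {table_name} ({columns}) "
        "ENGINE = MergeTree ORDER BY tuple()"
    )


if __name__ == "__main__":
    # With pandas available: dtypes = df.dtypes.astype(str).to_dict()
    dtypes = {"unique_id": "int64", "first_name": "object"}
    print(create_table_sql("input_nodes", dtypes))
```

Raising on anything outside the small supported set keeps the failure explicit, which fits the intent of leaving more complex typing to the user.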

CI

In for a penny: this also does a third thing, namely correcting the trigger path for the type-hinting CI (see #8).