dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0
30 stars 21 forks source link

Low cardinality auto indexing #142

Open EricSoroos opened 5 months ago

EricSoroos commented 5 months ago

Describe the bug

Auto index generation on cardinality is backwards.

https://github.com/dathere/datapusher-plus/blob/master/datapusher/jobs.py#L1728

    # if a column's cardinality <= AUTO_INDEX_THRESHOLD, create an index for that column

Low cardinality indexes on postgresql aren't useful, because the planner will almost never choose them.

The planner will choose a table scan instead of an index scan if it is likely that many/most pages will be read. When you have a low cardinality index, eg a bool, the planner is going to assume that 1/2 of rows will be hit, and therefore essentially all pages. Even if you've got a distribution that's really skewed, the planner will generally not choose that index. (if you do have a known skewed distribution, you can do a partial/multicolumn index where the condition is the rare case, but that's not really applicable for a general purpose tool. )

This should be reversed, and set the default threshold to ~10 at least.

jqnatividad commented 5 months ago

I went for that heuristic as I was optimizing for accelerated aggregations - (e.g. Borough column only has 5 values - Manhattan,Bronx,Brooklyn,Queens,Staten Island...), and for filtering with the DataTables view Searchbuilder, where you have an interactive filter builder and response time is important.

Maybe, I can tweak the heuristic so that if the value is negative and less than -1, then the threshold is interpreted as greater than the absolute value?

Alternatively, I can add another set of AutoIndex settings that can be specified in the dotenv template to eval() an expression that DP+ maintainers can set to specify their own Autoindex heuristics?

EricSoroos commented 5 months ago

It's possible what you want is 5 and up -- that's getting to the point that the planner might choose to use the index, and that's something that an explain analyze would tell you. Right now I'm seeing it create indexes when there's one value text value, which is truly useless. (and in my case, it's a 13mb index + a 38mb text index)

EricSoroos commented 5 months ago

As a side note -- I'm not sure it's worth specifically creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added.

jqnatividad commented 5 months ago

It's possible what you want is 5 and up -- that's getting to the point that the planner might choose to use the index, and that's something that an explain analyze would tell you. Right now I'm seeing it create indexes when there's one value text value, which is truly useless. (and in my case, it's a 13mb index + a 38mb text index)

Yes. For that example, I would set the setting to 5 in the dotenv template. I'll change the default from 3 to 10 as you suggested as that setting is too low. I'll also add a minimum with a default of 3.

jqnatividad commented 5 months ago

As a side note -- I'm not sure it's worth specifically creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added.

Yes. That was what I was going for - to prevent erroneous upserts.