Overly aggressive indexing strategy greatly increases datastore storage requirements

jqnatividad commented 3 years ago

CKAN version 2.9.1

Describe the bug The postgres datastore creates several indices when a resource is uploaded.

A unique index for the _id internal field, an FTS index using the _full_text internal field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.

Though this is great for ad-hoc queries using datastore_search and datastore_search_sql, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.

For example, a table with 1.37 million rows and 9 columns is:

60 mb as a csv file
302 mb for the database table
569 mb for the 8 indices (unique index for _id, 1 fts index, and 6 indices for the 6 text columns. It did not create indices for the one timestamp and two numeric columns)

I got this info while testing the expanded datastore_info PR.

Expected behavior The team should consider giving finer granular control to the CKAN administrator beyond specifying the index method.

Some indexing "knobs" to consider:

turn on/off FTS index
leverage the data dictionary to explicitly declare columns that should be indexed. As an added benefit, this should also allow indices for non-text columns
or alternatively, instead of creating a mini DBA interface on CKAN (a non-trivial task), just read the postgres system catalog and selectively show some indexing info on the data dictionary (leveraging the expanded datastore_info), and let the DB/CKAN administrator use robust tools like PgAdmin to manage datastore indices manually. This is consistent with the keeping the Datastore API "as thin as possible to allow you to use the features you would expect from a powerful database management system."

wardi commented 3 years ago

:+1: to using the data dictionary interface to enable indexes and having the per-column indexes disabled by default

wardi commented 1 week ago

Let's:

Update datastore_create to default to not add any indexes when the index list of fields is empty or not provided.
Show which fields are indexed in the data dictionary and make it possible to add or remove indexes for field types that support indexing. Future: we might want to be able to select which columns are included in the full text search index here too.
For large datasets creating indexes will take a long time, we should use a background job to run the datastore_create call so that the web UI doesn't time out returning a response to the user.

ckan / ckan

Overly aggressive indexing strategy greatly increases datastore storage requirements #5847