CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
Describe the bug
The postgres datastore creates several indices when a resource is uploaded.
A unique index for the _id internal field, an FTS index using the _full_text internal field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.
Though this is great for ad-hoc queries using datastore_search and datastore_search_sql, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.
For example, a table with 1.37 million rows and 9 columns is:
60 mb as a csv file
302 mb for the database table
569 mb for the 8 indices (unique index for _id, 1 fts index, and 6 indices for the 6 text columns. It did not create indices for the one timestamp and two numeric columns)
Expected behavior
The team should consider giving finer granular control to the CKAN administrator beyond specifying the index method.
Some indexing "knobs" to consider:
turn on/off FTS index
leverage the data dictionary to explicitly declare columns that should be indexed. As an added benefit, this should also allow indices for non-text columns
or alternatively, instead of creating a mini DBA interface on CKAN (a non-trivial task), just read the postgres system catalog and selectively show some indexing info on the data dictionary (leveraging the expanded datastore_info), and let the DB/CKAN administrator use robust tools like PgAdmin to manage datastore indices manually. This is consistent with the keeping the Datastore API "as thin as possible to allow you to use the features you would expect from a powerful database management system."
CKAN version 2.9.1
Describe the bug The postgres datastore creates several indices when a resource is uploaded.
A unique index for the
_id
internal field, an FTS index using the_full_text
internal field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.Though this is great for ad-hoc queries using
datastore_search
anddatastore_search_sql
, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.For example, a table with 1.37 million rows and 9 columns is:
_id
, 1 fts index, and 6 indices for the 6 text columns. It did not create indices for the one timestamp and two numeric columns)I got this info while testing the expanded
datastore_info
PR.Expected behavior The team should consider giving finer granular control to the CKAN administrator beyond specifying the index method.
Some indexing "knobs" to consider:
datastore_info
), and let the DB/CKAN administrator use robust tools like PgAdmin to manage datastore indices manually. This is consistent with the keeping the Datastore API "as thin as possible to allow you to use the features you would expect from a powerful database management system."