ckan / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
https://ckan.org/

Overly aggressive indexing strategy greatly increases datastore storage requirements #5847

Open jqnatividad opened 3 years ago

jqnatividad commented 3 years ago

CKAN version 2.9.1

Describe the bug

The postgres datastore creates several indices when a resource is uploaded: a unique index on the internal _id field; an FTS (full-text search) index on the internal _full_text field, which stores all the column values in one field for FTS searches, effectively doubling the width of the table; and one index per text column.
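To make the index count concrete, here is a minimal sketch that lists the kind of statements the description above implies. It is not CKAN's actual DDL: the function name, the `fts_method` parameter and its `"gist"` default, and the plain per-column indexes are assumptions for illustration only, and the index types CKAN really creates may differ.

```python
def sketch_index_ddl(table, text_columns, fts_method="gist"):
    """Return illustrative CREATE INDEX statements (hypothetical, not CKAN's real DDL)."""

    def q(name):
        # Quote an identifier, doubling any embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    stmts = [
        # One unique index on the internal _id field.
        f'CREATE UNIQUE INDEX ON {q(table)} ("_id")',
        # One full-text search index on the internal _full_text field.
        f'CREATE INDEX ON {q(table)} USING {fts_method} ("_full_text")',
    ]
    # One additional index for every text column in the resource.
    stmts += [f'CREATE INDEX ON {q(table)} ({q(col)})' for col in text_columns]
    return stmts


if __name__ == "__main__":
    # Hypothetical resource table with three text columns.
    for stmt in sketch_index_ddl("my_resource", ["name", "city", "notes"]):
        print(stmt)
```

Under these assumptions a table with N text columns ends up with N + 2 indexes, so the index count grows with the width of the resource.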

Though this is great for ad-hoc queries using datastore_search and datastore_search_sql, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.
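One way to see how much of a table's footprint comes from indexes is to ask PostgreSQL directly. The sketch below only builds the query string; `size_breakdown_sql` is a hypothetical helper, and the table name is whatever datastore table you want to inspect. The three size functions are standard PostgreSQL: `pg_table_size` excludes indexes, `pg_indexes_size` covers all indexes on the table, and `pg_total_relation_size` is the sum of both.

```python
def size_breakdown_sql(table):
    """Build a query reporting table, index, and total size for one table."""
    # Quote the identifier, then wrap it as a string literal for the regclass cast.
    quoted = '"' + table.replace('"', '""') + '"'
    literal = "'" + quoted.replace("'", "''") + "'"
    ref = literal + "::regclass"
    return (
        "SELECT "
        f"pg_size_pretty(pg_table_size({ref})) AS table_size, "
        f"pg_size_pretty(pg_indexes_size({ref})) AS indexes_size, "
        f"pg_size_pretty(pg_total_relation_size({ref})) AS total_size"
    )


if __name__ == "__main__":
    # Hypothetical table name; run the printed query with psql or any client.
    print(size_breakdown_sql("my_resource"))
```

Comparing `indexes_size` against `table_size` for a few large resources should show how much storage the default indexing strategy adds.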

For example, a table with 1.37 million rows and 9 columns is:

I got this info while testing the expanded datastore_info PR.

Expected behavior

The team should consider giving the CKAN administrator finer-grained control over indexing, beyond specifying the index method.

Some indexing "knobs" to consider:

wardi commented 3 years ago

:+1: to using the data dictionary interface to enable indexes and having the per-column indexes disabled by default