ckan / ckan

CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers catalog.data.gov, open.canada.ca/data, data.humdata.org among many other sites.
https://ckan.org/

Overly aggressive indexing strategy greatly increases datastore storage requirements #5847

Open jqnatividad opened 3 years ago

jqnatividad commented 3 years ago

CKAN version 2.9.1

Describe the bug: The PostgreSQL datastore creates several indexes when a resource is uploaded:

a unique index on the internal _id field, an FTS (full-text search) index on the internal _full_text field (which stores all the column values in one field for FTS searches, effectively doubling the width of the table), and one index per text column.

Though this is great for ad-hoc queries using datastore_search and datastore_search_sql, and for interactive filtering on the UI, it greatly increases the datastore's storage requirements.

For example, a table with 1.37 million rows and 9 columns is:

I got this info while testing the expanded datastore_info PR.
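
For anyone who wants to take comparable measurements directly in PostgreSQL, here is a minimal sketch that builds a size query for one datastore table. It assumes the table is named after the resource id; the helper name relation_size_query and the sample id are made up for illustration. pg_table_size, pg_indexes_size, pg_total_relation_size and pg_size_pretty are standard PostgreSQL functions.

```python
def relation_size_query(table_name: str) -> str:
    """Build SQL that reports table, index, and total size for one table.

    Hypothetical helper, not part of CKAN. It only builds the query text;
    run the result with psql or any PostgreSQL client.
    """
    # Datastore table names usually contain dashes, so quote the identifier.
    quoted_ident = '"' + table_name.replace('"', '""') + '"'
    # Wrap the quoted identifier in a string literal and cast it to regclass.
    regclass_literal = "'" + quoted_ident.replace("'", "''") + "'::regclass"
    return (
        "SELECT\n"
        f"  pg_size_pretty(pg_table_size({regclass_literal})) AS table_size,\n"
        f"  pg_size_pretty(pg_indexes_size({regclass_literal})) AS indexes_size,\n"
        f"  pg_size_pretty(pg_total_relation_size({regclass_literal})) AS total_size;"
    )


if __name__ == "__main__":
    # "my-resource-id" is a placeholder, not a real resource.
    print(relation_size_query("my-resource-id"))
```

Comparing indexes_size with table_size shows how much of the total footprint comes from indexes rather than from the rows themselves.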

Expected behavior: The team should consider giving the CKAN administrator finer-grained control over indexing, beyond just specifying the index method.

Some indexing "knobs" to consider:

wardi commented 3 years ago

:+1: to using the data dictionary interface to enable indexes and having the per-column indexes disabled by default

wardi commented 1 week ago

Let's:

  1. Update datastore_create so that by default it adds no indexes when the list of fields to index is empty or not provided.
  2. Show which fields are indexed in the data dictionary and make it possible to add or remove indexes for field types that support indexing. Future: we might want to be able to select which columns are included in the full text search index here too.
  3. Creating indexes on large datasets will take a long time, so we should run the datastore_create call in a background job so that the web UI doesn't time out while waiting to return a response to the user.
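
To make the default proposed in item 1 concrete, here is a rough sketch of the decision logic. It is not CKAN's implementation: the function name plan_index_statements, the index naming scheme, and the btree default are all assumptions made for illustration. The point is only that an empty or missing list of fields yields no index statements at all.

```python
from typing import Iterable, List, Optional


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def plan_index_statements(
    resource_id: str,
    indexes: Optional[Iterable[str]] = None,
    method: str = "btree",
) -> List[str]:
    """Return CREATE INDEX statements for explicitly requested fields only.

    Hypothetical sketch of the proposed default: when `indexes` is empty
    or not provided, no per-column indexes are planned.
    """
    if not indexes:
        return []
    statements = []
    for field in indexes:
        # Assumed naming scheme; a real implementation must also respect
        # PostgreSQL's identifier length limit.
        index_name = f"{resource_id}_{field}_idx"
        statements.append(
            f"CREATE INDEX IF NOT EXISTS {quote_ident(index_name)} "
            f"ON {quote_ident(resource_id)} USING {method} ({quote_ident(field)})"
        )
    return statements


if __name__ == "__main__":
    print(plan_index_statements("r1"))            # no fields requested
    print(plan_index_statements("r1", ["city"]))  # one explicit index
```

Under item 3, statements like these would be executed from a background job rather than inside the web request, so a slow index build on a large table does not hold up the response.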