cessda / cessda.cdc.versions

Issue track and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Use field that has case_insensitive_normalizer for Topic and Keywords filters #609

Closed: markusjt closed this issue 1 year ago

markusjt commented 1 year ago

Currently, searching for e.g. "working conditions" in the Keywords filter returns three separate entries that differ only in case: WORKING CONDITIONS (968), Working conditions (659), working conditions (232).

The settings for every language define at least one normalizer with a lowercase filter, called case_insensitive_normalizer.

The normalizer in settings_cmmstudy_en.json:

    "normalizer": {
      "case_insensitive_normalizer": {
        "type": "custom",
        "char_filter": [],
        "filter": [
          "lowercase",
          "asciifolding"
        ]
      }
    }
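
To make the effect of this normalizer concrete, here is a minimal JavaScript sketch that approximates the two filters and merges the three buckets from the example above. `normalizeTerm` and `mergeBuckets` are hypothetical helpers, not part of Elasticsearch or searchkit, and the Unicode handling only approximates what `asciifolding` does.

```javascript
// Rough stand-in for the "lowercase" + "asciifolding" filters.
// Assumption: NFD decomposition plus stripping combining marks only
// approximates asciifolding (it will not fold e.g. "ø" or "ß").
function normalizeTerm(term) {
  return term
    .toLowerCase()
    .normalize("NFD")
    .replace(/\p{M}/gu, "");
}

// Merge facet buckets whose keys normalize to the same value.
function mergeBuckets(buckets) {
  const merged = new Map();
  for (const { key, doc_count } of buckets) {
    const normalized = normalizeTerm(key);
    merged.set(normalized, (merged.get(normalized) ?? 0) + doc_count);
  }
  return [...merged].map(([key, doc_count]) => ({ key, doc_count }));
}

// Counts taken from the Keywords example in this issue.
const keywordBuckets = [
  { key: "WORKING CONDITIONS", doc_count: 968 },
  { key: "Working conditions", doc_count: 659 },
  { key: "working conditions", doc_count: 232 },
];

console.log(mergeBuckets(keywordBuckets));
```

With these counts the three buckets collapse into a single "working conditions" bucket of 968 + 659 + 232 = 1859 entries, which is roughly what a filter on a field using case_insensitive_normalizer should show.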

A normalized sub-field using this normalizer should be added to term for both classifications and keywords in mappings_cmmstudy.json ("..." is a placeholder for the omitted properties; it is quoted only so that the snippet is valid JSON and can be formatted):

      "classifications": {
        "type": "nested",
        "properties": {
          "...",
          "term": {
            "type": "keyword",
            "ignore_above": 256,
            "copy_to": "classificationsSearchField",
            "fields": {
              "normalized": {
                "type": "keyword",
                "normalizer": "case_insensitive_normalizer"
              }
            }
          },
          "..."
        }
      },
      "keywords": {
        "type": "nested",
        "properties": {
          "...",
          "term": {
            "type": "text",
            "analyzer": "pasc_index_autocomplete_analyzer",
            "search_analyzer": "pasc_standard_analyzer",
            "term_vector": "with_positions_offsets",
            "copy_to": [ "keywordsSearchField", "keywordsKeywordField" ],
            "fields": {
              "normalized": {
                "type": "keyword",
                "normalizer": "case_insensitive_normalizer"
              }
            }
          },
          "..."
        }
      },

The new sub-fields would then be used in RefinementListFilter:

<RefinementListFilter id="classifications.term"
                      ...
                      field={'classifications.term.normalized'}
                      fieldOptions={{
                        type: 'nested',
                        options: {path: 'classifications', min_doc_count: 1}
                      }}
                      ...
>
<RefinementListFilter id="keywords.term"
                      ...
                      field={'keywords.term.normalized'}
                      fieldOptions={{
                        type: 'nested',
                        options: {path: 'keywords', min_doc_count: 1}
                      }}
                      ...
>
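
For checking the new sub-field directly against Elasticsearch, a nested terms aggregation along the following lines could be used. This is a minimal sketch of a plain Elasticsearch request body, not necessarily the exact query searchkit generates for these filters; `buildNestedTermsAggregation` and the aggregation name `by_term` are hypothetical.

```javascript
// Build a search body that counts terms on a field inside a nested path.
// size: 0 suppresses the hits so only the aggregation is returned.
function buildNestedTermsAggregation(path, field, size = 10) {
  return {
    size: 0,
    aggs: {
      [path]: {
        nested: { path },
        aggs: {
          by_term: {
            terms: { field, min_doc_count: 1, size },
          },
        },
      },
    },
  };
}

// Field and path names taken from the filters above.
const keywordsAgg = buildNestedTermsAggregation(
  "keywords",
  "keywords.term.normalized"
);

console.log(JSON.stringify(keywordsAgg, null, 2));
```

Sending this body to the index's `_search` endpoint should return one bucket per normalized term, so case variants of the same keyword would no longer appear as separate entries.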

Does adding these new fields to the mappings require reindexing? I think it might work without reindexing: add the fields with the put mapping API, then use the update API on the existing documents.

Relevant Elasticsearch docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
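
A possible two-step approach could look like the sketch below, which only builds the request descriptions and sends nothing. The index name `cmmstudy_en` and both helper functions are hypothetical. The second step uses the update by query API instead of the single-document update API linked above; that swap is an assumption, made because a sub-field added through put mapping is only populated for documents indexed or updated afterwards.

```javascript
// Hypothetical index name; the real index names are not given in this issue.
const INDEX = "cmmstudy_en";

// Sub-field definition copied from the mapping proposed in this issue.
const normalized = {
  type: "keyword",
  normalizer: "case_insensitive_normalizer",
};

// Step 1: request for PUT /<index>/_mapping. The existing parameters of
// each "term" field are restated so the request only adds the sub-field.
function buildPutMappingRequest(index) {
  return {
    method: "PUT",
    path: `/${index}/_mapping`,
    body: {
      properties: {
        classifications: {
          type: "nested",
          properties: {
            term: {
              type: "keyword",
              ignore_above: 256,
              copy_to: "classificationsSearchField",
              fields: { normalized },
            },
          },
        },
        keywords: {
          type: "nested",
          properties: {
            term: {
              type: "text",
              analyzer: "pasc_index_autocomplete_analyzer",
              search_analyzer: "pasc_standard_analyzer",
              term_vector: "with_positions_offsets",
              copy_to: ["keywordsSearchField", "keywordsKeywordField"],
              fields: { normalized },
            },
          },
        },
      },
    },
  };
}

// Step 2 (assumption): POST /<index>/_update_by_query without a query
// rewrites every document in place so it picks up the new sub-field.
function buildUpdateByQueryRequest(index) {
  return {
    method: "POST",
    path: `/${index}/_update_by_query?conflicts=proceed`,
    body: null,
  };
}

console.log(JSON.stringify(buildPutMappingRequest(INDEX), null, 2));
console.log(buildUpdateByQueryRequest(INDEX).path);
```

Whether this is preferable to a full reindex depends on index size and on how the indexer normally recreates indices, so it would need testing on a non-production index first.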

markusjt commented 1 year ago

Opened PRs for this: https://github.com/cessda/cessda.cdc.osmh-indexer.cmm/pull/41 https://github.com/cessda/cessda.cdc.searchkit/pull/168

The index changes obviously have to be in place before the searchkit changes can be merged. I don't know the proper procedure for this, but I assume @matthew-morris-cessda does, so I'll leave it to you, thank you! You can merge the PRs yourself if you want, or I can still do it after approval as we've done so far.