gentics / mesh-incubator

Project which is home for planned enhancements for Gentics Mesh

ElasticSearch: specify analyzer per language #218

Closed mephinet closed 4 years ago

mephinet commented 4 years ago

Currently, when creating or updating a schema in Gentics Mesh, a schema-wide ElasticSearch configuration can be provided, containing (among other things) filters and analyzers. Each field can then use and extend this configuration. While this concept works fine for single-language projects, in multi-language projects the ElasticSearch analyzer configuration is language-dependent, cf. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html. Therefore, the field configuration needs to allow specifying one analyzer per language, plus one fallback...

mephinet commented 4 years ago
philippguertler commented 4 years ago

Specification

Goals

Proposal

Inside the schema create or update request, the $.elasticsearch and $.fields.{{fieldName}}.elasticsearch properties will accept an additional _meshLanguageOverride field. This field must be an object. Each key of this object must be either a language used by nodes in Mesh or a comma-separated list of such languages. Each value must be the settings (for index settings) or the mapping of the field (for field mappings) to use for those languages.
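The key format described above can be sketched as a small validation step. This is only an illustration, not Mesh code: the function name, the whitespace trimming, and the rejection of a language that appears under two keys are assumptions that the proposal does not spell out.

```python
def expand_language_overrides(overrides, known_languages=None):
    """Map each language tag to its override, expanding comma-separated keys.

    `known_languages` is a hypothetical stand-in for the set of languages
    used by nodes in Mesh; when it is None, that check is skipped.
    """
    if not isinstance(overrides, dict):
        raise ValueError("_meshLanguageOverride must be an object")
    expanded = {}
    for key, value in overrides.items():
        for lang in key.split(","):
            lang = lang.strip()  # assumption: whitespace around tags is ignored
            if not lang:
                raise ValueError(f"empty language in key {key!r}")
            if known_languages is not None and lang not in known_languages:
                raise ValueError(f"unknown language {lang!r}")
            if lang in expanded:
                # assumption: configuring a language twice is an error
                raise ValueError(f"language {lang!r} configured more than once")
            expanded[lang] = value
    return expanded


# Placeholder values stand in for the real settings or mapping objects.
overrides = {"de": "german settings", "jp,zh,ko": "cjk settings"}
print(sorted(expand_language_overrides(overrides)))  # ['de', 'jp', 'ko', 'zh']
```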

When creating or updating a valid schema with _meshLanguageOverride set, Mesh will create an additional index for each language found in these objects. During the schema migration, each node will be stored in the index for its language, or in the default index if its language was not configured in the _meshLanguageOverride field. The default index uses the settings and mappings found directly in the $.elasticsearch and $.fields.{{fieldName}}.elasticsearch properties of the schema.
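The routing rule during migration can be sketched as follows. The index naming scheme (base name plus language suffix) and the function name are hypothetical; the proposal only states that configured languages get their own index and all other languages share the default one.

```python
def target_index(base_index, node_language, configured_languages):
    """Return the index a node would be written to during schema migration."""
    if node_language in configured_languages:
        # hypothetical naming scheme: one index per configured language
        return f"{base_index}-{node_language}"
    # languages without an override share the default index
    return base_index


# Languages found in the schema-level and field-level override objects.
configured = {"de", "jp", "zh", "ko", "fr"}
print(target_index("node-page", "de", configured))  # node-page-de
print(target_index("node-page", "en", configured))  # node-page
```

Note that a language counts as configured if it appears in any override object of the schema, whether at the schema level or on a single field.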

When searching, Mesh will query all node indices, just like before. Each index analyzes the query according to its own mappings, so the correct settings and mappings are chosen automatically. If the user wishes to query only nodes of a specific language, the query itself must contain that constraint by querying the $.language field.
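A language-constrained search body might look like the sketch below. The `language` field comes from the proposal; the function name, the `fields.title` path, and the choice of a `term` filter (which assumes `language` is indexed as a keyword) are assumptions for illustration.

```python
import json


def language_constrained_query(text, field, language):
    """Build an Elasticsearch bool query restricted to one node language."""
    return {
        "query": {
            "bool": {
                # full-text part, analyzed by each index's own mapping
                "must": [{"match": {field: text}}],
                # exact match on the node language (assumes a keyword field)
                "filter": [{"term": {"language": language}}],
            }
        }
    }


# "fields.title" is a hypothetical document path for the title field.
body = language_constrained_query("Haus", "fields.title", "de")
print(json.dumps(body, indent=2))
```

Without the `filter` clause, the same query would run against every node index and each index would apply its own analyzer to the search text.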

Example

{
  "name": "page",
  "elasticsearch": {
    "_meshLanguageOverride": {
      "de": {
        "analyzer": {
          "my_stop_analyzer": {
            "type": "stop",
            "stopwords": "_german_"
          }
        }
      },
      "jp,zh,ko": {
        "analyzer": {
          "my_stop_analyzer": {
            "type": "stop",
            "stopwords": "_cjk_"
          }
        }
      }
    },
    "analyzer": {
      "my_stop_analyzer": {
        "type": "stop",
        "stopwords": "_english_"
      }
    }
  },
  "fields": [
    {
      "name": "title",
      "type": "string",
      "elasticsearch": {
        "basicsearch": {
          "type": "text",
          "analyzer": "my_stop_analyzer"
        }
      }
    },
    {
      "name": "content",
      "type": "string",
      "elasticsearch": {
        "_meshLanguageOverride": {
          "fr": {
            "basicsearch": {
              "type": "text",
              "analyzer": "standard"
            }
          }
        },
        "basicsearch": {
          "type": "text",
          "analyzer": "my_stop_analyzer"
        }
      }
    }
  ]
}

This schema defines the my_stop_analyzer. By default, the English stop word list is used to filter out certain words. Nodes with the language de use the German list, and nodes with the language zh, jp or ko use the CJK list.

The title field uses this analyzer, which will differ for some languages as described above.

The content field uses the same analyzer. However, an exception has been made for nodes with the language fr. For these nodes, the standard analyzer (which has no stop words by default) will be used instead.
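The resolution of the content field in the example can be traced with a short sketch. It assumes that an override replaces the default configuration entirely rather than being merged into it, which the proposal implies but does not state outright; the function name is hypothetical.

```python
def effective_config(config, language):
    """Return the settings or mapping that apply to `language`."""
    overrides = config.get("_meshLanguageOverride", {})
    for key, override in overrides.items():
        # a key is either one language or a comma-separated list of languages
        if language in (part.strip() for part in key.split(",")):
            return override  # assumption: the override replaces the default
    # fall back to everything defined directly on the object
    return {k: v for k, v in config.items() if k != "_meshLanguageOverride"}


# The "elasticsearch" object of the content field from the example above.
content = {
    "_meshLanguageOverride": {
        "fr": {"basicsearch": {"type": "text", "analyzer": "standard"}}
    },
    "basicsearch": {"type": "text", "analyzer": "my_stop_analyzer"},
}

print(effective_config(content, "fr")["basicsearch"]["analyzer"])  # standard
print(effective_config(content, "de")["basicsearch"]["analyzer"])  # my_stop_analyzer
```

For de the field mapping is unchanged, but my_stop_analyzer itself resolves to the German stop word list through the schema-level override.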

TODOs in Mesh