deepset-ai / haystack

🔍 AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

How can I use the analyzer installed as an additional plugin? #2223

Closed · jaehyeongAN closed this 2 years ago

jaehyeongAN commented 2 years ago

Question

Hi! I found out from that PR that the standard analyzer can be changed, but the language I use (Korean) is not among the built-in language analyzers.

From that link, I learned that an Elasticsearch analyzer supporting Korean can be used by installing an additional plugin.

However, after installing the plugin, the custom_mapping is applied but the analyzer does not work. Below is my code.

# mapping dict
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},
            "text": {"type": "text"},
            "embed_vector": {"type":"dense_vector", "dims":768}
        },
        "dynamic_templates": [{
            "strings": {
                "path_match": "*",
                "match_mapping_type": "string",
                "mapping": {"type": "keyword"}
            }
        }],
    },
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_tokenizer": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed"
                }
            },
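            # note: this analyzer is defined but no field in "mappings" above uses it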
            "analyzer": {
                "korean_analyzer": {
                    "type": "custom",
                    "tokenizer": "korean_tokenizer"
                }
            }
        }
    }
}

# DocumentStore
document_store = ElasticsearchDocumentStore(
    host=ES_HOST,
    port=ES_PORT,
    scheme=ES_SCHEME,
    index=INDEX_NAME, 
    embedding_dim=768,
    embedding_field="embed_vector",
    similarity="cosine",
    duplicate_documents='skip',
    custom_mapping=mapping
)
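
One way to confirm what Elasticsearch actually applied is to read the index mapping and settings back. A minimal sketch, assuming the official elasticsearch Python client and the same connection variables as above:

# Read back what was actually applied to the index
# (assumes the official elasticsearch client and the variables above)
from elasticsearch import Elasticsearch

es = Elasticsearch(f"{ES_SCHEME}://{ES_HOST}:{ES_PORT}")
print(es.indices.get_mapping(index=INDEX_NAME))   # field mappings
print(es.indices.get_settings(index=INDEX_NAME))  # analysis settings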

When I tested the analyzer directly through the Elasticsearch REST API, it worked fine:

GET chatbot-agent-assist/_analyze
{
  "analyzer": "nori",
  "text": "세상에서 가장 신나는 일요일 아침!"
}
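
Note that this request exercises the built-in nori analyzer rather than the custom one; the korean_analyzer defined in the settings can be checked the same way, assuming the index was created with the mapping above:

GET chatbot-agent-assist/_analyze
{
  "analyzer": "korean_analyzer",
  "text": "세상에서 가장 신나는 일요일 아침!"
}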

Is the custom_mapping part the problem? Thanks for any comments!


jaehyeongAN commented 2 years ago

I resolved the issue (v0.10.0).

# mapping dict
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "text": {
                "type": "text", 
                "analyzer":"korean_analyzer",  # add this!
                "search_analyzer":"standard"
            },
            "embed_vector": {
                "type":"dense_vector", "dims":768
            }
        },
        "dynamic_templates": [
            {
                "strings": {
                    "path_match": "*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"}
                }
            }
        ],
    },
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_tokenizer": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed"
                }
            },
            "analyzer": {
                "korean_analyzer": {
                    "tokenizer": "korean_tokenizer"
                }
            }
        }
    }
}

# DocumentStore
document_store = ElasticsearchDocumentStore(
    host=ES_HOST,
    port=ES_PORT,
    scheme=ES_SCHEME,
    index=INDEX_NAME, 
    embedding_dim=768,
    embedding_field="embed_vector",
    similarity="cosine",
    duplicate_documents='skip',
    custom_mapping=mapping
)
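
For reference, a minimal indexing-and-retrieval sketch against this store. The sample document is hypothetical, and the ElasticsearchRetriever import path matches Haystack v0.x; it may differ in other versions:

# Hypothetical usage sketch (import path matches Haystack v0.x)
from haystack.retriever.sparse import ElasticsearchRetriever

document_store.write_documents(
    [{"name": "sample", "text": "세상에서 가장 신나는 일요일 아침!"}]
)

retriever = ElasticsearchRetriever(document_store=document_store)
print(retriever.retrieve(query="일요일", top_k=5))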

apohllo commented 2 years ago

@jaehyeongAN I am only wondering why you define a search_analyzer that is different from the index analyzer. It seems to me the queries will then be processed with the default standard analyzer. Not providing search_analyzer at all would be enough to use the same analyzer for both the documents and the queries.
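
For illustration, using the same analyzer on both sides would just mean dropping search_analyzer from the text field. A sketch against the mapping dict above (Elasticsearch falls back to the index-time analyzer at search time when no search_analyzer is set):

# Same analyzer at index and search time: omit "search_analyzer"
mapping["mappings"]["properties"]["text"] = {
    "type": "text",
    "analyzer": "korean_analyzer",
}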