deepset-ai / haystack

🔍 AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

How can I use the analyzer installed as an additional plugin? #2223

Closed · jaehyeongAN closed this 2 years ago

jaehyeongAN commented 2 years ago

Question

Hi! I found out from that PR that the standard analyzer can be changed, but the language I use (Korean) is not among the built-in language analyzers.

From that link, I learned that an Elasticsearch analyzer supporting Korean can be used by installing an additional plugin.

However, after installing the plugin, the custom_mapping is applied but the analyzer does not work. Below is my code.

# mapping dict
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},
            "text": {"type": "text"},
            "embed_vector": {"type":"dense_vector", "dims":768}
        },
        "dynamic_templates": [{
            "strings": {
                "path_match": "*",
                "match_mapping_type": "string",
                "mapping": {"type": "keyword"}
            }
        }],
    },
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_tokenizer": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed"
                }
            },
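            # note: this analyzer is defined but no field in "mappings" above uses it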
            "analyzer": {
                "korean_analyzer": {
                    "type": "custom",
                    "tokenizer": "korean_tokenizer"
                }
            }
        }
    }
}

# DocumentStore
document_store = ElasticsearchDocumentStore(
    host=ES_HOST,
    port=ES_PORT,
    scheme=ES_SCHEME,
    index=INDEX_NAME, 
    embedding_dim=768,
    embedding_field="embed_vector",
    similarity="cosine",
    duplicate_documents='skip',
    custom_mapping=mapping
)
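
One way to confirm what Elasticsearch actually applied is to read the index mapping and settings back. A minimal sketch, assuming the official elasticsearch Python client and the same connection variables as above:

# Read back what was actually applied to the index
# (assumes the official elasticsearch client and the variables above)
from elasticsearch import Elasticsearch

es = Elasticsearch(f"{ES_SCHEME}://{ES_HOST}:{ES_PORT}")
print(es.indices.get_mapping(index=INDEX_NAME))   # field mappings
print(es.indices.get_settings(index=INDEX_NAME))  # analysis settings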

When I tested the analyzer directly through the Elasticsearch REST API, it worked fine:

GET chatbot-agent-assist/_analyze
{
  "analyzer": "nori",
  "text": "세상에서 가장 신나는 일요일 아침!"
}
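
Note that this request exercises the built-in nori analyzer rather than the custom one; the korean_analyzer defined in the settings can be checked the same way, assuming the index was created with the mapping above:

GET chatbot-agent-assist/_analyze
{
  "analyzer": "korean_analyzer",
  "text": "세상에서 가장 신나는 일요일 아침!"
}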

Is the custom_mapping part the problem? Thanks for any comments!


jaehyeongAN commented 2 years ago

I resolved the issue (v0.10.0).

# mapping dict
mapping = {
    "mappings": {
        "properties": {
            "name": {
                "type": "keyword"
            },
            "text": {
                "type": "text", 
                "analyzer":"korean_analyzer",  # add this!
                "search_analyzer":"standard"
            },
            "embed_vector": {
                "type":"dense_vector", "dims":768
            }
        },
        "dynamic_templates": [
            {
                "strings": {
                    "path_match": "*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"}
                }
            }
        ],
    },
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_tokenizer": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed"
                }
            },
            "analyzer": {
                "korean_analyzer": {
                    "tokenizer": "korean_tokenizer"
                }
            }
        }
    }
}

# DocumentStore
document_store = ElasticsearchDocumentStore(
    host=ES_HOST,
    port=ES_PORT,
    scheme=ES_SCHEME,
    index=INDEX_NAME, 
    embedding_dim=768,
    embedding_field="embed_vector",
    similarity="cosine",
    duplicate_documents='skip',
    custom_mapping=mapping
)
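
For reference, a minimal indexing-and-retrieval sketch against this store. The sample document is hypothetical, and the ElasticsearchRetriever import path matches Haystack v0.x; it may differ in other versions:

# Hypothetical usage sketch (import path matches Haystack v0.x)
from haystack.retriever.sparse import ElasticsearchRetriever

document_store.write_documents(
    [{"name": "sample", "text": "세상에서 가장 신나는 일요일 아침!"}]
)

retriever = ElasticsearchRetriever(document_store=document_store)
print(retriever.retrieve(query="일요일", top_k=5))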

apohllo commented 2 years ago

@jaehyeongAN I am only wondering why you define a search_analyzer that is different from the index analyzer. It seems to me the queries will then be processed with the default standard analyzer. Not providing search_analyzer at all would be enough to use the same analyzer for both the documents and the queries.
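
For illustration, using the same analyzer on both sides would just mean dropping search_analyzer from the text field. A sketch against the mapping dict above (Elasticsearch falls back to the index-time analyzer at search time when no search_analyzer is set):

# Same analyzer at index and search time: omit "search_analyzer"
mapping["mappings"]["properties"]["text"] = {
    "type": "text",
    "analyzer": "korean_analyzer",
}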