django-haystack / django-haystack

Modular search for Django
http://haystacksearch.org/

NgramField acting like a CharField when BooleanField is present #1028

Open sebclaeys opened 10 years ago

sebclaeys commented 10 years ago

Description / How to reproduce: http://stackoverflow.com/questions/24659326/ngramfield-not-working-if-booleanfield-is-present-haystack-elasticsearch-wi

karolmajta commented 10 years ago

This has bitten me too. After putting some print statements in pyelasticsearch (client.py), I managed to collect some more data on this.

The index is created with these settings from the Elasticsearch backend:

{
    "settings": {
        "analysis": {
            "filter": {
                "haystack_edgengram": {
                    "max_gram": 15,
                    "type": "edgeNGram",
                    "min_gram": 2
                },
                "haystack_ngram": {
                    "max_gram": 15,
                    "type": "nGram",
                    "min_gram": 3
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "max_gram": 15,
                    "type": "nGram",
                    "min_gram": 3
                },
                "haystack_edgengram_tokenizer": {
                    "max_gram": 15,
                    "type": "edgeNGram",
                    "side": "front",
                    "min_gram": 2
                }
            },
            "analyzer": {
                "edgengram_analyzer": {
                    "filter": [
                        "haystack_edgengram"
                    ],
                    "type": "custom",
                    "tokenizer": "lowercase"
                },
                "ngram_analyzer": {
                    "filter": [
                        "haystack_ngram"
                    ],
                    "type": "custom",
                    "tokenizer": "lowercase"
                }
            }
        }
    }
}

I think this defines the analyzers, tokenizers and filters at the index level (am I right?). Then django-haystack tries to set up the modelresult mapping:

{
    "modelresult": {
        "_boost": {
            "name": "boost",
            "null_value": 1
        },
        "properties": {
            "text": {
                "index": "analyzed",
                "term_vector": "with_positions_offsets",
                "type": "string",
                "analyzer": "snowball",
                "boost": 1,
                "store": "yes"
            },
            "i_cause_errors": {
                "index": "analyzed",
                "boost": 1,
                "store": "yes",
                "type": "boolean"
            },
            "text_auto": {
                "index": "analyzed",
                "term_vector": "with_positions_offsets",
                "type": "string",
                "analyzer": "ngram_analyzer",
                "boost": 1,
                "store": "yes"
            }
        }
    }
}

Unfortunately, this causes Elasticsearch to reject the whole mapping request:

{
    "error": "ElasticsearchIllegalArgumentException[bool field can't be tokenized]",
    "status": 400
}
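
The error message points at "i_cause_errors", which is sent with "index": "analyzed" even though its type is boolean. A minimal sketch of one possible fix, assuming that option is what triggers the rejection: only attach the analysis options to string fields. The helper `build_field_mapping` and the `ANALYZED_TYPES` set are hypothetical names for illustration, not part of Haystack or pyelasticsearch.

```python
# Hypothetical helper, not Haystack's actual code: build the mapping for one
# field and add analysis options only for types that can be tokenized.
ANALYZED_TYPES = {"string"}  # assumption: only string fields are analyzed


def build_field_mapping(es_type, analyzer="snowball", boost=1):
    mapping = {"type": es_type, "boost": boost, "store": "yes"}
    if es_type in ANALYZED_TYPES:
        # These keys only make sense for tokenized (string) fields.
        mapping["index"] = "analyzed"
        mapping["term_vector"] = "with_positions_offsets"
        mapping["analyzer"] = analyzer
    return mapping


properties = {
    "text": build_field_mapping("string", analyzer="snowball"),
    "i_cause_errors": build_field_mapping("boolean"),
    "text_auto": build_field_mapping("string", analyzer="ngram_analyzer"),
}
```

With this, the boolean field is sent as a plain typed field without "index" or "analyzer", while "text" and "text_auto" keep the same options as in the mapping above.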

This exception gets swallowed somewhere along the way, so the mapping that ends up on the index is missing the analyzer settings.

This is what you get:

{
    "haystack": {
        "mappings": {
            "modelresult": {
                "properties": {
                    "django_ct": {
                        "type": "string"
                    },
                    "django_id": {
                        "type": "string"
                    },
                    "i_cause_errors": {
                        "type": "boolean"
                    },
                    "text": {
                        "type": "string"
                    },
                    "text_auto": {
                        "type": "string"
                    }
                }
            }
        }
    }
}

While this is what you'd expect:

{
    "haystack": {
        "mappings": {
            "modelresult": {
                "_boost": {
                    "name": "boost",
                    "null_value": 1
                },
                "properties": {
                    "django_ct": {
                        "type": "string"
                    },
                    "django_id": {
                        "type": "string"
                    },
                    "i_cause_errors": {
                        "type": "boolean"
                    },
                    "text": {
                        "type": "string",
                        "store": true,
                        "term_vector": "with_positions_offsets",
                        "analyzer": "snowball"
                    },
                    "text_auto": {
                        "type": "string",
                        "store": true,
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ngram_analyzer"
                    }
                }
            }
        }
    }
}
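
Since the failure is silent, a quick way to spot it is to compare the analyzers you asked for with what the index reports back. The following is only a sketch: `missing_analyzers` is a hypothetical helper, and the two dictionaries are the "properties" sections of the two mappings above, trimmed to the relevant keys.

```python
def missing_analyzers(expected_props, actual_props):
    """Return {field: analyzer} for fields whose requested analyzer is absent."""
    missing = {}
    for name, expected in expected_props.items():
        wanted = expected.get("analyzer")
        if wanted is None:
            continue  # no analyzer requested for this field
        if actual_props.get(name, {}).get("analyzer") != wanted:
            missing[name] = wanted
    return missing


# What we expected the index to contain.
expected = {
    "i_cause_errors": {"type": "boolean"},
    "text": {"type": "string", "analyzer": "snowball"},
    "text_auto": {"type": "string", "analyzer": "ngram_analyzer"},
}
# What the index actually reports after the failed request.
actual = {
    "i_cause_errors": {"type": "boolean"},
    "text": {"type": "string"},
    "text_auto": {"type": "string"},
}

print(missing_analyzers(expected, actual))
```

For the mappings in this comment, both "text" and "text_auto" are reported as missing their analyzer, which matches the NgramField behaving like a plain CharField.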

Hopefully this may help someone to craft a fix :)