elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Mapping conflicts cause the infinite shard allocation loop #16139

Closed: sombut closed this issue 8 years ago

sombut commented 8 years ago

I'm using the latest Elasticsearch 2.1.1 as part of an ELK stack, in an environment where it is difficult to control the consistency of field types and mappings.

I was aware of the breaking mapping changes since the 1.x versions. I tested this to verify stability in a 2.x cluster and found this unexpected behaviour.

I started a fresh Elasticsearch 2.1.1 cluster with no data. When indexing the data below using the bulk API, I got an infinite loop of failed shard allocations: it kept retrying but never succeeded.

Steps to reproduce: 1) Install the attached logstash template (see the command below). 2) Run curl against Elasticsearch with the data below. Sometimes it succeeds, but sometimes it fails with an infinite loop of shard allocation failures (I usually hit this issue about 1 in 3 times).
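
For step 1, the attached template can be installed with a command like this (the local file path is only an example):

curl -s -XPUT localhost:9200/_template/logstash --data-binary "@/tmp/logstash_template.json"

For step 2: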

curl -s -XPOST localhost:9200/_bulk --data-binary "@/tmp/a"

The /tmp/a file contains:

{ "index" : { "_index" : "logstash-2016.02.01", "_type" : "tweet" } }
{ "message" : 0, "test" : 20, "status" : 0 }
{ "index" : { "_index" : "logstash-2016.02.01", "_type" : "tweet" } }
{ "message" : "value1", "test" : "test", "status" : "0" }
{ "index" : { "_index" : "logstash-2016.02.01", "_type" : "twoot" } }
{ "message" : "value1", "test" : "test", "status" : "0" }
{ "index" : { "_index" : "logstash-2016.02.01", "_type" : "twaat" } }
{ "message" : "value1", "test" : "test", "status" : "0" }

Once run, you will get an error like this:

[screenshot: shard allocation failure errors]
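
The failing shards and the recovery retry loop can also be watched with the cat APIs, for example:

curl -s 'localhost:9200/_cat/shards?v'
curl -s 'localhost:9200/_cat/recovery?v'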

I've also attached: 1) the error from the Elasticsearch log: error.txt, 2) the logstash template: logstash_template.txt, 3) the logstash-2016.02.01 index mapping after running the command: logstash-2016.02.01_mapping.txt.

I noticed something wrong in the mapping as follows: 1) The tweet type has the correct mapping (all fields have type: long). 2) Somehow the mapping of the twaat type became string for all fields, and this is why I got the infinite loop of shard recovery.

[screenshot: the twaat type mapped with type string for all fields]
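
The conflicting per-type mappings can be compared with the field mapping API, e.g.:

curl -s 'localhost:9200/logstash-2016.02.01/_mapping/field/message,test,status?pretty'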

Once this happens, the only way I have found to resolve it is to delete that index. Does this mean that if it happened in a production environment, I would need to delete the index and lose all the data?
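
Right now my only workaround is to delete the whole index:

curl -s -XDELETE 'localhost:9200/logstash-2016.02.01'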

Any help on this issue would be highly appreciated.

Thank you. Sombut

clintongormley commented 8 years ago

Hi @sombut

Thanks for the report - I've spent a few hours playing with the example and so far have been unable to reproduce the shard failures that you're seeing. I've tried with clusters of two and three nodes. What does your cluster look like?

Are you sure that it is ES 2.1.1? I've seen this happen on 2.1.0, but that was supposed to be fixed in 2.1.1. (There are a number of other mapping fixes coming in 2.2 and 2.3 as well.)

Something else I've seen on 2.1.1 while running your recreation (which I can't replicate on 2.2.0) is that the bulk request will sometimes hang indefinitely.

Btw, you can improve your dynamic mappings to avoid this issue: delete the match_mapping_type parameter from the template for message_field, and delete all of the templates for double/float/integer/date/etc - those field types already use doc_values by default.

clintongormley commented 8 years ago

Recreation here:

DELETE *
DELETE _template/*
PUT _template/logstash
{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "mapping": {
        "ignore_malformed": "true"
      },
      "refresh_interval": "30s"
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "message_field": {
            "mapping": {
              "fielddata": {
                "format": "disabled"
              },
              "index": "analyzed",
              "omit_norms": true,
              "type": "string"
            },
            "match_mapping_type": "string",
            "match": "message"
          }
        },
        {
          "string_fields": {
            "mapping": {
              "fielddata": {
                "format": "disabled"
              },
              "index": "analyzed",
              "omit_norms": true,
              "type": "string",
              "fields": {
                "raw": {
                  "ignore_above": 256,
                  "index": "not_analyzed",
                  "type": "string",
                  "doc_values": true
                }
              }
            },
            "match_mapping_type": "string",
            "match": "*"
          }
        },
        {
          "float_fields": {
            "mapping": {
              "type": "float",
              "doc_values": true
            },
            "match_mapping_type": "float",
            "match": "*"
          }
        },
        {
          "double_fields": {
            "mapping": {
              "type": "double",
              "doc_values": true
            },
            "match_mapping_type": "double",
            "match": "*"
          }
        },
        {
          "byte_fields": {
            "mapping": {
              "type": "byte",
              "doc_values": true
            },
            "match_mapping_type": "byte",
            "match": "*"
          }
        },
        {
          "short_fields": {
            "mapping": {
              "type": "short",
              "doc_values": true
            },
            "match_mapping_type": "short",
            "match": "*"
          }
        },
        {
          "integer_fields": {
            "mapping": {
              "type": "integer",
              "doc_values": true
            },
            "match_mapping_type": "integer",
            "match": "*"
          }
        },
        {
          "long_fields": {
            "mapping": {
              "type": "long",
              "doc_values": true
            },
            "match_mapping_type": "long",
            "match": "*"
          }
        },
        {
          "date_fields": {
            "mapping": {
              "type": "date",
              "doc_values": true
            },
            "match_mapping_type": "date",
            "match": "*"
          }
        },
        {
          "geo_point_fields": {
            "mapping": {
              "type": "geo_point",
              "doc_values": true
            },
            "match_mapping_type": "geo_point",
            "match": "*"
          }
        }
      ],
      "_all": {
        "omit_norms": true,
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true
        },
        "geoip": {
          "dynamic": true,
          "type": "object",
          "properties": {
            "ip": {
              "type": "ip",
              "doc_values": true
            },
            "latitude": {
              "type": "float",
              "doc_values": true
            },
            "location": {
              "type": "geo_point",
              "doc_values": true
            },
            "longitude": {
              "type": "float",
              "doc_values": true
            }
          }
        },
        "@version": {
          "index": "not_analyzed",
          "type": "string",
          "doc_values": true
        }
      }
    },
    "tornado_iis_advanced": {
      "properties": {
        "args": {
          "properties": {
            "version": {
              "norms": {
                "enabled": false
              },
              "type": "string",
              "fields": {
                "raw": {
                  "ignore_above": 256,
                  "index": "not_analyzed",
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  },
  "aliases": {}
}

DELETE *
POST /_bulk
{"index":{"_index":"logstash-2016.02.01","_type":"tweet"}}
{"message":0,"test":20,"status":0}
{"index":{"_index":"logstash-2016.02.01","_type":"tweet"}}
{"message":"value1","test":"test","status":"0"}
{"index":{"_index":"logstash-2016.02.01","_type":"twoot"}}
{"message":"value1","test":"test","status":"0"}
{"index":{"_index":"logstash-2016.02.01","_type":"twaat"}}
{"message":"value1","test":"test","status":"0"}

GET _mapping/field/message,status

sombut commented 8 years ago

Hi @clintongormley

Thanks for your help.

My cluster has 1 master and 2 data nodes. I'm sure it is version 2.1.1. You can see the full log from when Elasticsearch started until it hit this error: compass-dev-esmn-01.zip. Your recreation steps are right, and I can recreate the issue using them too.

Below is my new template with match_mapping_type removed. It really helps avoid this issue, but only when the conflicting field name is message.

PUT _template/logstash
{
  "order": 0,
  "template": "logstash-*",
  "settings": {
    "index": {
      "mapping": {
        "ignore_malformed": "true"
      },
      "refresh_interval": "30s"
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "message_field": {
            "mapping": {
              "fielddata": {
                "format": "disabled"
              },
              "index": "analyzed",
              "omit_norms": true,
              "type": "string"
            },
            "match": "message"
          }
        }
      ],
      "_all": {
        "omit_norms": true,
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date",
          "doc_values": true
        },
        "geoip": {
          "dynamic": true,
          "type": "object",
          "properties": {
            "ip": {
              "type": "ip",
              "doc_values": true
            },
            "latitude": {
              "type": "float",
              "doc_values": true
            },
            "location": {
              "type": "geo_point",
              "doc_values": true
            },
            "longitude": {
              "type": "float",
              "doc_values": true
            }
          }
        },
        "@version": {
          "index": "not_analyzed",
          "type": "string",
          "doc_values": true
        }
      }
    },
    "tornado_iis_advanced": {
      "properties": {
        "args": {
          "properties": {
            "version": {
              "norms": {
                "enabled": false
              },
              "type": "string",
              "fields": {
                "raw": {
                  "ignore_above": 256,
                  "index": "not_analyzed",
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  },
  "aliases": {}
}

But when I change the field name from message to something else, for example size, the problem happens again.

{"index":{"_index":"logstash-2016.02.01","_type":"tweet"}}
{"size":0,"test":20,"status":0}
{"index":{"_index":"logstash-2016.02.01","_type":"tweet"}}
{"size":"value1","test":"test","status":"0"}
{"index":{"_index":"logstash-2016.02.01","_type":"twoot"}}
{"size":"value1","test":"test","status":"0"}
{"index":{"_index":"logstash-2016.02.01","_type":"twaat"}}
{"size":"value1","test":"test","status":"0"}

[screenshot: shard allocation failures recurring with the size field]

clintongormley commented 8 years ago

Thanks @sombut - with one master and two data nodes I'm able to recreate this issue in 2.1.1 and in 2.1.2. The good news is that I can't recreate it on 2.2.0 (which we are close to releasing).

sombut commented 8 years ago

Thank you @clintongormley for the good news. I'll wait for the 2.2.0 release. Do you know when it will be released, please?

jpountz commented 8 years ago

This seems to be fixed by #15142.

"I'll wait for the 2.2.0 release. Do you know when it will be released, please?"

It will be out "soon" after we fix the last remaining issues: https://github.com/elastic/elasticsearch/labels/v2.2.0

sombut commented 8 years ago

Thank you @jpountz