elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

Settings option is not applied #69

Open lukas-vlcek opened 7 years ago

lukas-vlcek commented 7 years ago

It seems that the --settings option is not applied. The following is a repro script for the wiki use case.

$ ./stream2es --version
2017-01-20T13:03:47.629+0000 INFO  stream2es 20161020121123fe262bd

$ export ESURL=http://10.40.2.198:9200
$ curl ${ESURL}
{
  "name" : "Mikey",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "5vscFAPyRAqiX75Gwz5n9Q",
  "version" : {
    "number" : "2.4.1",
    "build_hash" : "c67dc32e24162035d18d6fe1e952c4cbcbe79d16",
    "build_timestamp" : "2016-09-27T18:57:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

# Starting with empty cluster
$ curl ${ESURL}/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size

# Let's start indexing wiki. Stop the task after 10 seconds.
nohup \
./stream2es wiki \
   --target ${ESURL}/wiki \
   --clobber true \
   --settings '{ "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 1 }}}' \
>/dev/null 2>&1 &
sleep 10
kill $!

# Index "wiki" has the default number of shards and no replicas. Why?
$ curl ${ESURL}/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size 
green  open   wiki    2   0        193            0      3.5mb          3.5mb

# Create "test" index manually using the same settings
$ curl -X PUT ${ESURL}/test/ -d '{ "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 1 }}}'
{"acknowledged":true}

# Compare "wiki" vs "test"
$ curl ${ESURL}/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size 
yellow open   test    5   1          0            0       260b           260b 
green  open   wiki    2   0        193            0      3.5mb          3.5mb

Relevant server log:

[2017-01-20 14:17:59,572][INFO ][cluster.metadata         ] [Mikey] [wiki] creating index, cause [api], templates [], shards [2]/[0], mappings [_default_]
[2017-01-20 14:17:59,860][INFO ][cluster.routing.allocation] [Mikey] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[wiki][0], [wiki][0]] ...]).
[2017-01-20 14:18:08,795][INFO ][cluster.metadata         ] [Mikey] [wiki] create_mapping [redirect]
[2017-01-20 14:18:08,922][INFO ][cluster.metadata         ] [Mikey] [wiki] create_mapping [page]
[2017-01-20 14:18:09,581][INFO ][cluster.metadata         ] [Mikey] [wiki] create_mapping [disambiguation]
[2017-01-20 14:18:16,685][INFO ][cluster.metadata         ] [Mikey] [wiki] update_mapping [disambiguation]
[2017-01-20 14:18:23,178][INFO ][cluster.metadata         ] [Mikey] [wiki] update_mapping [redirect]

fupolarbear commented 7 years ago

The problem is that the author gives a wrong (or outdated?) settings example in the README, so the settings and mappings are parsed incorrectly and have no effect on ES. If you follow the author's example, your ES index ends up misconfigured, as shown below:

curl -XGET 'http://localhost:9200/_all/_settings?pretty'
...
  "wiki" : {
    "settings" : {
      "index" : {
        "settings" : {
          "index" : {
            "analysis" : {
              "analyzer" : {
...

This nesting is wrong. Pass the index settings directly, without the outer "settings" and "index" wrappers, like this:

java -DentityExpansionLimit=2147480000 -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 -Xmx2g -jar stream2es wiki --log debug --source 'enwiki-20170401-pages-articles.xml.bz2' --settings '
{
    "number_of_shards" : 1,
    "analysis" : {
        "analyzer" : {
            "default":{
                "type" : "snowball",
                "language" : "English"
            }
        }
    }
}'

Those strange JVM opts work around a separate problem (issue 65). Hope this helps.
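
Applied to the repro script from the original report, the same idea would presumably look like the sketch below. It assumes, based on the nested _settings output above, that stream2es places the --settings value under "settings.index" itself, so only the inner keys are passed. The run helper and the RUN variable are hypothetical additions for illustration and are not part of stream2es. They print the commands by default, because --clobber replaces the target index.

```shell
# Assumption: stream2es nests the --settings value under settings.index itself,
# so the outer "settings" and "index" wrappers are dropped here.
ESURL="${ESURL:-http://localhost:9200}"
SETTINGS='{ "number_of_shards": 5, "number_of_replicas": 1 }'

# Hypothetical helper: print the command unless RUN=1 is set.
# The printed form is for display only and does not preserve quoting.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Index into "wiki" with the flat settings object.
run ./stream2es wiki --target "${ESURL}/wiki" --clobber true --settings "${SETTINGS}"

# Inspect what was actually stored for the index.
run curl "${ESURL}/wiki/_settings?pretty"
```

Run it once as is to review the commands, then again with RUN=1 to execute them. If the assumption holds, the _settings response should show number_of_shards and number_of_replicas directly under "index", with no second "settings" level.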