elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Elasticsearch doesn't handle `SIGTERM` correctly #45236

Closed tetianakravchenko closed 1 year ago

tetianakravchenko commented 5 years ago

Elasticsearch version (`bin/elasticsearch --version`): 7.2.0, but the issue can also be reproduced on previous versions

Description of the problem including expected versus actual behavior:

Elasticsearch does not wait for already-open connections (in-flight search requests) to complete, so when SIGTERM is sent, active connections are dropped.

Expected behavior: already-open connections are allowed to complete.

Actual behavior: all active connections are dropped. If Elasticsearch is stopped while a request such as the following is executing:

```
curl -sS -XGET -o /dev/null -H 'Content-Type: application/json' http://search-client:9200/_search -d'....'
```

I'm getting:

```
curl: (52) Empty reply from server
```
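For context, `curl: (52)` means the server closed the connection without sending any response; as long as this is the shutdown behavior, a client that must survive node restarts has to treat it as a retryable failure. A minimal, hypothetical retry loop (the endpoint and query body below are placeholders, not taken from this report):

```bash
# Hypothetical client-side retry; curl exit code 52 = connection closed with no response.
# Endpoint and query body are placeholders, not taken from this report.
for attempt in 1 2 3; do
  if curl -sS -XGET -o /dev/null -H 'Content-Type: application/json' \
       http://search-client:9200/_search -d'{"query":{"match_all":{}}}'; then
    break
  fi
  echo "search attempt $attempt failed, retrying..." >&2
  sleep 1
done
```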

Steps to reproduce:

The initial report is https://github.com/elastic/helm-charts/issues/190; the reproduction steps below are taken from this comment: https://github.com/elastic/helm-charts/issues/190#issuecomment-516458091:

1. Run the following script:

```bash
#!/usr/bin/env bash

CPUS="0.1"
SLEEP_TIME="5"

set -euo pipefail

function sleep_then_stop () {
  sleep $SLEEP_TIME
  docker stop -t 60 elasticsearch
  docker logs --since 30s elasticsearch
}

function query_es () {
  time curl -XGET -o /dev/null -H 'Content-Type: application/json' http://localhost:9200/_search -d'
  { "query": { "bool": {
      "must": { "match_all": {} },
      "should": [
        { "regexp": { "ip": ".<1-92><1-5><0-7>" } },
        { "regexp": { "agent": ".oz5" } },
        { "regexp": { "referer": "..tp?://.*scom|org" } } ] } },
    "sort": [ { "utc_time": { "order": "desc" } } ],
    "size": 10000 }'
}

if docker inspect elasticsearch; then
  docker start elasticsearch
else
  docker run --name elasticsearch -d -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.2.0
fi

docker update --cpus 2 elasticsearch

until curl localhost:9200; do sleep 1; done

if ! [ -f "logs.jsonl" ]; then
  curl -O https://download.elastic.co/demos/kibana/gettingstarted/8.x/logs.jsonl.gz
  gunzip logs.jsonl.gz
fi

if ! curl localhost:9200/_cat/indices | grep logstash; then
  curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl
fi

echo "*****"
echo "Before CPU limit"
query_es
echo "*****"
echo

docker update --cpus $CPUS elasticsearch

echo ""
echo "After CPU limit"
query_es
echo ""

echo ""
echo "After CPU limit with graceful stop after $SLEEP_TIME seconds"
sleep_then_stop &
query_es
echo ""
```

2. Executing `sleep_then_stop` while `query_es` is still running causes `curl: (52) Empty reply from server` (a more direct way to trigger the same failure is sketched below).
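For reference, `docker stop` first sends SIGTERM to the container's main process and only sends SIGKILL after the timeout, so the failure can be triggered more directly. A minimal sketch, assuming the `elasticsearch` container and the `query_es` function from the script above, with a query slow enough to still be in flight when the signal arrives:

```bash
# Minimal trigger (assumes the elasticsearch container and query_es from the script above).
query_es &                                   # start a slow search in the background
sleep 1                                      # give the request time to reach the node
docker kill --signal=SIGTERM elasticsearch   # the same signal that `docker stop -t 60` sends first
wait                                         # the backgrounded curl exits with (52) Empty reply from server
```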

**Provide logs (if relevant)**:

{"type": "server", "timestamp": "2019-07-30T14:52:13,915+0000", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "[gc][20] overhead, spent [501ms] collecting in the last [1.5s]" } {"type": "server", "timestamp": "2019-07-30T14:52:25,717+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopping ..." } {"type": "server", "timestamp": "2019-07-30T14:52:26,418+0000", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "flood stage disk watermark [95%] exceeded on [WmLyycgDR32y1ULuaD1npw][40a5dff89f7f][/usr/share/elasticsearch/data/nodes/0] free: 1.6gb[2.8%], all indices on this node will be marked read-only" } {"type": "server", "timestamp": "2019-07-30T14:52:27,216+0000", "level": "INFO", "component": "o.e.x.w.WatcherService", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopping watch service, reason [shutdown initiated]" } {"type": "server", "timestamp": "2019-07-30T14:52:27,614+0000", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "[controller/101] [Main.cc@150] Ml controller exiting" } {"type": "server", "timestamp": "2019-07-30T14:52:27,813+0000", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "Native controller process has stopped - no new native processes can be started" } {"type": "server", "timestamp": "2019-07-30T14:52:30,021+0000", "level": "ERROR", "component": "i.n.u.c.D.rejectedExecution", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "Failed to submit a listener notification task. Event loop shut down?" 
, "stacktrace": ["java.util.concurrent.RejectedExecutionException: event executor terminated", "at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:846) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:329) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:322) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:757) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:766) [netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:423) [netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540) [netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533) [netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:106) [netty-common-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:89) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1020) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:805) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:762) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1031) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:310) [netty-transport-4.1.35.Final.jar:4.1.35.Final]", "at org.elasticsearch.http.netty4.Netty4HttpChannel.sendResponse(Netty4HttpChannel.java:43) [transport-netty4-client-7.2.0.jar:7.2.0]", "at org.elasticsearch.http.DefaultRestChannel.sendResponse(DefaultRestChannel.java:123) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:498) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:316) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:51) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213) [elasticsearch-7.2.0.jar:7.2.0]", "at 
org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:166) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:120) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:166) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.CountedCollector.onResult(CountedCollector.java:64) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:166) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:163) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:45) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:29) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:432) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1182) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1162) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:47) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:30) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.search.SearchService$1.doRun(SearchService.java:347) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.2.0.jar:7.2.0]", "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]", "at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]", "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]", "at java.lang.Thread.run(Thread.java:835) [?:?]"] } {"type": "server", "timestamp": "2019-07-30T14:52:31,517+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopped" } {"type": "server", "timestamp": "2019-07-30T14:52:31,518+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "closing ..." } {"type": "server", "timestamp": "2019-07-30T14:52:32,115+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "closed" }

elasticmachine commented 5 years ago

Pinging @elastic/es-core-infra

elasticmachine commented 5 years ago

Pinging @elastic/es-docs

jpountz commented 5 years ago

This was discussed in FixitFriday. We don't think that Elasticsearch should keep processing any requests after a SIGTERM has been issued, be they search, index or snapshot requests. We have some internal logic that keeps some tasks running after a SIGTERM has been issued but the goal is only to clean things up properly, not to complete ongoing requests.
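Given that stance, graceful handling of in-flight searches has to happen outside the Elasticsearch process: stop routing new requests to the node, wait out the longest expected request, and only then deliver SIGTERM. A rough sketch of such a wrapper, not from this thread, assuming a Docker deployment like the one in the reproduction and a drain step that depends on your load balancer:

```bash
# Hypothetical drain-then-stop wrapper; not part of this issue's reproduction.
DRAIN_SECONDS=30          # assumption: upper bound on in-flight search latency

# 1. Remove the node from the client-facing pool (load-balancer specific; placeholder).
# remove_from_pool elasticsearch

# 2. Let in-flight requests finish.
sleep "$DRAIN_SECONDS"

# 3. Only now deliver SIGTERM; `docker stop` sends SIGTERM, then SIGKILL after the timeout.
docker stop -t 60 elasticsearch
```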

We should better document guarantees regarding shutdown of the Elasticsearch process:

wchrisdean commented 5 years ago

[doc issue triage]

williamrandolph commented 3 years ago

It still seems worthwhile to add or improve documentation on node shutdown. Adding a paragraph to the "Stopping Elasticsearch" page looks like the way to go here, since that page already mentions using SIGTERM to stop the Elasticsearch process.

rjernst commented 1 year ago

This issue was addressed by https://github.com/elastic/elasticsearch/pull/96363