Open PSiAU opened 7 years ago
If Logstash is not shutting down (the default stop timeout in our sysv init script is 5 seconds), this generally indicates Logstash is still busy with work and will terminate only after that work is done.
This is normal, especially if something downstream (elasticsearch, or any other output) is slow or down.
If you wish to kill the logstash process, accepting that you will lose data because of this kill proceedure, you can set KILL_ON_STOP_TIMEOUT=1 in your environment before telling Logstash to stop with the init script.
Ok. I can certainly increase the timeout myself, but in the interest of catering to a wider variety of environments, would it make sense to increase this timeout to 10 seconds? 10 works just fine in my setup.
for i in 1 2 3 4 5 6 7 8 9 10; do
well, I don't know if there's a "good" timeout, and we don't have data to determine what a good timeout would be.
If logstash can't terminate within 5 seconds, it's likely stuck waiting for some external service to respond. I don't have data, but my assumption (without data) is that 5 seconds vs 10 seconds isn't going to show any meaningful difference in shutdown timeout behavior.
If we had data from a wider variety of systems that showed how long Logstash would take to shutdown, we could use that data to determine a good default value.
Can you try 10 seconds in your environment and report back? (Ideally reporting after having made several observations, not just one, but any data is better than none)
Alternately, would you prefer a "wait forever" option?
Logstash takes about 9 - 10 seconds to shutdown for me. I don't think waiting forever is a good idea though, so perhaps set it to 15 seconds to allow for a little more time.
@PSiAU that seems long. I'd like to understand why it takes so long to terminate.
This is all I see in my logs during shutdown:
[2017-04-21T16:27:30,341][WARN ][logstash.runner ] SIGTERM received. Shutting down the agent.
[2017-04-21T16:27:30,350][WARN ][logstash.agent ] stopping pipeline {:id=>"main"}
[2017-04-21T16:27:38,976][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{}}
[2017-04-21T16:27:38,977][ERROR][logstash.shutdownwatcher ] The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.
Can you run with --log.level=debug
?
And can you also attach your logstash configuration? (logstash.yml and any other configs)
Ok files attached. Debug output and my configs.
I'm seeing the same issue on CentOS 7 (LS 1.5.4-1), on an idle test machine. A while kill $pid; do sleep 1; done
hasn't been able to kill it for several minutes. Sample strace
:
futex(0x7f0137eb99d0, FUTEX_WAIT, 1901, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=14664, si_uid=0} ---
futex(0x7f01395c6b00, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn() = 202
futex(0x7f0137eb99d0, FUTEX_WAIT, 1901, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=14664, si_uid=0} ---
futex(0x7f01395c6b00, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn() = 202
After stopping the while
loop strace
shows absolutely no activity. The configuration is otherwise stable and has been used unchanged in production for months.
This is happening to me too when I'm trying to shut down a container in which both elasticsearch and logstash are running. It hangs until systemd kills it.
I suspect that it is because of this what @jordansissel said:
especially if something downstream (elasticsearch [...]) is [...] down.
I suspect that elasticsearch behaves well and shuts down at sigterm immediately, but then logstash still tries to talk to it, and hangs.
Is that possible?
@jordansissel Are you able to increase the timeout to 10 seconds for the moment?
I didn't want to open a new issue for this, since there is already an open one here.
I have the same experience in Logstash 7.2.0, in that if logstash is busy, or stuck doing something, and I send a SIGTERM to it, either via service, or systemctl, they both just hang for 30+ seconds. Even attempting pkill logstash doesn't work.
I can recreate this easily if I shutdown Elasticsearch first, and Logstash continues to try and reconnect to the "dead es instance".
systemctl stop elasticsearch systemctl stop logstash hangs
logstash-plain.log fills with:
[2020-02-28T20:42:59,968][WARN ][logstash.outputs.elasticsearch][main] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://elastic:xxxxxx@localhost:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://elastic:xxxxxx@localhost:9200/][Manticore::SocketException] Connection refused (Connection refused)"}
Eventually it will shut down, I've had it take over 2 minutes before. Possibly some condition in logstash which is not accepting the sigterm because it is doing other tasks?
Anyway, i've learned to live with it, and just shut logstash down first. Just wanted to throw in my two cents.
This is happening to me too when I'm trying to shut down a container in which both elasticsearch and logstash are running. It hangs until systemd kills it.
I suspect that it is because of this what @jordansissel said:
especially if something downstream (elasticsearch [...]) is [...] down.
I suspect that elasticsearch behaves well and shuts down at sigterm immediately, but then logstash still tries to talk to it, and hangs.
Is that possible?
I had added...
[Unit]
After=elasticsearch.service
...to /etc/systemd/system/logstash.service.d/logstash.conf. Which in theory should force ElasticSearch to await Logstash to such down, and I still intermittently see this issue.
I've since added...
[Service]
TimeoutStopSec=300
...to hopefully force shutdown, as on Ubuntu 20.04 things hang up for a significantly long time >50m for me. The improper shutdown that will cause isn't ideal though.
I've been facing this issue periodically for months currently with Logstash 7.10 running on Debian buster. At the moment I've been waiting for Logstash to stop for 45 minutes after i ran systemctl stop logstash
. I intrerrupted the command and tried to reboot the machine but it doesn't seem to come back up again. Then I issued a reboot from the EC2 console but not until I choose the "Stop instance" action it stopped after some more minutes.
Stopping logstash via "service logstash stop" times out with the error "logstash stop failed; still running." after only a few seconds. It appears Logstash is taking longer to shutdown than what the shutdown scripts wait for.