elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Logstash fails to acquire lock on PQ lockfile (reentrant?) #10572

Open yaauie opened 5 years ago

yaauie commented 5 years ago

On Logstash versions 6.2.4 and 6.5.4, I have seen reports of the Logstash Persistent Queue failing to acquire a lock on a lockfile, with the message indicating that the lock is already held in the current JVM process.

[2019-03-19T11:04:26,987][ERROR][logstash.pipeline ] Logstash failed to create queue {:pipeline_id=>"[redacted]", "exception"=>"org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: [redacted]", "backtrace"=>["org/logstash/ackedqueue/Queue.java:258...

I do not presently have sufficient steps for consistent reproduction, but am working to chase those down.

In at least one case, the user was using multiple pipelines via pipelines.yml, the issue was occurring at startup, and the pipelines did not have any pipeline-specific queue settings.

colinsurprenant commented 5 years ago

Might be related to #10715, which I will investigate.

Aqualie commented 4 years ago

Experiencing this problem with Logstash 7.5.1:

[2020-01-15T06:57:30,950][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /opt/elastic/logstash/queue/pfsense/.lock
    at org.logstash.ackedqueue.Queue.open(Queue.java:263) ~[logstash-core.jar:?]
    at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:101) ~[logstash-core.jar:?]
    at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:42) ~[logstash-core.jar:?]
    at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:39) ~[logstash-core.jar:?]
    at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:169) [logstash-core.jar:?]
    at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:813) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.6.0.jar:?]
    at opt.elastic.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/opt/elastic/logstash/logstash-core/lib/logstash/java_pipeline.rb:24) [jruby-complete-9.2.6.0.jar:?]
    at opt.elastic.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0(/opt/elastic/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:36) [jruby-complete-9.2.6.0.jar:?]
    at opt.elastic.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:325) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:136) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:77) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.runtime.Block.call(Block.java:124) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.RubyProc.call(RubyProc.java:286) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.RubyProc.call(RubyProc.java:270) [jruby-complete-9.2.6.0.jar:?]
    at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:105) [jruby-complete-9.2.6.0.jar:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-01-15T06:57:30,952][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:pfsense, :exception=>"Java::JavaLang::IllegalStateException", :message=>"org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /opt/elastic/logstash/queue/pfsense/.lock", :backtrace=>["org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:172)", "org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen)", "org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:813)", "org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/opt/elastic/logstash/logstash-core/lib/logstash/java_pipeline.rb:24)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0(/opt/elastic/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:36)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:325)", "org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:77)", "org.jruby.runtime.Block.call(Block.java:124)", "org.jruby.RubyProc.call(RubyProc.java:286)", "org.jruby.RubyProc.call(RubyProc.java:270)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:105)", "java.base/java.lang.Thread.run(Thread.java:835)"]}
[2020-01-15T06:57:30,977][INFO ][logstash.javapipeline    ] Pipeline started {"pipeline.id"=>".monitoring-logstash"}
[2020-01-15T06:57:31,100][FATAL][logstash.runner          ] An unexpected error occurred! {:error=>#<LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<vmware>`>, :backtrace=>["org/logstash/execution/ConvergeResultExt.java:109:in `create'", "org/logstash/execution/ConvergeResultExt.java:37:in `add'", "/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:338:in `block in converge_state'"]}
[2020-01-15T06:57:31,243][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

I tried stopping the Logstash process, deleting the .lock file manually, and starting Logstash again, but that did not help; I am still seeing the error message. Running bin/pqcheck against the directory:

bin/pqcheck queue/pfsense
Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Checking queue dir: queue/pfsense
checkpoint.head, fully-acked: YES, page.465 size: 67108864
pageNum=465, firstUnackedPageNum=465, firstUnackedSeqNum=65672857, minSeqNum=65601460, elementCount=71397, isFullyAcked=yes

My queue/beats pipeline failed for whatever reason roughly 11 hours earlier and started queuing everything to disk; the Elasticsearch cluster health is GREEN. I am still looking into that pipeline issue.

colinsurprenant commented 4 years ago

@Aqualie this is likely not a PQ issue as such, but rather the result of a pipeline exception/error that does not correctly close the PQ (and thus does not release the lock); when the failed pipeline is restarted, it then complains about failing to acquire the lock.

Can you check the logs prior to this problem to see if you can trace any errors that might help explain why the lock was not released? (Look for a pipeline reload failure, a pipeline error/exception, or any warning/error logs for that matter.)

Also, I have seen some cases where this problem occurred because of permission problems on the PQ data files. Can you look for IO/file/permission errors? Also double-check that the file permissions in the PQ data dir are OK.
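To make the failure sequence described above concrete, here is a minimal Java sketch. It is illustrative only: the QueueOpenSketch class is hypothetical and uses plain java.nio file locks rather than Logstash's actual queue code. The point is that if an error is thrown after the .lock file has been locked, and the error path does not release the lock, the lock stays held by the JVM and any retry of the open in the same process fails until the process is restarted.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustrative sketch only -- not Logstash's actual lock handling.
    final class QueueOpenSketch {

        // Locks the .lock file, then optionally simulates a failure while
        // opening the queue (e.g. a permissions or disk-space error).
        static FileLock openQueue(Path lockPath, boolean simulateFailure) throws IOException {
            FileChannel channel = FileChannel.open(
                    lockPath, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = channel.tryLock();   // null => lock held by another process
            if (lock == null) {
                channel.close();
                throw new IOException("Lock held by another program: " + lockPath);
            }
            try {
                if (simulateFailure) {
                    throw new IOException("simulated failure while opening queue pages");
                }
                return lock;  // success: the caller releases the lock when the queue is closed
            } catch (IOException e) {
                // If this cleanup is skipped, the lock stays held by this JVM and a
                // second openQueue() call in the same process throws
                // OverlappingFileLockException ("Lock held by this virtual machine").
                lock.release();
                channel.close();
                throw e;
            }
        }
    }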

colinsurprenant commented 4 years ago

There is a related issue specific to the permission problem: #10715.

Harsimran1702 commented 4 years ago

Facing this error on 7.6.2:

[2020-05-26T21:57:29,449][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by another program on lock path: /app/queue/main/.lock
        at org.logstash.ackedqueue.Queue.open(Queue.java:263) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:100) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:42) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:39) ~[logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:176) [logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:831) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:28) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:107) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:140) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:386) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:184) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyClass.newInstance(RubyClass.java:894) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrOneOrNBlock.call(JavaMethod.java:349) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:375) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:174) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:316) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:143) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:185) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.BlockBody.yield(BlockBody.java:116) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.Block.yield(Block.java:170) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyArray.each(RubyArray.java:1814) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:555) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:332) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:86) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:93) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:537) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:361) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:86) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:73) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb:67) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0$__VARARGS__(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuper(IRRuntimeHelpers.java:1169) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuperSplatArgs(IRRuntimeHelpers.java:1156) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InstanceSuperInvokeSite.invoke(InstanceSuperInvokeSite.java:39) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.runner.RUBY$method$run$0(/usr/share/logstash/logstash-core/lib/logstash/runner.rb:242) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.runner.RUBY$method$run$0$__VARARGS__(/usr/share/logstash/logstash-core/lib/logstash/runner.rb) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb:132) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.lib.bootstrap.environment.RUBY$script(/usr/share/logstash/lib/bootstrap/environment.rb:73) [jruby-complete-9.2.9.0.jar:?]
        at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) [?:?]
        at org.jruby.ir.Compiler$1.load(Compiler.java:89) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runScript(Ruby.java:857) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runNormally(Ruby.java:780) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runNormally(Ruby.java:798) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runFromMain(Ruby.java:610) [jruby-complete-9.2.9.0.jar:?]
        at org.logstash.Logstash.run(Logstash.java:113) [logstash-core.jar:?]
        at org.logstash.Logstash.main(Logstash.java:47) [logstash-core.jar:?]
[2020-05-26T21:57:29,492][FATAL][logstash.runner          ] The given configuration is invalid. Reason: org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by another program on lock path: /app/queue/main/.lock
[2020-05-26T21:57:29,497][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

This is observed when running a configuration check while the Logstash process is already running:

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit

rahuldeyfb33 commented 2 years ago

Does somebody have a solution to this error? I am getting the same error: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /usr/share/logstash/queue/main/.lock

I am running dockerized Logstash, version 7.4.2.

yaauie commented 2 years ago

There are two ways that the PQ can fail to acquire a lock (a minimal sketch contrasting the two cases follows this list):

  1. A single Logstash is trying to obtain a lock that it already holds; the error message will include:

    cause: Lock held by this virtual machine

    These cases indicate that a single Logstash process acquired the lock on the queue, encountered some error and failed to fully release the lock on the queue, and is now attempting to re-acquire the lock.

    Most cases that I have seen stem from insufficient disk capacity when the pipeline is first starting (logs contain Not enough free disk space available to allocate persisted queue with further details). I have seen other reports that permissions issues in the queue directory can cause a partially-acquired lock to enter this state.

    In any case, it is only after encountering the first issue and making a second attempt to acquire the lock that we fail to do so. Internally, Logstash is in a state where it cannot recover and must be restarted.

    These cases can be resolved by shutting down the Logstash process, resolving the causing issue (such as ensuring that there is sufficient free disk space on the volume holding the queue, or fixing permissions), and starting Logstash up again. The other logs prior to this message will be instrumental in determining the root cause.

  2. Multiple Logstash processes are trying to obtain a lock on a single queue; the error message will include:

    cause: Lock held by another program

    These cases are caused by one Logstash process attempting to acquire a lock on the queue when another Logstash process already holds it. Only one Logstash instance can safely access any particular queue at any given time.

    This case can be resolved by stopping the other Logstash process before starting a new Logstash process directed at the same queue.
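For anyone trying to tell the two cases apart, the following small standalone probe shows how the two messages map onto plain java.nio file-lock semantics: FileChannel.tryLock() returns null when another process holds the lock, and throws OverlappingFileLockException when the lock is already held inside the same JVM. The PqLockProbe class is hypothetical and not part of Logstash; it is only a sketch of the underlying locking behaviour.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.channels.OverlappingFileLockException;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical diagnostic probe -- run as a separate process against a PQ .lock path.
    public class PqLockProbe {
        public static void main(String[] args) throws IOException {
            Path lockPath = Path.of(args[0]);   // e.g. .../queue/<pipeline_id>/.lock
            try (FileChannel channel = FileChannel.open(
                    lockPath, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                try {
                    FileLock lock = channel.tryLock();
                    if (lock == null) {
                        // Held by a different process: case 2, "Lock held by another program".
                        System.out.println("lock held by another process");
                    } else {
                        // Nothing holds the lock; release it again immediately.
                        lock.release();
                        System.out.println("lock is free");
                    }
                } catch (OverlappingFileLockException e) {
                    // Already held inside this same JVM: case 1,
                    // "Lock held by this virtual machine".
                    System.out.println("lock already held inside this JVM");
                }
            }
        }
    }

Run as a separate process, the probe can only hit the first two outcomes; the OverlappingFileLockException branch is what a single JVM, such as a Logstash instance retrying a failed pipeline, runs into.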

rahuldeyfb33 commented 2 years ago

@yaauie Thanks for your reply. To elaborate further: I am running a dockerized Logstash to ingest logs from different sources, parse/filter them, and then store them in Elasticsearch. The Docker container runs in a Linux virtual machine. Recently all the Logstash pipelines have been failing continuously with this error:

bidos-logstash_1 | warning: thread "Converge PipelineAction::Create<webscada_stream.conf>" terminated with exception (report_on_exception is true):
bidos-logstash_1 | LogStash::Error: Don't know how to handle Java::JavaLang::IllegalStateException for PipelineAction::Create<webscada_stream.conf>
bidos-logstash_1 | create at org/logstash/execution/ConvergeResultExt.java:109
bidos-logstash_1 | add at org/logstash/execution/ConvergeResultExt.java:37
bidos-logstash_1 | converge_state at /usr/share/logstash/logstash-core/lib/logstash/agent.rb:339
bidos-logstash_1 | [2021-09-02T12:11:01,107][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
bidos-logstash_1 | org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /usr/share/logstash/data/queue/webscada_stream.conf/.lock

"usr/share/logstash/data/queue" is not a directory present in the virtual machine. It is only present inside the docker container. So we cannot change the permissions or remove files from this directory.

The Docker image layers look like this:

groupadd --gid 1000 logstash && adduser --uid 1000 --gid 1000 --home-dir /usr/share/logstash --no-create-home logstash

curl -Lo - http://localhost:8000/logstash-7.4.2.tar.gz | tar zxf - -C /usr/share && mv /usr/share/logstash-7.4.2 /usr/share/logstash && chown --recursive logstash:logstash /usr/share/logstash/ && chown -R logstash:root /usr/share/logstash && chmod -R g=u /usr/share/logstash && find /usr/share/logstash -type d -exec chmod g+s {} ; && ln -s /usr/share/logstash /opt/logstash

WORKDIR /usr/share/logstash


I have deleted all the Docker images, containers, and volumes using sudo docker system prune --all and then pulled the Logstash Docker images again. After rebuilding the Docker images and starting the containers, this lock error still persists.

I read in a thread that this might happen due to an orphaned process running in the virtual machine, so I rebooted the virtual machine to get rid of any orphaned processes, but the error still persists. Can somebody explain what is going wrong here? We have been trying to solve this error for the last 10 days without success.

rahuldeyfb33 commented 2 years ago

More details: I am running multiple pipelines, which are defined in pipelines.yml, and the queue type is set to "persisted".

sasikiranvaddi commented 1 month ago

We observed a LockException; please find more details in the topic below. Is there any fix available or work in progress for the issue mentioned?

breml commented 2 weeks ago

@yaauie For some time now we have observed that, roughly once a week, our Logstash processing comes to a complete stop and we have to restart our Logstash instances. We have tried to make sense of the logs, but so far we have not found the root cause of the issue. That said, in some cases we observed the following log entries around the time the blocking happened:

[2024-07-01T07:58:01,454][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /apps/elk/logstash/queue/process_log_formats_k8s/.lock
        at org.logstash.ackedqueue.Queue.open(Queue.java:176) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:127) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:64) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:84) ~[logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:257) [logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:456) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:195) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:350) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:128) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:115) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90) [jruby.jar:?]
        at org.jruby.RubyClass.newInstance(RubyClass.java:931) [jruby.jar:?]
        at org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90) [jruby.jar:?]
        at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:548) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:367) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:88) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:228) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:291) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:328) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116) [jruby.jar:?]
        at org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136) [jruby.jar:?]
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:66) [jruby.jar:?]
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58) [jruby.jar:?]
        at org.jruby.runtime.Block.call(Block.java:144) [jruby.jar:?]
        at org.jruby.RubyProc.call(RubyProc.java:352) [jruby.jar:?]
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:111) [jruby.jar:?]
        at java.lang.Thread.run(Thread.java:840) [?:?]

I looked into these cases a little more and found the following log entries, which seem to mark the start of the locking issue:

[2024-07-01T07:57:21,365][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:process_log_formats_k8s, :exception=>"KeyError", :message=>"key not found", :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:325:in `raise_fetch_no_key'", "/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:186:in `fetch'", "/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:202:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:85:in `block in fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in `block in fetch_or_store'", "org/jruby/ext/thread/Mutex.java:171:in `synchronize'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in `register?'", "org/logstash/execution/AbstractPipelineExt.java:540:in `initialize_flow_metrics'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in `start'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in `block in execute'", "/usr/share/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:386:in `block in converge_state'"]}
[2024-06-14T06:59:47,501][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:process_log_formats_csconnector, :exception=>"LocalJumpError", :message=>"yield called out of block", :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:84:in `block in fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in `block in fetch_or_store'", "org/jruby/ext/thread/Mutex.java:171:in `synchronize'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in `register?'", "org/logstash/execution/AbstractPipelineExt.java:540:in `initialize_flow_metrics'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in `start'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in `block in execute'", "/usr/share/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in`execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:386:in `block in converge_state'"]}
[2023-11-21T01:56:18,955][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:heartbeat, :exception=>"Java::JavaLang::NullPointerException", :message=>"Cannot invoke \"org.logstash.instrument.metrics.Metric.getValue()\" because \"this.numeratorMetric\" is null", :backtrace=>["org.logstash.instrument.metrics.BaseFlowMetric.doCapture(BaseFlowMetric.java:82)", "org.logstash.instrument.metrics.BaseFlowMetric.<init>(BaseFlowMetric.java:66)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:70)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:80)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:62)", "org.logstash.instrument.metrics.FlowMetric.create(FlowMetric.java:39)", "org.logstash.execution.AbstractPipelineExt.createFlowMetric(AbstractPipelineExt.java:553)", "org.logstash.execution.AbstractPipelineExt.initializePluginThroughputFlowMetric(AbstractPipelineExt.java:650)", "org.logstash.execution.AbstractPipelineExt.initializePluginFlowMetrics(AbstractPipelineExt.java:631)", "org.logstash.execution.AbstractPipelineExt.initializeFlowMetrics(AbstractPipelineExt.java:540)", "org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$initializeFlowMetrics.call(AbstractPipelineExt$INVOKER$i$0$0$initializeFlowMetrics.gen)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:351)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:144)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:345)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:80)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:164)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:151)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:351)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:144)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:345)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)", "org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:76)", "org.jruby.runtime.Block.yieldSpecific(Block.java:157)", "org.jruby.ir.runtime.IRRuntimeHelpers.yieldSpecific(IRRuntimeHelpers.java:498)", "org.jruby.ir.instructions.YieldInstr.interpret(YieldInstr.java:76)", "org.jruby.ir.interpreter.StartupInterpreterEngine.processOtherOp(StartupInterpreterEngine.java:176)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:104)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:128)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:115)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:85)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:94)", "org.jruby.ir.instructions.CallBase.interpret(CallBase.java:546)", 
"org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:361)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:92)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:226)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:393)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:206)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:325)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)", "org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:66)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)", "org.jruby.runtime.Block.call(Block.java:143)", "org.jruby.RubyProc.call(RubyProc.java:309)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:107)", "java.base/java.lang.Thread.run(Thread.java:833)"]}

It does not always affect the same pipeline, but the common denominator is that there was some sort of issue while initially creating the pipeline (LogStash::PipelineAction::Create). It also looks like the reason the pipeline creation fails is not always the same.

In my case, it is always Lock held by this virtual machine on lock path, and the reason has never been a full disk.

As of now, we use Logstash 8.12.1 on Linux (RPM installation, no containers). That said, we already observed the issue with Logstash 8.7.x and performed the update to 8.12.1 in the hope that it would resolve the issue, but unfortunately that was not the case.

Our Logstash configuration makes use of multiple pipelines as well as pipeline-to-pipeline communication. In total there are 14 pipelines, and most of them are part of a bigger processing system where we use both the distributor and the collector pattern.

I hope the information provided helps in debugging this issue.

breml commented 2 weeks ago

I made an additional observation regarding the log entries I mentioned in the previous post.

When Logstash is in the above-mentioned error state, where a pipeline cannot be created because the lock is not available (Logstash failed to create queue. org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine), recreation of the pipeline is retried roughly 40 times per minute. What I have now observed is that a new instance of the "plain" codec is created on every failed retry. I can see this because I have a cronjob that saves the result of http://localhost:9600/_node/stats/pipelines?pretty to disk every minute, and there I can see that the array in .pipelines.<pipeline_name>.plugins.codecs grows by about 40 entries per minute while Logstash is in this failure state. Once the number of codecs reaches somewhere around 4200, Logstash stops processing completely.

I created the cronjob to get better insight into what happens before Logstash stops processing events.