Closed farmaazon closed 1 year ago
It's make my recent testing quite hard, as I must try several times until I get updates (and reading updates I implement).
This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?
[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)
at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1(Handler.scala:66)
at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1$adapted(Handler.scala:64)
at scala.Option.foreach(Option.scala:437)
at org.enso.interpreter.instrument.Endpoint.sendBinary(Handler.scala:64)
at org.graalvm.truffle/com.oracle.truffle.api.instrumentation.TruffleInstrument$Env$MessageTransportProxy$MessageEndpointProxy.sendBinary(TruffleInstrument.java:1090)
at org.enso.languageserver.runtime.RuntimeConnector$$anonfun$initialized$1.applyOrElse(RuntimeConnector.scala:79)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:35)
at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:14)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:213)
at org.enso.logger.akka.ActorLoggingReceive.applyOrElse(ActorLoggingReceive.scala:14)
at scala.PartialFunction$Combined.applyOrElse(PartialFunction.scala:305)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at org.enso.languageserver.runtime.RuntimeConnector.aroundReceive(RuntimeConnector.scala:20)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579)
at akka.actor.ActorCell.invoke(ActorCell.scala:547)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
And new GUI not waiting for the startup to complete?
There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.
And new GUI not waiting for the startup to complete?
There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.
It's definitely related to that exception since in both cases that failed you were getting this NPE. I will dig into the sequence in more detail.
There never was any message about "initialization completeness"
The response to session/initProtocolConnection
should indicate that the language server is ready. We may need to wait for the Truffle context initialization before sending the response.
I've been analyzing the logs again and something isn't right.
I can see 2 messages "RPC session initialized for client" for different clients in close succession.
session/initProtocolConnection
response?
- Yes, it is. There are two clients, (we have a separate service which synchronizes y-js documents with the language sever),.
OK. Then I'm not surprised of this ticket anymore. In theory, we are able to support multiple clients, especially with collaborative buffer. In practice we probably never exercised initialization enough, nor needed to, to be confident that this will not cause problems.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-04):
Progress: Found the potential race condition when two clients initialize instrumentation. Not yet sure how to fix it as I'm not so familiar with that part. It should be finished by 2023-10-05.
Next Day: Next day I will be working on the #7898 task. Come up with a fix.
This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?
[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)
Hubert has "walked me thru the code" at https://github.com/enso-org/enso/blob/f1356374e19878316d53a92b7880d3725d2f32f4/engine/runtime-instrument-common/src/main/scala/org/enso/interpreter/instrument/Handler.scala#L108 & co.
The value of commandProcessor
is assigned on line 108 and only then the message InitializedNotification
is sent. Still the field commandProcessor
is null
- this could happen because of violation of Java memory model.
Possibly, the set value of commandProcessor
hasn't yet been stored into global memory when the message receiver tries to access it. Some for of synchronization, transient field or final field is needed.
I feel like this more LS clients feature has opened a pandora's box. Here is another random failure:
[WARN] [2023-10-06T12:54:32+02:00] [enso] Execution of function main failed (Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null).
java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null
at org.enso.interpreter.instrument.IdExecutionInstrument.bind(IdExecutionInstrument.java:255)
at org.enso.interpreter.service.ExecutionService.lambda$execute$1(ExecutionService.java:170)
at java.base/java.util.Optional.map(Optional.java:260)
at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:170)
at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:219)
at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.executeProgram(ProgramExecutionSupport.scala:111)
at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$3(ProgramExecutionSupport.scala:220)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at cats.syntax.EitherObjectOps$.catchNonFatal$extension(either.scala:391)
at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$2(ProgramExecutionSupport.scala:220)
at scala.util.Either.flatMap(Either.scala:352)
at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.runProgram(ProgramExecutionSupport.scala:214)
at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:42)
at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:18)
at org.enso.interpreter.instrument.execution.JobExecutionEngine.$anonfun$runInternal$1(JobExecutionEngine.scala:113)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThread.access$001(PolyglotThread.java:53)
(once I fixed the previous problem)
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-05):
Progress: Fixed two locations where we had potential race-condition. PR is up but needs more testing. I think I also saw some random failures still with new GUI. Trying to figure out what is with the failures on CI. It should be finished by 2023-10-05.
Next Day: Next day I will be working on the #7898 task. More testing with new GUI.
Here are the relevant locks that are in the deadlock state:
Essentially the bottleneck is in the endpoint handling of messages which is synchronous.
Since it is probably not so obvious here is a sequence of events that triggers the problem. I'm pretty sure we might have experienced it occasionally with the old GUI except we never got a stracktrace and circumstances were less race-condition friendly:
PushContextCmd
sends a message to Endpoint and starts a background EnsureCompiledJob
OpenFileCmd
is started around the same time EnsureCompiledJob
acquires write lock OpenFileCmd
attempts to acquire read lock but waits on (3) to finish first to get the lockEnsureCompiledJob
in its implementation finds appropriate libraries and sends the message to endpointOpenFileCmd
which is stuck waiting on a lockIn fact, EnsureCompiledJob
might not necessarily be started by PushContextCmd
. There might be just one started in the background some time ago by one of the edits. Also, any command requesting read compilation lock can experience this problem e.g., OpenFileCmd
could be replaced by CloseFileCmd
.
Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2023-10-06):
Summary: There is 6 days delay in implementation of the Unreliable Language Server behavior in the new GUI (#7898) task. It will cause 0 days delay for the delivery of this weekly plan.
Delay Cause: Fixing one problem in the initialization with multiple clients, revealed more bugs that were lurking around.
Hubert Plociniczak reports a new STANDUP for the provided date (2023-10-06):
Progress: Discovered the problem with messages not being unstashed while waiting for initialization. But the fix revealed more transient failures (less frequent than originally reported and at a later stage). Got distracted by investigating CI problems. In the end turned out to be a regression and added a revert in #7953 PR. It should be finished by 2023-10-11.
Next Day: Next day I will be working on the #7898 task. Discover the source of new timeouts in a PR for #7898.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-09):
Progress: The source of sporadic timeouts is a rare but still possible deadlock between two types of locks (context locks and lock manager). Details are in the ticket. Still unsure how to solve this properly. Will sync with Radek the next today since he wrote the the latter. It should be finished by 2023-10-11.
Next Day: Next day I will be working on the #7898 task. Find a fix for deadlock.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-10):
Progress: Added ordering for a selected number of commands, thus eliminating a deadlock and still maintaining a fix for previous problems. It should be finished by 2023-10-11.
Next Day: Next day I will be working on the #7898 task. More testing, address PR review.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-11):
Progress: The previous fix was unnecessarily over-engineered. Re-written it based on the feedback and now the fix is much easier to maintain and review. It should be finished by 2023-10-11.
Next Day: Next day I will be working on the #7978 task. See if my fix also fixed problems with edits people ocassionally see in the old GUI.
Created a follow up ticket for suggestions' db problems https://github.com/enso-org/enso/issues/8033. Other than that I no longer see timeouts.
The new GUI sends its request faster than the old one, and it triggers issues sometimes.
To reproduce:
wip/farmaazon/create-execution-ctx
.npm install
thennpm --workspace=enso-gui2 run dev
Sometimes the execution context creation returns "request timeout" error
Sometimes the RPC methods are successful, but no update arrives. Sometimes it works properly.
Here are logs from my 3 attempts: first one was without updates, next two returned "request timeout" error.
enso-project-manager-2023-09-26.0.log