Unreliable Language Server behavior in the new GUI

farmaazon commented 1 year ago

The new GUI sends its request faster than the old one, and it triggers issues sometimes.

To reproduce:

Switch to wip/farmaazon/create-execution-ctx.
run npm install then npm --workspace=enso-gui2 run dev
run project manager (I tested on 09-26 nightly)
open chrome and go to http://localhost:5173/
open any project and check 127.0.0.1 connection in network tab

Sometimes the execution context creation returns "request timeout" error

{"jsonrpc":"2.0","id":"0","method":"session/initProtocolConnection","params":{"clientId":"6fbfded4-8884-45c5-849a-4694604ca355"}}   129 
13:48:13.194
{"jsonrpc":"2.0","id":"0","result":{"contentRoots":[{"type":"Project","id":"57592eda-e01f-493b-a864-e48aaa03a5fe"},{"type":"Home","id":"5e74e8d2-f18d-4e5a-acee-30b8f7567225"},{"type":"FileSystemRoot","id":"1d24271b-7eb7-4e66-b9b3-fb948f538a6c","path":"/"}]}}  258 
13:48:15.316
{"jsonrpc":"2.0","id":"1","method":"executionContext/create","params":{"contextId":"11cb9d39-d85e-49de-b401-c81624ca6f0b"}} 123 
13:48:15.316
{"jsonrpc":"2.0","id":"1","error":{"code":11,"message":"Request timeout"}}

Sometimes the RPC methods are successful, but no update arrives. Sometimes it works properly.

Here are logs from my 3 attempts: first one was without updates, next two returned "request timeout" error.

enso-project-manager-2023-09-26.0.log

farmaazon commented 1 year ago

It's make my recent testing quite hard, as I must try several times until I get updates (and reading updates I implement).

hubertp commented 1 year ago

This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?

[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
    at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)
    at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1(Handler.scala:66)
    at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1$adapted(Handler.scala:64)
    at scala.Option.foreach(Option.scala:437)
    at org.enso.interpreter.instrument.Endpoint.sendBinary(Handler.scala:64)
    at org.graalvm.truffle/com.oracle.truffle.api.instrumentation.TruffleInstrument$Env$MessageTransportProxy$MessageEndpointProxy.sendBinary(TruffleInstrument.java:1090)
    at org.enso.languageserver.runtime.RuntimeConnector$$anonfun$initialized$1.applyOrElse(RuntimeConnector.scala:79)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
    at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:35)
    at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:14)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:213)
    at org.enso.logger.akka.ActorLoggingReceive.applyOrElse(ActorLoggingReceive.scala:14)
    at scala.PartialFunction$Combined.applyOrElse(PartialFunction.scala:305)
    at akka.actor.Actor.aroundReceive(Actor.scala:537)
    at akka.actor.Actor.aroundReceive$(Actor.scala:535)
    at org.enso.languageserver.runtime.RuntimeConnector.aroundReceive(RuntimeConnector.scala:20)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579)
    at akka.actor.ActorCell.invoke(ActorCell.scala:547)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
    at akka.dispatch.Mailbox.run(Mailbox.scala:231)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

farmaazon commented 1 year ago

And new GUI not waiting for the startup to complete?

There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.

hubertp commented 1 year ago

And new GUI not waiting for the startup to complete?

There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.

It's definitely related to that exception since in both cases that failed you were getting this NPE. I will dig into the sequence in more detail.

4e6 commented 1 year ago

There never was any message about "initialization completeness"

The response to session/initProtocolConnection should indicate that the language server is ready. We may need to wait for the Truffle context initialization before sending the response.

hubertp commented 1 year ago

I've been analyzing the logs again and something isn't right.

I can see 2 messages "RPC session initialized for client" for different clients in close succession.

Is that normal for the new IDE?
Isn't the new IDE trying to proceed with regular messages (open file etc) for the second client while only the first one finished the initialization and received the session/initProtocolConnection response?

farmaazon commented 1 year ago

Yes, it is. There are two clients, (we have a separate service which synchronizes y-js documents with the language sever),.
I doubt, they are separate processes. You see the copied network log from browser in the issue's description - there is a clear response to initProtocol before proceeding to creating execution context.

hubertp commented 1 year ago

Yes, it is. There are two clients, (we have a separate service which synchronizes y-js documents with the language sever),.

OK. Then I'm not surprised of this ticket anymore. In theory, we are able to support multiple clients, especially with collaborative buffer. In practice we probably never exercised initialization enough, nor needed to, to be confident that this will not cause problems.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-04):

Progress: Found the potential race condition when two clients initialize instrumentation. Not yet sure how to fix it as I'm not so familiar with that part. It should be finished by 2023-10-05.

Next Day: Next day I will be working on the #7898 task. Come up with a fix.

JaroslavTulach commented 1 year ago

This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?

[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
  at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)

Hubert has "walked me thru the code" at https://github.com/enso-org/enso/blob/f1356374e19878316d53a92b7880d3725d2f32f4/engine/runtime-instrument-common/src/main/scala/org/enso/interpreter/instrument/Handler.scala#L108 & co.

The value of commandProcessor is assigned on line 108 and only then the message InitializedNotification is sent. Still the field commandProcessor is null - this could happen because of violation of Java memory model.

Possibly, the set value of commandProcessor hasn't yet been stored into global memory when the message receiver tries to access it. Some for of synchronization, transient field or final field is needed.

hubertp commented 1 year ago

I feel like this more LS clients feature has opened a pandora's box. Here is another random failure:

[WARN] [2023-10-06T12:54:32+02:00] [enso] Execution of function main failed (Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null).
java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null
    at org.enso.interpreter.instrument.IdExecutionInstrument.bind(IdExecutionInstrument.java:255)
    at org.enso.interpreter.service.ExecutionService.lambda$execute$1(ExecutionService.java:170)
    at java.base/java.util.Optional.map(Optional.java:260)
    at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:170)
    at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:219)
    at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.executeProgram(ProgramExecutionSupport.scala:111)
    at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$3(ProgramExecutionSupport.scala:220)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at cats.syntax.EitherObjectOps$.catchNonFatal$extension(either.scala:391)
    at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$2(ProgramExecutionSupport.scala:220)
    at scala.util.Either.flatMap(Either.scala:352)
    at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.runProgram(ProgramExecutionSupport.scala:214)
    at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:42)
    at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:18)
    at org.enso.interpreter.instrument.execution.JobExecutionEngine.$anonfun$runInternal$1(JobExecutionEngine.scala:113)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
    at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThread.access$001(PolyglotThread.java:53)

(once I fixed the previous problem)

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-05):

Progress: Fixed two locations where we had potential race-condition. PR is up but needs more testing. I think I also saw some random failures still with new GUI. Trying to figure out what is with the failures on CI. It should be finished by 2023-10-05.

Next Day: Next day I will be working on the #7898 task. More testing with new GUI.

hubertp commented 1 year ago

Here are the relevant locks that are in the deadlock state:

2 compilation read locks
1 ensure compiled job that is holding a compilation write lock

Essentially the bottleneck is in the endpoint handling of messages which is synchronous.

deadlock.log

hubertp commented 1 year ago

Since it is probably not so obvious here is a sequence of events that triggers the problem. I'm pretty sure we might have experienced it occasionally with the old GUI except we never got a stracktrace and circumstances were less race-condition friendly:

PushContextCmd sends a message to Endpoint and starts a background EnsureCompiledJob
OpenFileCmd is started around the same time
EnsureCompiledJob acquires write lock
OpenFileCmd attempts to acquire read lock but waits on (3) to finish first to get the lock
EnsureCompiledJob in its implementation finds appropriate libraries and sends the message to endpoint
(5) waits infinitely for a reply for its message from endpoint
The reply will never arrive because endpoint is busy serving the OpenFileCmd which is stuck waiting on a lock
Deadlock

In fact, EnsureCompiledJob might not necessarily be started by PushContextCmd. There might be just one started in the background some time ago by one of the edits. Also, any command requesting read compilation lock can experience this problem e.g., OpenFileCmd could be replaced by CloseFileCmd.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2023-10-06):

Summary: There is 6 days delay in implementation of the Unreliable Language Server behavior in the new GUI (#7898) task. It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Fixing one problem in the initialization with multiple clients, revealed more bugs that were lurking around.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for the provided date (2023-10-06):

Progress: Discovered the problem with messages not being unstashed while waiting for initialization. But the fix revealed more transient failures (less frequent than originally reported and at a later stage). Got distracted by investigating CI problems. In the end turned out to be a regression and added a revert in #7953 PR. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. Discover the source of new timeouts in a PR for #7898.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-09):

Progress: The source of sporadic timeouts is a rare but still possible deadlock between two types of locks (context locks and lock manager). Details are in the ticket. Still unsure how to solve this properly. Will sync with Radek the next today since he wrote the the latter. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. Find a fix for deadlock.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-10):

Progress: Added ordering for a selected number of commands, thus eliminating a deadlock and still maintaining a fix for previous problems. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. More testing, address PR review.

enso-bot[bot] commented 1 year ago

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-11):

Progress: The previous fix was unnecessarily over-engineered. Re-written it based on the feedback and now the fix is much easier to maintain and review. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7978 task. See if my fix also fixed problems with edits people ocassionally see in the old GUI.

hubertp commented 1 year ago

Created a follow up ticket for suggestions' db problems https://github.com/enso-org/enso/issues/8033. Other than that I no longer see timeouts.

enso-org / enso

Unreliable Language Server behavior in the new GUI #7898