Invalid state error for long (and frequent) requests

p-pavel commented 3 years ago

I've written a relatively long-running APIFunction:

Export["~/Projects/webengine/http.wl", 
 APIFunction[{"a" -> "Integer"}, 
  ExportForm[
    Graphics3D[Translate[Sphere[], RandomReal[{-20, 20}, {#a, 3}]]], 
    "JPEG"] &]]

This runs ok:

GET /http.wl done in 0.1067s: 200

But if you press Ctrl/Cmd-R repeatedly in the browser (or use a load-testing tool such as ab), wolframwebengine raises invalid-state errors that effectively bring the server to its knees:

GET /http.wl done in 0.1071s: 200
GET /http.wl done in 0.1078s: 200
Start termination on kernel 
Killing kernel process: 8634
Exception in thread wolfram-kernel-1:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.1_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 599, in run
Kernel writes commands to socket: 
Kernel receives evaluated expressions from socket: 
    raise e
  File "/usr/local/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 583, in run
    self._do_evaluate(payload, future, result_update_callback)
  File "/usr/local/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 547, in _do_evaluate
Kernel process started with PID: 8645
    future.set_result(result)
  File "/usr/local/Cellar/python@3.9/3.9.1_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 525, in set_result
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: 
Kernel 8645 is ready. Startup took 1.82 seconds.
Start termination on kernel 
Killing kernel process: 8645
Exception in thread wolfram-kernel-2:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.1_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 599, in run
    raise e
  File "/usr/local/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 558, in run
    future.set_result(True)
  File "/usr/local/Cellar/python@3.9/3.9.1_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 525, in set_result
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: 
Kernel writes commands to socket: 
Kernel receives evaluated expressions from socket: 
Kernel process started with PID: 8646
Kernel 8646 is ready. Startup took 1.81 seconds.
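
The last frames of both tracebacks are in the Python standard library, so the exception itself can be reproduced without a kernel. The snippet below is a minimal sketch, not the wolframclient code: `deliver` is a hypothetical helper showing one way to tolerate a future that the caller has already cancelled.

```python
from concurrent.futures import Future, InvalidStateError


def deliver(future, result):
    """Hypothetical helper: set the result unless the caller already gave up.

    Returns True if the result was stored, False if the future was
    cancelled (or otherwise finished) in the meantime.
    """
    if future.cancelled():
        return False
    try:
        future.set_result(result)
    except InvalidStateError:
        # The future was cancelled between the check and the call.
        return False
    return True


# Simulate a request that is abandoned before the evaluation finishes.
cancelled = Future()
cancelled.cancel()

try:
    cancelled.set_result("image bytes")  # the call that fails in the traceback
    raised = False
except InvalidStateError:
    raised = True

print("set_result on a cancelled future raised:", raised)
print("guarded delivery stored the result:", deliver(cancelled, "image bytes"))
```

Whether a guard like this would be enough inside kernelcontroller.py is an open question; it only avoids the unhandled exception in the worker thread and says nothing about why the kernel process is killed.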

riccardodivirgilio commented 3 years ago

This seems to be a bug. It is easier to reproduce if you use Delayed[Pause[2]] as the expression and keep refreshing the browser. What seems to be happening is that when a request is aborted (because the browser drops the connection in the middle of a kernel evaluation), the whole session is terminated and the kernel is restarted.

riccardodivirgilio commented 3 years ago

It will be addressed. For now, I would suggest using multiple kernels. Thanks!

p-pavel commented 3 years ago

I believe it may be a bug in wolframclient, not wolframwebengine.

Multiple kernels do not help much, by the way :( Specifying --poolsize 20 actually makes things worse, as multiple kernels keep restarting.

riccardodivirgilio commented 3 years ago

Kernels will be restarting, but each request goes to a new kernel, so the user experience should be a little better because users should not experience the slowdown of aborted kernels (unless every single request is aborted).

p-pavel commented 3 years ago

> Kernels will be restarting, but each request goes to a new kernel, so the user experience should be a little better because users should not experience the slowdown of aborted kernels (unless every single request is aborted).

I see. I mean the whole server will be stuck.

Thank you for the quick reply and diagnostics!

Sorry I can't contribute a pull request; I don't know enough Python.

p-pavel commented 3 years ago

@riccardodivirgilio I wonder what the correct behaviour would look like in this case? I suppose the cancel should be forwarded to the kernel somehow, but the Wolfram client library does not seem to support anything like "preemptive" evaluation to, say, abort the computation, right?

riccardodivirgilio commented 3 years ago

Not sure what exactly is happening yet, because we still need to debug this issue. Kernels are single-threaded, so every request needs to wait for the first available kernel. From a preliminary look, the bug here seems to be that instead of aborting only the current evaluation we are aborting the whole session and quitting the kernel.
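
To make the distinction concrete, here is a rough sketch of a single-worker loop that skips a cancelled request and keeps running, using only the standard library. It is an illustration under assumptions, not the wolframclient design: `worker`, the task tuples and the lambdas standing in for kernel evaluations are all hypothetical.

```python
import queue
import threading
from concurrent.futures import Future


def worker(tasks):
    """Hypothetical single-kernel loop: one task at a time, survives cancels."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: shut down cleanly
            break
        func, future = item
        # Returns False when the caller cancelled before we started;
        # in that case skip the task instead of tearing anything down.
        if not future.set_running_or_notify_cancel():
            continue
        try:
            future.set_result(func())
        except Exception as exc:
            future.set_exception(exc)


tasks = queue.Queue()
abandoned, wanted = Future(), Future()
abandoned.cancel()  # the client went away before the evaluation started

tasks.put((lambda: "stale", abandoned))
tasks.put((lambda: "fresh", wanted))
tasks.put(None)

thread = threading.Thread(target=worker, args=(tasks,))
thread.start()
thread.join()

print(wanted.result())
```

This only covers requests that are cancelled before the evaluation starts. Once `set_running_or_notify_cancel()` has succeeded, `Future.cancel()` returns False and the evaluation still runs to completion, so interrupting a computation that is already running inside the kernel is a separate problem that this sketch does not address.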

p-pavel commented 3 years ago

> Kernels are single-threaded, so every request needs to wait for the first available kernel. From a preliminary look, the bug here seems to be that instead of aborting only the current evaluation we are aborting the whole session and quitting the kernel.

That was my question: is it possible at all to abort only the current evaluation?

p-pavel commented 2 years ago

Is there any news on this?

platomaniac commented 2 years ago

This problem occurs with the demo applications too, even if one just presses the submit button on the form pages a little too frequently. Solving this bug gracefully would be great, as web apps and APIs are inherently supposed to handle frequent requests.

riccardodivirgilio commented 2 years ago

Sorry for the late reply on this thread, but I wanted to give an official response for this ticket. Unfortunately, this bug makes the webserver unsuitable for production, and we won't be able to fix it because, in my opinion, it would need a complete rewrite (in short, we would need to move away from ZMQ-based communication and use WSTP instead). For this reason, WolframWebEngineForPython will remain a development-only server, and we will be updating the documentation to reflect that.

We strongly suggest running a different Wolfram product in a production environment:

https://www.wolfram.com/server-deployment-options/

If you are able to find a workaround for this issue, please don't hesitate to submit a pull request. I will be more than glad to review it.

Otherwise, here are some suggestions you can try that could make this webserver more usable:

Add a caching layer in front of the webserver. With something like nginx you should be able to strongly reduce the amount of work done by the kernel.
Wrap the body of every APIFunction in TimeConstrained[..., 0.5], keep all your evaluations short, and perform long-running tasks in a different manner.
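
The effect of the caching suggestion can be sketched in a few lines of Python. This is only an illustration of the idea, not an nginx configuration or anything from wolframwebengine: `TTLCache` and `render` are hypothetical names, and `render` stands in for the slow kernel evaluation behind /http.wl.

```python
import time


class TTLCache:
    """Hypothetical in-memory cache that keeps each value for ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no kernel work needed
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value


kernel_calls = []


def render(a):
    """Hypothetical stand-in for the kernel evaluation; records each call."""
    kernel_calls.append(a)
    return f"jpeg for a={a}".encode()


cache = TTLCache(ttl_seconds=5)
for _ in range(3):  # three rapid refreshes of the same URL
    body = cache.get_or_compute(("/http.wl", 10), lambda: render(10))

print(len(kernel_calls), "kernel evaluation(s) for 3 requests")
```

Note that the APIFunction at the top of this thread uses RandomReal, so a cache would serve the same image for repeated requests instead of a new random one. Caching only helps when that is acceptable.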

Thanks. Riccardo Di Virgilio Wolfram Research Inc

p-pavel commented 2 years ago

Thanks @riccardodivirgilio for the sincere reply.

This is what I was afraid of from the start of the thread.

Unfortunately, I see no way to make this work with the current ZMQ approach :( All the solutions I can think of require going deeper into the layers of abstraction, and I'm not sure that WSTP would handle it reasonably well either :(

RIP

WolframResearch / WolframWebEngineForPython

Invalid state error for long (and frequent) requests #8