OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.15k stars 587 forks source link

Quiesce operation does not complete (waitForStop method) #21734

Open ocseal opened 2 years ago

ocseal commented 2 years ago

Describe the bug
When running an open-source app with Open Liberty 22.0.0.7 without cleaning the target, error code CWWKE1102W (The quiesce operation did not complete) occurs when the server is stopped.

Steps to Reproduce

  1. Clone the repository from https://github.com/ocseal/cargotracker and navigate to the project directory.
  2. Run the app using 'mvn -P openliberty liberty:dev' and wait for the web application to become available.
  3. Access the web application and route the unrouted cargo object, then select the map from the administration interface.
  4. Stop the server using CTRL-C.
  5. Restart the server using 'mvn -P openliberty liberty:dev' without cleaning the target and wait for the web application be available.
  6. Stop the server using CTRL-C and observe the error.

Expected behavior
The server should shut down without any quiesce operation failures.

Diagnostic information:

Additional context
The relevant portion of the trace.log can be found by searching for "com.ibm.ws.runtime.update.internal.PauseableComponentQuiesceListener.serverStopping(PauseableComponentQuiesceListener.java:52)".

tkburroughs commented 2 years ago

javadump.20220712.114237.59014.0001.txt

tkburroughs commented 2 years ago

The following thread in the javadump shows which pausable component is preventing all other pausable components from being called:

"Default Executor-thread-33" #144 daemon prio=5 os_prio=31 cpu=11.64ms elapsed=58.19s tid=0x00007fcfb0bb5800 nid=0x17813 in Object.wait()  [0x00007000043fd000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11.0.15/Native Method)
    - waiting on <no object reference available>
    at com.ibm.ws.http.internal.HttpChain$StopWait.waitForStop(HttpChain.java:953)
    - waiting to re-lock in wait() <0x000000060175cc58> (a com.ibm.ws.http.internal.HttpChain$StopWait)
    at com.ibm.ws.http.internal.HttpChain.stop(HttpChain.java:223)
    - locked <0x000000060175cc18> (a com.ibm.ws.http.internal.HttpChain)
    at com.ibm.ws.http.internal.HttpEndpointImpl$3.run(HttpEndpointImpl.java:187)
    - locked <0x000000060175cf18> (a com.ibm.ws.http.internal.HttpEndpointImpl$1)
    at com.ibm.ws.http.internal.HttpEndpointImpl.processHttpChainWork(HttpEndpointImpl.java:391)
    at com.ibm.ws.http.internal.HttpEndpointImpl.pause(HttpEndpointImpl.java:1070)
    at com.ibm.ws.runtime.update.internal.PauseableComponentQuiesceListener.serverStopping(PauseableComponentQuiesceListener.java:52)
    at com.ibm.ws.runtime.update.internal.RuntimeUpdateManagerImpl$3.run(RuntimeUpdateManagerImpl.java:380)
    at com.ibm.ws.threading.internal.ExecutorServiceImpl$RunnableWrapper.run(ExecutorServiceImpl.java:245)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.15/Executors.java:515)
    at java.util.concurrent.FutureTask.run(java.base@11.0.15/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.15/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.15/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.15/Thread.java:829)

   Locked ownable synchronizers:
    - <0x0000000611dcd428> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Either PauseableComponentQuiesceListener.serverStopping needs to submit all of the calls to pause on asynchronous threads, or com.ibm.ws.http.internal.HttpEndpointImpl.pause needs to do that.

This thread shows that com.ibm.ws.http.internal.HttpEndpointImpl.pause is calling com.ibm.ws.http.internal.HttpChain$StopWait.waitForStop(HttpChain.java:953) and waiting for 30 seconds before aborting.... which is also the server quiesce timeout.... thus server quiesce is already over by the time the wait aborts.

Seems that if HttpEndpointImpl.pause() is going to wait (and thus coudl be blocked), then it should spawn a thread to do that to allow other pausable components a chance to be called.

tkburroughs commented 2 years ago

@brenthdaniel Can you take a look at this and determine if it should be PausableComponent and determine the best action. Basically the call to HttpEndpointImpl.pause() hangs for 30 seconds, thus preventing other pause calls from occurring until after quiesce has completed, resulting in an error.... and of course the delay of work that could be done otherwise.

Seems either Pausable component shoudl spawn the work asynchronously or the http code should.

brenthdaniel commented 2 years ago

Hi, Pauseable components are expected to return in a reasonable amount of time. The idea is to cut off new requests coming into the system so that we can wait for existing work to complete. Someone running the 'server pause ' command would probably not expect the command to hang indefinitely.

It's not clear to me why in this scenario the chain would have problems shutting down, so maybe that's the first thing to fix here. Is the issue that some other PauseableComponent needs to run to unblock the chain? Or maybe some other component pausing is what is causing the chain to be blocked? In the latter case, running asynchronously would just move the race condition around.

We could perhaps have a design issue to see if we want to change the behavior of the quiesce to run PauseableComponents asynchronously to avoid any poorly behaving components, but I think we should understand what's going on here. In most cases, failing to close off an HTTP port is a legitimate quiesce failure.

tkburroughs commented 2 years ago

Thanks @brenthdaniel . Since the current design is that pauseable components should return in a reasonable amount of time, then I'll assign this issue to @mrsaldana to address the fact that HttpEndpointImpl.pause() is blocking and waiting for 30 seconds, which is the entire duration of server quiesce. Seems that HttpEndpointImpl.pause() should either be running HttpChain.stop() asynchronously or with a much smaller timeout, at least during server quiesce. Otherwise, the 30 second timeout defeats the purpose of the PauseableComponentQuiesceListener.

Not sure why HttpChain.stop is blocking for 30 seconds, but the side effect is that all other pauseable components that happen to be in the list after this http endpoint are not given the chance to pause until after server quiesce has completed.

tbitonti commented 6 months ago

Needs prioritization, suggested fix, and suggested assignment. There has been no meaningful activity since 25-Aug-2022.

tbitonti commented 6 months ago

Per Tracy:

When a request is made to shutdown the server, then during server quiesce, the kernel will synchronously notify the pausable components to pause. The dump showed that com.ibm.ws.http.internal.HttpEndpointImpl.pause was taking 30 seconds, thus not completing prior to server quiesce completing, preventing other pausable components from being notified.

This might be resolved by either:

  1. Change the kernel to notify the pausable components asynchronously, so they all had a chance to complete before server quiesce completed.
  2. Change com.ibm.ws.http.internal.HttpEndpointImpl.pause to not wait for 30 seconds. If it wants to wait, it should spawn a thread to do the work and return so the kernel could continue to notify the other pausable components.

Brent Danial suggested that the intent of he kernel was that calling pause was not intended to take long, so the kernel should not need to update, but instead any pausable component that was taking too long should be fixed.

tbitonti commented 6 months ago

Will review with the Transport squad for prioritization.