apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.4k stars 1.26k forks source link

[multistage] QueryRunnerTest tear down takes 10sec (default timeout) #11072

Open walterddr opened 1 year ago

walterddr commented 1 year ago

this very likely indicates some type of resource leak when issuing purposefully malicious queries, which is some of the test cases in this class do

need investigate on whether this is a test setup issue or an actual resource leak in main/prod code.

Aravind-Suresh commented 1 year ago

I tried running this test locally and added a few debugger breakpoints here and there - looks like the cleanup / shutdown is taking a long time.

      at org.apache.pinot.query.mailbox.channel.GrpcMailboxServer.shutdown(GrpcMailboxServer.java:63)
      at org.apache.pinot.query.mailbox.MailboxService.shutdown(MailboxService.java:118)
      at org.apache.pinot.query.runtime.QueryRunner.shutDown(QueryRunner.java:132)
      at org.apache.pinot.query.QueryServerEnclosure.shutDown(QueryServerEnclosure.java:126)
      at org.apache.pinot.query.runtime.QueryRunnerTest.tearDown(QueryRunnerTest.java:156)
      at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
      at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:566)
      at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:108)
      at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:523)
      at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:224)
      at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:146)
      at org.testng.internal.TestMethodWorker.invokeAfterClassMethods(TestMethodWorker.java:212)
      at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
      at org.testng.TestRunner.privateRun(TestRunner.java:744)
      at org.testng.TestRunner.run(TestRunner.java:602)
      at org.testng.SuiteRunner.runTest(SuiteRunner.java:380)
      at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:375)
      at org.testng.SuiteRunner.privateRun(SuiteRunner.java:340)
      at org.testng.SuiteRunner.run(SuiteRunner.java:289)
      at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
      at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
      at org.testng.TestNG.runSuitesSequentially(TestNG.java:1301)
      at org.testng.TestNG.runSuitesLocally(TestNG.java:1226)
      at org.testng.TestNG.runSuites(TestNG.java:1144)
      at org.testng.TestNG.run(TestNG.java:1115)
      at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
      at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:109)

The default awaitTermination timeout here is 10 seconds (which kinda justifies why tearDown takes a little more than 10 seconds to complete).

If we remove the timeout, it takes around 15 seconds to complete.

If we do not awaitTermination, then tearDown completes in around 200ms - but cleanup wouldn't be done in all cases.

Will debug further to see what's causing this delay inside shutdown().

walterddr commented 1 year ago

yes we should do a framegraph dump and see what's the time spent during the tearDown() process (e.g. is it on server executor or mailbox service or the scheduler async wait. once we ident we should be able to figure out which step is stuck cleaning stuff up.

The fact that it only happens to the QueryRunnerTest but not the ResourceBasedQueriesTest is also interesting to see which query actually is causing the tear down to be slow