apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Improvement] Support to show more engine submission failure error for issue tracking #5304

Open beryllw opened 1 year ago

beryllw commented 1 year ago

What would you like to be improved?

Currently, Kyuubi recognizes exceptions raised while starting the engine and returns only the last error. However, the last exception is sometimes not the root cause of the start-up failure, and we usually need several exceptions to determine why the engine failed to launch. Is it possible to return the last several exceptions to better support issue tracking when the engine fails to start?

This is an error log from a Spark engine start-up. Kyuubi recognized the exception but was unable to capture the root cause of the engine's failure to start.

Caused by: java.lang.RuntimeException: org.apache.kyuubi.KyuubiSQLException:Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
 See more: /xxxxxxxxx/kyuubi-spark-sql-engine.log.52
    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
    at org.apache.kyuubi.engine.ProcBuilder.$anonfun$start$1(ProcBuilder.scala:142)
    ... 1 more

The log containing the root cause of the failure is as follows:

23/09/18 09:21:57 INFO SparkContext: Successfully stopped SparkContext
23/09/18 09:21:57 ERROR SparkSQLEngine: Failed to instantiate SparkSession: requirement failed: initial executor number 501 must between min executor number 0 and max executor number 500
java.lang.IllegalArgumentException: requirement failed: initial executor number 501 must between min executor number 0 and max executor number 500

How should we improve?

We could improve issue tracking by returning multiple exceptions by default and including the appId in the log entries.
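As a rough illustration of "return the last several exceptions", here is a minimal Python sketch. It is not Kyuubi's implementation. It assumes the engine log is available as an iterable of lines and that an error-like line is either logged at ERROR level or starts with a Java exception class name. The names `last_errors` and `max_errors` are hypothetical.

```python
import re
from collections import deque

# Assumption: a line is "error-like" if it is logged at ERROR level, or if it
# starts with a Java exception/error class name (optionally prefixed by
# "Caused by: " or 'Exception in thread "..." ').
ERROR_LINE = re.compile(
    r'(\sERROR\s)'
    r'|(^(Caused by: |Exception in thread "[^"]*" )?[\w.$]+(Exception|Error)\b)'
)

def last_errors(lines, max_errors=3):
    """Return up to `max_errors` of the most recent error-like lines, oldest first."""
    recent = deque(maxlen=max_errors)  # older entries are dropped automatically
    for line in lines:
        if ERROR_LINE.search(line):
            recent.append(line.rstrip("\n"))
    return list(recent)
```

Keeping a small bounded window rather than a single line means a wrapper such as `UndeclaredThrowableException` no longer hides an earlier, more specific error in the same log.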

github-actions[bot] commented 1 year ago

Hello @Kwafoor, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.

pan3793 commented 1 year ago

Specific to engine bootstrap, have you tried to use Kyuubi Beeline? It can retrieve all logs during the engine bootstrap phase.

beryllw commented 1 year ago

> Specific to engine bootstrap, have you tried to use Kyuubi Beeline? It can retrieve all logs during the engine bootstrap phase.

Is there a specific setting that needs to be added? With the default settings, the reported error is essentially the same when I reproduce the same failure using Beeline.

Error opening session for xxx client ip xxxx, due to org.apache.kyuubi.KyuubiSQLException: Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
 See more: /xxxxxxx/kyuubi-spark-sql-engine.log.273 Retrying 0 of 1

The related code is in https://github.com/apache/kyuubi/blob/master/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ProcBuilder.scala

pan3793 commented 1 year ago

Hmm, it seems we were talking about different things. I see what you mean now, and I agree that there is room to improve the error message extraction. But maybe we can improve the extraction rule rather than crudely extracting more of the stack trace.

beryllw commented 1 year ago

Crudely extracting more of the stack trace is relatively easy to implement. As for improving the extraction rules, I don't have any ideas at the moment. Do you have any insights to share?

pan3793 commented 1 year ago

Maybe we can recognize and eliminate some of the unhelpful trailing stack trace, so that the root cause is exposed.

beryllw commented 1 year ago

> Maybe we can recognize and eliminate some of the unhelpful trailing stack trace, so that the root cause is exposed.

Eliminating the unhelpful trailing stack trace is a good idea. I will give it a try.
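One possible reading of "eliminating the unhelpful trailing stack trace" is to skip the outer wrapper exceptions and report the innermost `Caused by:` entry, which by Java convention is the deepest cause in a printed trace. The Python sketch below illustrates that idea only. It is not Kyuubi's extraction rule, and the function name `root_cause` is hypothetical. It can only help when the real cause appears in the same trace.

```python
def root_cause(lines):
    """Return the innermost exception line of a printed Java stack trace, or None.

    The last "Caused by:" entry wins; if there is none, fall back to the
    first 'Exception in thread ...' line.
    """
    first_exception = None
    last_caused_by = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Caused by: "):
            last_caused_by = line[len("Caused by: "):]
        elif first_exception is None and line.startswith("Exception in thread"):
            first_exception = line
    return last_caused_by or first_exception
```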

beryllw commented 1 year ago

Eliminating the unhelpful trailing stack trace is difficult. The exception stack from ProcBuilder's stderr isn't being captured. The matching rule for stack frames is "\tat" (a tab followed by "at"), but when stderr outputs the exception, the stack frames are indented with spaces before "at", as in the second log below. The problem I'm facing could be solved by capturing this part of the exception stack.

The part of the log that is matched as the exception is missing the underlying exception stack:

Caused by: java.lang.RuntimeException: org.apache.kyuubi.KyuubiSQLException:Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
 See more: /xxxxxxxxx/kyuubi-spark-sql-engine.log.xxx
    at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
    at org.apache.kyuubi.engine.ProcBuilder.$anonfun$start$1(ProcBuilder.scala:142)
    ... 1 more

The log that contains the exception stack:

For more detailed output, check the application tracking page: http://xxxxxxx Then click on links to logs of each attempt.
. Failing the application.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1780)
...
Caused by: org.apache.spark.SparkException: Application xxxxxxxx finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1249)
...
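The mismatch described above (a rule that expects a tab before "at", against frames indented with spaces) could be avoided by accepting any leading whitespace. This is a minimal Python sketch of such a rule, not the actual ProcBuilder code. The pattern and the helper name `is_stack_frame` are assumptions made for illustration.

```python
import re

# Accept frames indented by tabs and/or spaces, e.g. "\tat a.b.C.m(C.java:1)"
# or "        at a.b.C.m(C.java:1)", plus the "... N more" continuation line.
STACK_FRAME = re.compile(r"^[ \t]+(at \S|\.\.\. \d+ more)")

def is_stack_frame(line):
    """True if `line` looks like an indented Java stack frame."""
    return STACK_FRAME.match(line) is not None
```

With this rule, both the tab-indented frames and the space-indented frames printed on stderr would be treated as part of the same exception stack.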

@pan3793 cc