Closed: soxofaan closed this issue 2 years ago
Possible solutions: specify that \n can be used for encoding newlines in multi-line messages, and that subsequent indentation should be preserved when rendering the message in the client.

This might be more of a client issue, because the API only says "string", which implicitly allows newlines, I'd say. On the other hand, I did not expect people to put excessively long stack traces into the message. What would be more user-friendly is to pass only the actual error message to message and put the stack trace e.g. into "data"; that would massively improve the user experience. We could probably add a "formatting rule" for stack traces to the implementation guide, similar to the one we have for inspect. The log component can already offer much more if fed correctly: https://open-eo.github.io/openeo-vue-components/ (see logs -> example ...)
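For what it's worth, the \n encoding itself is lossless at the JSON level; it is only the rendering step that collapses the line breaks. A minimal Python sketch of the round trip:

```python
import json

# A multi-line message round-trips through JSON as an escaped "\n" in the
# string; preserving the line breaks on display is up to the client.
msg = "error processing batch job\n  File \"batch_job.py\", line 319, in main"
encoded = json.dumps({"message": msg})

assert "\\n" in encoded                       # newline is escaped on the wire
assert json.loads(encoded)["message"] == msg  # decoding restores it exactly
```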
@soxofaan Could you paste an example of such an error in Python, so that I can see how it is structured and how we could format it? I think we can already make this much more pleasing without a lot of rewriting.
Thinking about something like:
{
  "id": "132",
  "level": "error",
  "message": "error processing batch job due to ...",
  "data": {
    "type": "Stacktrace",
    "stacktrace": [
      {"file": "batch_job.py", "line": 319, "text": "in main"},
      {"file": "batch_job.py", "line": 292, "text": "in run_driver"},
      ...
    ]
  }
}
Something like that should already render much nicer in the component.
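As a sketch of how a backend could build such an entry from a Python exception using the standard traceback module (the format_log_entry helper and the "Stacktrace"/"stacktrace" field names are just the suggestion above, not an agreed spec):

```python
import traceback

def format_log_entry(log_id, exc):
    # Hypothetical helper: turn an exception into the structured log entry
    # proposed above. Each traceback frame becomes a {file, line, text} dict.
    frames = traceback.extract_tb(exc.__traceback__)
    return {
        "id": log_id,
        "level": "error",
        "message": f"error processing batch job: {exc}",
        "data": {
            "type": "Stacktrace",
            "stacktrace": [
                {"file": f.filename, "line": f.lineno, "text": f"in {f.name}"}
                for f in frames
            ],
        },
    }

try:
    raise RuntimeError("boom")
except RuntimeError as e:
    entry = format_log_entry("132", e)
```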
This is an example /logs response JSON dump from the VITO backend:
{"logs":[{"id": "error", "level": "error", "message": "error processing batch job\nTraceback (most recent call last):\n File \"batch_job.py\", line 319, in main\n run_driver()\n File \"batch_job.py\", line 292, in run_driver\n run_job(\n File \"/data2/hadoop/yarn/local/usercache/johndoe/appcache/application_1652795411773_14281/container_e5028_1652795411773_14281_01_000002/venv/lib/python3.8/site-packages/openeogeotrellis/utils.py\", line 41, in memory_logging_wrapper\n return function(*args, **kwargs)\n File \"batch_job.py\", line 388, in run_job\n assets_metadata = result.write_assets(str(output_file))\n File \"/data2/hadoop/yarn/local/usercache/johndoe/appcache/application_1652795411773_14281/container_e5028_1652795411773_14281_01_000002/venv/lib/python3.8/site-packages/openeo_driver/save_result.py\", line 110, in write_assets\n return self.cube.write_assets(filename=directory, format=self.format, format_options=self.options)\n File \"/data2/hadoop/yarn/local/usercache/johndoe/appcache/application_1652795411773_14281/container_e5028_1652795411773_14281_01_000002/venv/lib/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py\", line 1542, in write_assets\n timestamped_paths = self._get_jvm().org.openeo.geotrellis.geotiff.package.saveRDDTemporal(\n File \"/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py\", line 1309, in __call__\n return_value = get_return_value(\n File \"/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py\", line 326, in get_return_value\n raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDDTemporal.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3652 in stage 14.0 failed 4 times, most recent failure: Lost task 3652.3 in stage 14.0 (TID 3949) (epod130.vgt.vito.be executor 37): net.jodah.failsafe.FailsafeException: java.net.SocketTimeoutException: connect timed out\n\tat 
net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:385)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openeo.geotrellissentinelhub.package$.withRetries(package.scala:59)\n\tat org.openeo.geotrellissentinelhub.DefaultProcessApi.getTile(ProcessApi.scala:119)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.$anonfun$datacube_seq$1(PyramidFactory.scala:193)\n\tat org.openeo.geotrellissentinelhub.MemoizedRlGuardAdapterCachedAccessTokenWithAuthApiFallbackAuthorizer.authorized(Authorizer.scala:46)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.authorized(PyramidFactory.scala:56)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.org$openeo$geotrellissentinelhub$PyramidFactory$$getTile$1(PyramidFactory.scala:191)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.org$openeo$geotrellissentinelhub$PyramidFactory$$dataTile$1(PyramidFactory.scala:201)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.loadMasked$1(PyramidFactory.scala:226)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.$anonfun$datacube_seq$16(PyramidFactory.scala:283)\n\tat scala.collection.Iterator$$anon$10.next(Iterator.scala:459)\n\tat scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:512)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator.foreach(Iterator.scala:941)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:941)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1429)\n\tat org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)\n\tat org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:670)\n\tat 
org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:424)\n\tat org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)\n\tat org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:259)\nCaused by: java.net.SocketTimeoutException: connect timed out\n\tat java.base/java.net.PlainSocketImpl.socketConnect(Native Method)\n\tat java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)\n\tat java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)\n\tat java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)\n\tat java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)\n\tat java.base/java.net.Socket.connect(Socket.java:609)\n\tat java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:300)\n\tat java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)\n\tat java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)\n\tat java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)\n\tat java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)\n\tat java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)\n\tat java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:203)\n\tat 
java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)\n\tat java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:189)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)\n\tat java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)\n\tat java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:334)\n\tat scalaj.http.HttpRequest.doConnection(Http.scala:367)\n\tat scalaj.http.HttpRequest.exec(Http.scala:343)\n\tat org.openeo.geotrellissentinelhub.DefaultProcessApi.$anonfun$getTile$7(ProcessApi.scala:120)\n\tat org.openeo.geotrellissentinelhub.package$$anon$1.get(package.scala:60)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:46)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:65)\n\tat net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:378)\n\t... 
24 more\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\n\tat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)\n\tat scala.Option.foreach(Option.scala:407)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:414)\n\tat 
org.apache.spark.rdd.RDD.collect(RDD.scala:1029)\n\tat org.openeo.geotrellis.geotiff.package$.saveRDDTemporal(package.scala:136)\n\tat org.openeo.geotrellis.geotiff.package.saveRDDTemporal(package.scala)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\n\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: net.jodah.failsafe.FailsafeException: java.net.SocketTimeoutException: connect timed out\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:385)\n\tat net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)\n\tat org.openeo.geotrellissentinelhub.package$.withRetries(package.scala:59)\n\tat org.openeo.geotrellissentinelhub.DefaultProcessApi.getTile(ProcessApi.scala:119)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.$anonfun$datacube_seq$1(PyramidFactory.scala:193)\n\tat org.openeo.geotrellissentinelhub.MemoizedRlGuardAdapterCachedAccessTokenWithAuthApiFallbackAuthorizer.authorized(Authorizer.scala:46)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.authorized(PyramidFactory.scala:56)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.org$openeo$geotrellissentinelhub$PyramidFactory$$getTile$1(PyramidFactory.scala:191)\n\tat 
org.openeo.geotrellissentinelhub.PyramidFactory.org$openeo$geotrellissentinelhub$PyramidFactory$$dataTile$1(PyramidFactory.scala:201)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.loadMasked$1(PyramidFactory.scala:226)\n\tat org.openeo.geotrellissentinelhub.PyramidFactory.$anonfun$datacube_seq$16(PyramidFactory.scala:283)\n\tat scala.collection.Iterator$$anon$10.next(Iterator.scala:459)\n\tat scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:512)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)\n\tat scala.collection.Iterator.foreach(Iterator.scala:941)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:941)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1429)\n\tat org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)\n\tat org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:670)\n\tat org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:424)\n\tat org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)\n\tat org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:259)\nCaused by: java.net.SocketTimeoutException: connect timed out\n\tat java.base/java.net.PlainSocketImpl.socketConnect(Native Method)\n\tat java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)\n\tat java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)\n\tat java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)\n\tat java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)\n\tat java.base/java.net.Socket.connect(Socket.java:609)\n\tat 
java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:300)\n\tat java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)\n\tat java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)\n\tat java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)\n\tat java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)\n\tat java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)\n\tat java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:203)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)\n\tat java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:189)\n\tat java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)\n\tat 
java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)\n\tat java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)\n\tat java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:334)\n\tat scalaj.http.HttpRequest.doConnection(Http.scala:367)\n\tat scalaj.http.HttpRequest.exec(Http.scala:343)\n\tat org.openeo.geotrellissentinelhub.DefaultProcessApi.$anonfun$getTile$7(ProcessApi.scala:120)\n\tat org.openeo.geotrellissentinelhub.package$$anon$1.get(package.scala:60)\n\tat net.jodah.failsafe.Functions.lambda$get$0(Functions.java:46)\n\tat net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:65)\n\tat net.jodah.failsafe.Execution.executeSync(Execution.java:128)\n\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:378)\n\t... 24 more\n\n"}],"links":[]}
Also note that, because of our processing stack, this is a pretty complex stack trace: part of it is a Python stack trace and part of it is a Java/Scala stack trace, both of which can have multiple phases (e.g. an exception handler that raises another exception):
Traceback (most recent call last):
  File "batch_job.py", line 319, in main
    run_driver()
  File "batch_job.py", line 292, in run_driver
    run_job(
  ...
  File "/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDDTemporal.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3652 in stage 14.0 failed 4 times, most recent failure: Lost task 3652.3 in stage 14.0 (TID 3949) (epod130.vgt.vito.be executor 37): net.jodah.failsafe.FailsafeException: java.net.SocketTimeoutException: connect timed out
    at net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:385)
    at net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:68)
    ...
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    ...
    ... 24 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
    ...
Caused by: net.jodah.failsafe.FailsafeException: java.net.SocketTimeoutException: connect timed out
    at net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:385)
That makes it pretty hard to come up with some useful "type": "Stacktrace" standardization.
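Even without full standardization, the phase boundaries in the JVM part are at least mechanically detectable. A client-side heuristic sketch (not part of any openEO spec) that splits a JVM-style trace at the markers Java and Spark themselves emit:

```python
import re

def split_phases(trace: str) -> list[str]:
    # Split at a newline that is immediately followed by a phase marker,
    # keeping the marker at the start of the next chunk (lookahead split).
    return re.split(r"\n(?=Caused by: |Driver stacktrace:)", trace)

trace = (
    "py4j.protocol.Py4JJavaError: An error occurred\n"
    "\tat net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:385)\n"
    "Caused by: java.net.SocketTimeoutException: connect timed out\n"
    "\tat java.base/java.net.PlainSocketImpl.socketConnect(Native Method)"
)
phases = split_phases(trace)
# phases[1] starts with "Caused by: java.net.SocketTimeoutException"
```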
It would already help to extract the actual error message and put the stacktrace in an array/object style structure so that the component can render it in a more structured way.
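In the meantime, a client could approximate that with a one-line heuristic. This is a sketch (the split_message name is hypothetical, not an existing client API): treat the first line as the human-readable summary and the remainder as raw trace lines.

```python
def split_message(message: str):
    # First line -> summary shown prominently; the rest -> trace lines
    # that can be rendered monospaced with indentation preserved.
    summary, _, rest = message.partition("\n")
    return summary, rest.splitlines()

summary, trace_lines = split_message(
    "error processing batch job\n"
    "Traceback (most recent call last):\n"
    '  File "batch_job.py", line 319, in main'
)
# summary == "error processing batch job"; trace_lines has 2 entries
```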
When a batch job fails in the VITO back-end, the error log typically contains a multi-line stack trace. We currently encode the newlines (JSON style) as \n in the message string. This is an ad-hoc solution in terms of the openEO API, because the API does not specify how things like multi-line log messages should be encoded. Can we standardize this in some way, so that all clients can build on it? E.g. the web editor component collapses all whitespace (newlines and indentation), which makes user support painful. And in the Python client you also have to be careful to get a useful render of the log message.