C360 and IoT Turbine demos failing in DLT pipeline step

mik3lol commented 5 months ago

Both demos failed with RUN_EXECUTION_ERROR: Workload failed in one of the DLT pipeline steps, due to OSError: No such file or directory: '/local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-768fe-95041-c4868-1/mlflow/models/tmp_3o1_40h/.'

Full stack trace below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 871.0 failed 4 times, most recent failure: Lost task 0.3 in stage 871.0 (TID 950) (10.0.36.202 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 1275, in udf
    loaded_model = mlflow.pyfunc.load_model(local_model_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 578, in load_model
    local_path = _download_artifact_from_uri(artifact_uri=model_uri, output_path=dst_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/tracking/artifact_utils.py", line 100, in _download_artifact_from_uri
    return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 81, in download_artifacts
    raise OSError(f"No such file or directory: '{local_artifact_path}'")
OSError: No such file or directory: '/local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-768fe-95041-c4868-1/mlflow/models/tmp_3o1_40h/.'

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:550)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:506)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:195)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:92)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:87)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3897)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3819)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3806)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3806)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1685)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1670)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1670)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:4143)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4055)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4043)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:54)

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 1275, in udf
    loaded_model = mlflow.pyfunc.load_model(local_model_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 578, in load_model
    local_path = _download_artifact_from_uri(artifact_uri=model_uri, output_path=dst_path)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/tracking/artifact_utils.py", line 100, in _download_artifact_from_uri
    return get_artifact_repository(artifact_uri=root_uri).download_artifacts(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-cbf62aec-d47b-48f3-a264-73a49ce0c0cf/lib/python3.10/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 81, in download_artifacts
    raise OSError(f"No such file or directory: '{local_artifact_path}'")
OSError: No such file or directory: '/local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-768fe-95041-c4868-1/mlflow/models/tmp_3o1_40h/.'

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:550)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:506)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(null:-1)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:195)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:92)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:87)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

QuentinAmbard commented 5 months ago

Hey, there is an issue with mlflow in DLT that we're working on fixing. In the meantime, you should be able to work around it by installing this at the beginning of the DLT notebook:

%pip install git+https://github.com/WeichenXu123/mlflow.git@dlt-temp-fix
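One way to confirm that the pip line above actually replaced the stock mlflow is to read the install metadata pip records for packages installed from a URL or VCS. This is only a sketch: `describe_install` is a hypothetical helper name, and it assumes pip wrote a `direct_url.json` file (PEP 610) for the install, which may not hold in every environment.

```python
import json
from importlib import metadata


def describe_install(package="mlflow"):
    """Return version and VCS origin of an installed package, or None if it is absent."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None

    info = {"version": dist.version, "url": None, "requested_revision": None}

    # pip writes direct_url.json for URL/VCS installs (PEP 610); it is
    # missing for ordinary index installs, in which case read_text returns None.
    raw = dist.read_text("direct_url.json")
    if raw:
        direct = json.loads(raw)
        info["url"] = direct.get("url")
        info["requested_revision"] = direct.get("vcs_info", {}).get("requested_revision")
    return info


# In the DLT notebook, run this in a cell after the %pip install line.
print(describe_install("mlflow"))
```

If the branch install took effect, the printed `url` should point at the fork and `requested_revision` should read `dlt-temp-fix`; if both are `None`, the notebook is probably still using the preinstalled package.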

mik3lol commented 5 months ago

Thanks @QuentinAmbard I'll give it a try and report back.

mik3lol commented 5 months ago

Added the %pip install line at the top of "01.1-DLT-Wind-Turbine-SQL" but still got the same error. Will keep investigating.

QuentinAmbard commented 5 months ago

@mik3lol could you try changing the DLT channel to CURRENT in the DLT setup and see if it helps?
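For anyone editing the pipeline settings as JSON rather than through the UI, the change amounts to setting one field. The sketch below assumes the settings expose a top-level `channel` key that accepts values such as `CURRENT` and `PREVIEW`; the `with_channel` helper and the example pipeline name are hypothetical and only illustrate the edit.

```python
import copy
import json


def with_channel(settings, channel="CURRENT"):
    """Return a copy of a pipeline settings dict with the channel field set."""
    updated = copy.deepcopy(settings)  # leave the caller's dict untouched
    updated["channel"] = channel
    return updated


# Hypothetical settings; only the "channel" key matters here.
example = {"name": "dbdemos-dlt-example", "channel": "PREVIEW"}
print(json.dumps(with_channel(example), indent=2))
```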

WeichenXu123 commented 5 months ago

@mik3lol

Could you paste the error message you get after running %pip install git+https://github.com/WeichenXu123/mlflow.git@dlt-temp-fix ?

I need to see the full error string "OSError: No such file or directory: {directory path}". The dlt-temp-fix branch uses a different directory path, so the path in the message will tell me whether the fix really took effect.

Also, could you share your DLT pipeline link with me?
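The path check can be done mechanically on a pasted error message. This is a rough sketch: `missing_paths` and `still_uses_old_path` are hypothetical helpers, and `OLD_TMP_PREFIX` is simply the prefix seen in the error reported at the top of this issue. The thread does not say which directory the dlt-temp-fix branch uses, so a different prefix only suggests that the patched code ran.

```python
import re

# Prefix taken from the OSError reported in this issue (assumed to be the unpatched location).
OLD_TMP_PREFIX = "/local_disk0/.ephemeral_nfs/repl_tmp_data/"

_PATTERN = re.compile(r"OSError: No such file or directory: '([^']+)'")


def missing_paths(error_text):
    """Return the distinct paths named in 'No such file or directory' errors."""
    return sorted(set(_PATTERN.findall(error_text)))


def still_uses_old_path(error_text):
    """True if any reported path still sits under the original temp directory."""
    return any(path.startswith(OLD_TMP_PREFIX) for path in missing_paths(error_text))


sample = (
    "OSError: No such file or directory: "
    "'/local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-768fe-95041-c4868-1/mlflow/models/tmp_3o1_40h/.'"
)
print(missing_paths(sample))
print(still_uses_old_path(sample))
```

If `still_uses_old_path` returns `True` for the new error, the notebook is most likely still loading the original mlflow rather than the branch.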

mechevarria commented 5 months ago

@QuentinAmbard just verified that both the C360 and IoT demo DLT tables are working when using the CURRENT channel

mik3lol commented 5 months ago

👋 @QuentinAmbard, confirming default dbdemo FSI Smart Claims installations ran successfully. Trying others now.
