databrickslabs / ucx

Automated migrations to Unity Catalog

[BUG]: UCX Assessment tasks are failing #2398

Closed tunayokumus closed 2 weeks ago

tunayokumus commented 1 month ago

Current Behavior

Since we installed UCX in our Azure Databricks workspace, every run of the assessment job has failed. At first only the crawl_tables task failed, with a Spark driver error. In subsequent runs more tasks started to fail. The task failures happen about 1-2 hours after the job starts.

Here are some of the error messages from the failed tasks:

crawl_tables: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached. at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)"

crawl_groups: "com.databricks.backend.common.rpc.DriverStoppedException: Driver down cause: driver state change (exit code: 137)"
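
Note on the exit code: by the usual Unix convention, an exit code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9 (SIGKILL). A driver killed this way is often, though not always, a sign that it ran out of memory; that cause is an assumption here, not something the log confirms. A minimal sketch of the decoding, where `describe_exit_code` is a hypothetical helper name and not part of UCX or Databricks:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a process exit code to a signal name using the 128+N convention."""
    if code > 128:
        try:
            # Exit codes above 128 conventionally encode "killed by signal N".
            return signal.Signals(code - 128).name
        except ValueError:
            return "unknown signal"
    return "not a signal exit"


# 137 is the exit code reported in the crawl_groups error above.
print(describe_exit_code(137))
```

This only identifies the signal. Confirming a memory problem would still require looking at the driver's memory metrics or logs for the failed run.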

Expected Behavior

In another workspace the assessment job ran successfully without issues. We applied the same configuration to both workspaces when installing UCX.

Steps To Reproduce

No response

Cloud

Azure

Operating System

Linux

Version

latest via Databricks CLI

Relevant log output

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
    at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)
    at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:187)
    at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:187)
    at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4(DriverDaemonMonitorImpl.scala:251)
    at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4$adapted(DriverDaemonMonitorImpl.scala:251)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:251)
    at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.scala:406)
    at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$job$1(DriverDaemonMonitorImpl.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:532)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:636)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:654)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
    at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:95)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:76)
    at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionTags(SingletonJob.scala:432)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:631)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:541)
    at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperationWithResultTags(SingletonJob.scala:432)
    at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:533)
    at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:501)
    at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperation(SingletonJob.scala:432)
    at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$4(SingletonJob.scala:491)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
    at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
    at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$3(SingletonJob.scala:491)
    at scala.util.Try$.apply(Try.scala:213)
    at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$1(SingletonJob.scala:490)
    at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:109)
    at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.run(SingletonJob.scala:484)
    at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$3(InstrumentedExecutorService.scala:144)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
    at com.databricks.threading.InstrumentedExecutorService$$anon$1.withAttributionContext(InstrumentedExecutorService.scala:137)
    at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$2(InstrumentedExecutorService.scala:142)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads(QueuedThreadPoolInstrumenter.scala:110)
    at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads$(QueuedThreadPoolInstrumenter.scala:107)
    at com.databricks.threading.InstrumentedExecutorService.trackActiveThreads(InstrumentedExecutorService.scala:40)
    at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$1(InstrumentedExecutorService.scala:141)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.context.integrity.IntegrityCheckContext$ThreadLocalStorage$.withValue(IntegrityCheckContext.scala:73)
    at com.databricks.threading.InstrumentedExecutorService$$anon$1.run(InstrumentedExecutorService.scala:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

JCZuurmond commented 1 month ago

@tunayokumus: This is odd. Are there differences in network configuration between the workspaces? If so, what are they? Also, in the same workspace, could you compare the UCX job cluster configuration with the configuration of a (job) cluster that does not fail after two hours?

Finally, does the error still persist today?
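
One way to do the cluster comparison suggested above is to copy both cluster definitions as JSON (for example from the cluster's JSON view in the UI) and diff them key by key. The sketch below is plain Python with no Databricks API calls; `flatten` and `diff_configs` are hypothetical helper names, and the two sample configurations are made-up values for illustration, not settings taken from this issue.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths, e.g. spark_conf.<key>."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(left: dict, right: dict) -> dict:
    """Return {path: (left_value, right_value)} for every setting that differs."""
    flat_left, flat_right = flatten(left), flatten(right)
    return {
        path: (flat_left.get(path), flat_right.get(path))
        for path in sorted(flat_left.keys() | flat_right.keys())
        if flat_left.get(path) != flat_right.get(path)
    }


# Hypothetical sample values; replace with the JSON of the real clusters.
ucx_cluster = {
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
}
other_cluster = {
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 1,
    "spark_conf": {},
}

for path, (ucx_value, other_value) in diff_configs(ucx_cluster, other_cluster).items():
    print(f"{path}: ucx={ucx_value!r} other={other_value!r}")
```

A setting that is missing on one side shows up as `None`, which makes it easy to spot options set on only one of the clusters.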

HariGS-DB commented 3 weeks ago

@tunayokumus A few things to check on the failing job.