databrickslabs / ucx

Automated migrations to Unity Catalog

[BUG]: migrate-tables fails to migrate table with a decimal as a column name #2544

Closed: maruppel closed this issue 1 month ago

maruppel commented 2 months ago

Is there an existing issue for this?

Current Behavior

migrate-tables is unable to migrate a table that has a decimal as a column name, e.g. 10.0.

Expected Behavior

Tables with non-standard column names are migrated, with those names escaped (wrapped in backticks) in the generated DDL.
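To illustrate the expected escaping, here is a minimal sketch (a hypothetical helper, not ucx's actual code) of how a non-standard column name such as 10.0 would need to be backtick-quoted before being embedded in a CREATE TABLE statement:

```python
import re

# Hypothetical sketch: quote any column name that is not a plain
# identifier, doubling embedded backticks, so a name like "10.0"
# survives as a single column in the generated DDL.
def escape_column(name: str) -> str:
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name  # already a valid unquoted identifier
    return "`" + name.replace("`", "``") + "`"

columns = [("10.0", "DOUBLE"), ("id", "BIGINT")]
ddl = ", ".join(f"{escape_column(name)} {dtype}" for name, dtype in columns)
print(f"CREATE TABLE main.schema.table ({ddl})")
# CREATE TABLE main.schema.table (`10.0` DOUBLE, id BIGINT)
```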

Steps To Reproduce

No response
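The report does not include repro steps, but a plausible one, assuming a Databricks notebook where spark is in scope (table and column names below are made up for illustration), would be:

```python
# Hypothetical repro sketch (the issue itself provides no steps):
# create a managed Hive metastore Delta table (stored in the DBFS root)
# whose first column is named "10.0", then run the ucx migrate-tables
# workflow against it.
spark.sql("CREATE TABLE hive_metastore.default.decimal_col (`10.0` DOUBLE)")
spark.sql("INSERT INTO hive_metastore.default.decimal_col VALUES (1.0)")
# migrate-tables would then be expected to fail with:
# INVALID_PARAMETER_VALUE: ... At columns.0: name "" is not a valid name
```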

Cloud

Azure

Operating System

Windows

Version

latest via Databricks CLI

Relevant log output

Failed to migrate table hive_metastore.[schema].[table] to [catalog].[schema].[table]: INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name
JCZuurmond commented 2 months ago

@maruppel : Thank you for opening the issue. To make sure that we cover your case, could you specify what kind of table you are trying to migrate?

I expect it to be a "table in mount", as that is the only migration path where we specify the schema explicitly; the others use a "SELECT * FROM ...".
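Roughly, the distinction being asked about looks like this (illustrative SQL with placeholder names, not ucx's actual generated statements):

```python
# Table-in-mount migrations spell the schema out explicitly, so every
# column name appears literally in the DDL and must be escaped:
create_with_explicit_schema = (
    "CREATE TABLE target_catalog.target_schema.target_table "
    "(`10.0` DOUBLE) "
    "LOCATION 'abfss://container@account.dfs.core.windows.net/path'"
)

# The other migration paths carry the schema over implicitly via a
# SELECT *, so no column name is ever written into the statement:
create_as_select = (
    "CREATE TABLE target_catalog.target_schema.target_table AS "
    "SELECT * FROM hive_metastore.source_schema.source_table"
)
```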

maruppel commented 2 months ago

All the tables with the error are DBFS root tables.

maruppel commented 1 month ago

Seems like the issue persists; see the log below, running version 0.37.0.

    details = "INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name"
    debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name \"\" is not a valid name", grpc_status:13, created_time:"2024-09-24T15:12:37.144341818+00:00"}"

15:12:37 WARN [d.l.u.hive_metastore.table_migrate][migrate_tables_1] failed-to-migrate: Failed to migrate table hive_metastore.[schema].[table] to [catalog].[schema].[table]: INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name

JCZuurmond commented 1 month ago

Hmm, do you get more information in the debug logs? See the logs folder inside the UCX installation folder in your workspace.

maruppel commented 1 month ago

> Hmm, do you get more information in the debug logs? See the logs folder inside the UCX installation folder in your workspace.

No details other than the stack trace.

JCZuurmond commented 1 month ago

Could you share the stack trace? I would like to see which table migration method is called.

maruppel commented 1 month ago

> Could you share the stack trace? I would like to see which table migration method is called.

JVM stacktrace:

org.apache.spark.sql.AnalysisException
    at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException(ErrorDetailsHandler.scala:62)
    at com.databricks.managedcatalog.ErrorDetailsHandler.wrapServiceException$(ErrorDetailsHandler.scala:35)
    at com.databricks.managedcatalog.ManagedCatalogClientImpl.wrapServiceException(ManagedCatalogClientImpl.scala:171)
    at com.databricks.managedcatalog.ManagedCatalogClientImpl.recordAndWrapException(ManagedCatalogClientImpl.scala:5648)
    at com.databricks.managedcatalog.ManagedCatalogClientImpl.createTable(ManagedCatalogClientImpl.scala:1278)
    at com.databricks.sql.managedcatalog.PermissionEnforcingManagedCatalog.createManagedOrExternalTable(PermissionEnforcingManagedCatalog.scala:121)
    at com.databricks.sql.managedcatalog.PermissionEnforcingManagedCatalog.createTable(PermissionEnforcingManagedCatalog.scala:404)
    at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$createTable$1(ProfiledManagedCatalog.scala:181)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.catalyst.MetricKeyUtils$.measure(MetricKey.scala:1056)
    at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.$anonfun$profile$1(ProfiledManagedCatalog.scala:62)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.profile(ProfiledManagedCatalog.scala:61)
    at com.databricks.sql.managedcatalog.ProfiledManagedCatalog.createTable(ProfiledManagedCatalog.scala:181)
    at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.createTableInternal(ManagedCatalogSessionCatalog.scala:924)
    at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.createTable(ManagedCatalogSessionCatalog.scala:851)
    at com.databricks.sql.DatabricksSessionCatalog.createTable(DatabricksSessionCatalog.scala:225)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.updateCatalog(CreateDeltaTableCommand.scala:912)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.runPostCommitUpdates(CreateDeltaTableCommand.scala:286)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.handleCommit(CreateDeltaTableCommand.scala:266)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.$anonfun$run$2(CreateDeltaTableCommand.scala:171)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag(DeltaLogging.scala:225)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag$(DeltaLogging.scala:212)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.withOperationTypeTag(CreateDeltaTableCommand.scala:71)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$2(DeltaLogging.scala:164)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:294)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:292)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordFrameProfile(CreateDeltaTableCommand.scala:71)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:163)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:525)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:629)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:647)
    at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
    at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:244)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:240)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
    at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
    at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:95)
    at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:76)
    at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:624)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:534)
    at com.databricks.spark.util.PublicDBLogging.recordOperationWithResultTags(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:526)
    at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:494)
    at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:27)
    at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:68)
    at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:150)
    at com.databricks.spark.util.UsageLogger.recordOperation(UsageLogger.scala:68)
    at com.databricks.spark.util.UsageLogger.recordOperation$(UsageLogger.scala:55)
    at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:109)
    at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:429)
    at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:408)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:71)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:162)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:152)
    at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:142)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordDeltaOperation(CreateDeltaTableCommand.scala:71)
    at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.run(CreateDeltaTableCommand.scala:149)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.$anonfun$sideEffectResult$2(commands.scala:84)
    at org.apache.spark.sql.execution.SparkPlan.runCommandWithAetherOff(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.runCommandInAetherOrSpark(SparkPlan.scala:191)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.$anonfun$sideEffectResult$1(commands.scala:84)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:81)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:80)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:94)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$5(QueryExecution.scala:382)
    at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$4(QueryExecution.scala:382)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:186)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$3(QueryExecution.scala:382)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$9(SQLExecution.scala:400)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:719)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:278)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1179)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:165)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:656)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$2(QueryExecution.scala:378)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1176)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:374)
    at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$withMVTagsIfNecessary(QueryExecution.scala:325)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:371)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:347)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:505)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:83)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:505)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:40)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:379)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:375)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:40)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:40)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:481)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:347)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:436)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:347)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:284)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:281)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:339)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:131)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1179)
    at org.apache.spark.sql.SparkSession.$anonfun$withActiveAndFrameProfiler$1(SparkSession.scala:1186)
    at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
    at org.apache.spark.sql.SparkSession.withActiveAndFrameProfiler(SparkSession.scala:1186)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:122)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$6(SparkSession.scala:957)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withMainTracker(QueryPlanningTracker.scala:179)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$5(SparkSession.scala:945)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1179)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:945)
    at org.apache.spark.sql.connect.planner.SparkConnectPlanner.executeSQL(SparkConnectPlanner.scala:2902)
    at org.apache.spark.sql.connect.planner.SparkConnectPlanner.handleSqlCommand(SparkConnectPlanner.scala:2744)
    at org.apache.spark.sql.connect.planner.SparkConnectPlanner.process(SparkConnectPlanner.scala:2685)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handleCommand(ExecuteThreadRunner.scala:319)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:242)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:178)
    at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:333)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1179)
    at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:333)
    at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
    at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:85)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:235)
    at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:84)
    at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:332)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:178)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:128)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:526)
    at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
    at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
    at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
    at scala.util.Using$.resource(Using.scala:269)
    at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:525)

FastLee commented 1 month ago

Are these Delta or non-Delta tables?

FastLee commented 1 month ago

Can you show us the table info from the assessment dashboard?

maruppel commented 1 month ago

> Are these Delta or non-Delta tables?

All are DBFS root Delta tables.
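For reference, if I read the ucx docs correctly, DBFS root Delta tables are migrated with a deep clone, roughly like the sketch below (placeholder names, assuming a notebook where spark is in scope). Notably, a clone copies the schema from the source rather than spelling it out in DDL, yet the resulting CreateTable RPC is still rejected over the empty column name:

```python
# Sketch of the DBFS-root migration path as I understand it from the
# ucx documentation (names are placeholders, not ucx's actual code):
# a managed copy of the source Delta table via DEEP CLONE.
spark.sql(
    "CREATE TABLE IF NOT EXISTS target_catalog.target_schema.target_table "
    "DEEP CLONE hive_metastore.source_schema.source_table"
)
```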

JCZuurmond commented 1 month ago

Is there also a Python stack trace, so we can see which Python code triggered this?

maruppel commented 1 month ago

> Is there also a Python stack trace, so we can see which Python code triggered this?

15:20:12 ERROR [p.s.c.client.logging][migrate_tables_7] GRPC Error received: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1736, in _execute_and_fetch_as_iterator
    for b in generator:
  File "<frozen _collections_abc>", line 330, in __next__
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 135, in send
    if not self._has_next():
           ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 196, in _has_next
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 168, in _has_next
    self._current = self._call_iter(
                    ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 291, in _call_iter
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 271, in _call_iter
    return iter_fun()
           ^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 169, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name"
    debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name \"\" is not a valid name", grpc_status:13, created_time:"2024-09-24T15:20:12.518225701+00:00"}"
>
2024-09-24 15:20:12,523 2361 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1736, in _execute_and_fetch_as_iterator
    for b in generator:
  File "<frozen _collections_abc>", line 330, in __next__
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 135, in send
    if not self._has_next():
           ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 196, in _has_next
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 168, in _has_next
    self._current = self._call_iter(
                    ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 291, in _call_iter
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 271, in _call_iter
    return iter_fun()
           ^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 169, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name"
    debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name \"\" is not a valid name", grpc_status:13, created_time:"2024-09-24T15:20:12.518225701+00:00"}"
>
15:20:12  WARN [d.l.u.hive_metastore.table_migrate][migrate_tables_7] failed-to-migrate: Failed to migrate table [REDACTED]: INVALID_PARAMETER_VALUE: Invalid input: RPC CreateTable Field managedcatalog.ColumnInfo.name: At columns.0: name "" is not a valid name