databrickslabs / ucx

Automated migrations to Unity Catalog

[BUG]: ConcurrentDeleteReadException in migrate-view task during table migration #2209

Closed beata-bot closed 3 weeks ago

beata-bot commented 1 month ago

Is there an existing issue for this?

Current Behavior

The migrate-views task in the table migration workflow fails due to concurrent operations on the UCX internal inventory database in the workspace. Only some of the views are migrated successfully; after running the table migration workflow multiple times, all views eventually get migrated. (The views had no dependencies on each other.)

ManyError: Detected 2 failures: Unknown: [DELTA_CONCURRENT_DELETE_READ] ConcurrentDeleteReadException: This transaction attempted to read one or more files that were deleted (for example part-00000-0df56665-a002-4e1a-8c38-add89f9c16f3-c000.snappy.parquet in the root of the table) by a concurrent update. Please try the operation again. Conflicting commit: {"timestamp":1721292413579,"userId":"5695979632839528","userName":"jil.scott@domain.com","operation":"DELETE","operationParameters":{"predicate":["true"]},"job":{"jobId":"945981217581767","jobName":"[UCX] migrate-tables","jobRunId":"619297246306369","runId":"995793153947461","jobOwnerId":"5695979632839528","triggerType":"manual"},"clusterId":"0718-083610-suw33de7","readVersion":73,"isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3090","numCopiedRows":"0","numDeletionVectorsAdded":"0","numDeletionVectorsRemoved":"0","numAddedChangeFiles":"0","executionTimeMs":"917","numDeletionVectorsUpdated":"0","numDeletedRows":"42","scanTimeMs":"916","numAddedFiles":"0","numAddedBytes":"0","rewriteTimeMs":"0"},"tags":{"noRowsCopied":"true","delta.rowTracking.preserved":"false","restoresDeletedRows":"false"},"engineInfo":"Databricks-Runtime/15.3.x-scala2.12","txnId":"2395993f-f87a-45f9-be39-52875d3d7793"} Refer to https://docs.microsoft.com/azure/databricks/delta/concurrency-control for more details.

Expected Behavior

No such concurrency error should occur. A single run of migrate-views should be enough to migrate all views.
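Until the underlying write contention is addressed, one possible workaround is to retry operations that hit Delta's concurrent-modification errors, as the error message itself suggests ("Please try the operation again"). A minimal sketch, not the actual UCX code: the exception is matched by message substring, since the concrete exception class depends on the SQL connector in use.

```python
import random
import time


def retry_on_delta_conflict(operation, max_attempts=5, base_delay=1.0):
    """Retry a zero-argument callable that may fail with a Delta
    concurrency conflict.

    Errors whose message contains a Delta concurrent-modification marker
    are retried with exponential backoff plus jitter; all other errors
    propagate immediately.
    """
    retryable = ("DELTA_CONCURRENT_DELETE_READ", "ConcurrentDeleteReadException")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as e:
            if not any(marker in str(e) for marker in retryable):
                raise
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, to de-synchronize workers
            # that would otherwise collide again on the next attempt.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Retrying only papers over the symptom; it does not prevent the conflicting DELETE from the parallel migrate-tables commit shown above.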

Steps To Reproduce

Environment: Azure cloud. UCX version: v0.28.2

Config:

inventory_database: ucx
log_level: INFO
max_workers: 10
min_workers: 1
num_days_submit_runs_history: 30
num_threads: 8
policy_id: 00083DC1385D3B40
recon_tolerance_percent: 5
renamed_group_prefix: ucx-temp-
trigger_job: true
version: 2
warehouse_id: cef3a02b19a39406
workspace_group_regex: ^
workspace_start_path: /

Steps:

All Tables migrated successfully using SYNC (all external tables).

Cloud

Azure

Operating System

Linux

Version

latest via Databricks CLI

Relevant log output

08:48:04 DEBUG [databricks.labs.lsql.backends] {migrate_views_0} [spark][execute] ALTER VIEW analytics_qa.legato.gold_vw_dim_aircraft_realtime SET TBLPROPERTIES ('upgraded_from' ... (109 more bytes)
08:48:05 DEBUG [databricks.labs.ucx.hive_metastore.table_migrate] {migrate_views_0} Migrating acls on analytics_qa.legato.gold_vw_dim_aircraft_realtime using SQL query: ALTER VIEW analytics_qa.legato.gold_vw_dim_aircraft_realtime OWNER TO `piotr.blaszczak@domain.com`
08:48:07 INFO [databricks.labs.blueprint.parallel] {migrate_views_0} migrate views 3/3, rps: 0.040/sec
08:48:07 ERROR [databricks.labs.blueprint.parallel] {MainThread} More than half 'migrate views' tasks failed: 33% results available (1/3). Took 0:01:15.471772
08:48:07 ERROR [databricks.labs.ucx] {MainThread} Execute `databricks workspace export //Applications/ucx/logs/migrate-tables/run-619297246306369-0/migrate_views.log` locally to troubleshoot with more details. Detected 2 failures: Unknown: [DELTA_CONCURRENT_DELETE_READ] ConcurrentDeleteReadException: This transaction attempted to read one or more files that were deleted (for example part-00000-0df56665-a002-4e1a-8c38-add89f9c16f3-c000.snappy.parquet in the root of the table) by a concurrent update. Please try the operation again.
Conflicting commit: {"timestamp":1721292413579,"userId":"5695979632839528","userName":"jil.scott@domain.com","operation":"DELETE","operationParameters":{"predicate":["true"]},"job":{"jobId":"945981217581767","jobName":"[UCX] migrate-tables","jobRunId":"619297246306369","runId":"995793153947461","jobOwnerId":"5695979632839528","triggerType":"manual"},"clusterId":"0718-083610-suw33de7","readVersion":73,"isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3090","numCopiedRows":"0","numDeletionVectorsAdded":"0","numDeletionVectorsRemoved":"0","numAddedChangeFiles":"0","executionTimeMs":"917","numDeletionVectorsUpdated":"0","numDeletedRows":"42","scanTimeMs":"916","numAddedFiles":"0","numAddedBytes":"0","rewriteTimeMs":"0"},"tags":{"noRowsCopied":"true","delta.rowTracking.preserved":"false","restoresDeletedRows":"false"},"engineInfo":"Databricks-Runtime/15.3.x-scala2.12","txnId":"2395993f-f87a-45f9-be39-52875d3d7793"}
Refer to https://docs.microsoft.com/azure/databricks/delta/concurrency-control for more details.
08:48:07 DEBUG [databricks] {MainThread} Task crash details
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/runtime.py", line 100, in trigger
    current_task(ctx)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/hive_metastore/workflows.py", line 63, in migrate_views
    ctx.tables_migrator.migrate_tables(
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/hive_metastore/table_migrate.py", line 87, in migrate_tables
    return self._migrate_views(acl_strategy, all_grants_to_migrate, all_migrated_groups, all_principal_grants)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/hive_metastore/table_migrate.py", line 140, in _migrate_views
    Threads.strict("migrate views", tasks)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/blueprint/parallel.py", line 63, in strict
    raise ManyError(errs)
databricks.labs.blueprint.parallel.ManyError: Detected 2 failures: Unknown: [DELTA_CONCURRENT_DELETE_READ] ConcurrentDeleteReadException: This transaction attempted to read one or more files that were deleted (for example part-00000-0df56665-a002-4e1a-8c38-add89f9c16f3-c000.snappy.parquet in the root of the table) by a concurrent update. Please try the operation again.
Conflicting commit: {"timestamp":1721292413579,"userId":"5695979632839528","userName":"jil.scott@domain.com","operation":"DELETE","operationParameters":{"predicate":["true"]},"job":{"jobId":"945981217581767","jobName":"[UCX] migrate-tables","jobRunId":"619297246306369","runId":"995793153947461","jobOwnerId":"5695979632839528","triggerType":"manual"},"clusterId":"0718-083610-suw33de7","readVersion":73,"isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3090","numCopiedRows":"0","numDeletionVectorsAdded":"0","numDeletionVectorsRemoved":"0","numAddedChangeFiles":"0","executionTimeMs":"917","numDeletionVectorsUpdated":"0","numDeletedRows":"42","scanTimeMs":"916","numAddedFiles":"0","numAddedBytes":"0","rewriteTimeMs":"0"},"tags":{"noRowsCopied":"true","delta.rowTracking.preserved":"false","restoresDeletedRows":"false"},"engineInfo":"Databricks-Runtime/15.3.x-scala2.12","txnId":"2395993f-f87a-45f9-be39-52875d3d7793"}
Refer to https://docs.microsoft.com/azure/databricks/delta/concurrency-control for more details.
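The traceback shows `Threads.strict("migrate views", tasks)` running view migrations in parallel, while the conflicting commit comes from a DELETE with predicate `true` issued by the same `[UCX] migrate-tables` job run against a shared inventory table. One plausible mitigation, sketched here purely as an illustration and not as the actual UCX implementation, is to serialize writes to each shared table with a process-level lock while keeping the view migrations themselves parallel:

```python
import threading

# Hypothetical helper: one lock per shared inventory table, so parallel
# worker threads never issue overlapping writes to the same Delta table.
_table_locks: dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()


def lock_for_table(full_name: str) -> threading.Lock:
    """Return a singleton lock for the given table name (thread-safe)."""
    with _registry_lock:
        return _table_locks.setdefault(full_name, threading.Lock())


def write_inventory(full_name: str, write_fn) -> None:
    """Run `write_fn` (any callable performing the table write) under the
    table's lock; concurrent callers for the same table queue up instead
    of conflicting at the Delta transaction level."""
    with lock_for_table(full_name):
        write_fn()
```

Note that this only de-conflicts threads within a single job run; writes from separate job runs (as in this conflicting commit) would still need retries or a different write pattern.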
srhstas commented 1 month ago

I encountered the same issue, and multiple reruns didn't help. Previously, a second attempt would complete the workflow without an error. Here is a more specific log to help understand what's happening. (Screenshot attached: 2024-07-25 at 17:18:14.)