databrickslabs / ucx

Automated migrations to Unity Catalog

TableSizeCrawler explodes on an expired delta share; it also has a gap in exception handling & logging #778

Closed: dmoore247 closed this issue 10 months ago

dmoore247 commented 10 months ago

The newly implemented estimate_table_size_for_migration task propagates errors and crashes the entire task via an uncatchable Py4JJavaError. Logically, delta shares should be skipped when crawling for size estimates.

Advice: guard all table-level calls against failures like the one below; a sketch of such a guard follows the traceback.

com.google.common.util.concurrent.UncheckedExecutionException: io.delta.sharing.client.util.UnexpectedHttpStatus: HTTP request failed with status: HTTP/1.1 401 Unauthorized {"errorCode":"CUSTOMER_UNAUTHORIZED","message":"Unauthorized"}. It may be caused by an expired token as it has expired at 2023-04-01T00:00:00.000Z
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/hive_metastore/table_size.py:71, in TableSizeCrawler._safe_get_table_size(self, table_full_name)
     70 try:
---> 71     return self._spark._jsparkSession.table(table_full_name).queryExecution().analyzed().stats().sizeInBytes()
     72 except Exception as e:

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:

File /databricks/spark/python/pyspark/errors/exceptions/captured.py:188, in capture_sql_exception.<locals>.deco(*a, **kw)
    187 try:
--> 188     return f(*a, **kw)
    189 except Py4JJavaError as e:

File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
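
A minimal sketch of the kind of guard suggested above, as it might sit inside TableSizeCrawler: log and skip a single bad table (expired share, dropped table, etc.) instead of crashing the whole task. The names mirror the traceback; the exact exception list and logging call are assumptions, not the current ucx code.

import logging

from py4j.protocol import Py4JJavaError
from pyspark.errors import AnalysisException

logger = logging.getLogger(__name__)

def _safe_get_table_size(self, table_full_name: str) -> int | None:
    # Assumes this lives on TableSizeCrawler, where self._spark is the SparkSession.
    try:
        # Same JVM call as in table_size.py above; JVM-side failures surface
        # as Py4JJavaError rather than a regular Python exception.
        return self._spark._jsparkSession.table(table_full_name).queryExecution().analyzed().stats().sizeInBytes()
    except (Py4JJavaError, AnalysisException) as e:
        # Expired shares, missing tables, etc. are logged and skipped, not fatal.
        logger.warning(f"Failed to estimate size for {table_full_name}, skipping: {e}")
        return None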
nfx commented 10 months ago

Thank you for the feature request! Currently the team operates at limited capacity and has to prioritize carefully, so we cannot provide a timeline for implementing this feature. Please make a Pull Request if you'd like to see this feature sooner, and we'll guide you through the journey.

dmoore247 commented 10 months ago

Also fails with:

22:43:43  INFO [d.labs.ucx] UCX v0.9.0 After job finishes, see debug logs at /Workspace/Users/first.last@databricks.com/.ucx/logs/assessment/run-67162835779538/estimate_table_size_for_migration.log
22:43:53 ERROR [d.labs.ucx] Task crashed. Execute `databricks workspace export /Users/first.last@databricks.com/.ucx/logs/assessment/run-67162835779538/estimate_table_size_for_migration.log` locally to troubleshoot with more details. [DELTA_TABLE_NOT_FOUND] Delta table `00_leone_retail`.`bd_test_tab1` doesn't exist.

traceback:

22:43:53 ERROR [databricks.labs.ucx] {MainThread} Task crashed. Execute `databricks workspace export /Users/douglas.moore@databricks.com/.ucx/logs/assessment/run-67162835779538/estimate_table_size_for_migration.log` locally to troubleshoot with more details. [DELTA_TABLE_NOT_FOUND] Delta table `00_leone_retail`.`bd_test_tab1` doesn't exist.
22:43:53 DEBUG [databricks] {MainThread} Task crash details
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/hive_metastore/table_size.py", line 71, in _safe_get_table_size
    return self._spark._jsparkSession.table(table_full_name).queryExecution().analyzed().stats().sizeInBytes()
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/errors/exceptions/captured.py", line 194, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [DELTA_TABLE_NOT_FOUND] Delta table `00_leone_retail`.`bd_test_tab1` doesn't exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/framework/tasks.py", line 179, in trigger
    current_task.fn(cfg)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/runtime.py", line 68, in estimate_table_size_for_migration
    table_size.snapshot()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/hive_metastore/table_size.py", line 66, in snapshot
    return self._snapshot(partial(self._try_load), partial(self._crawl))
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/framework/crawlers.py", line 283, in _snapshot
    loaded_records = list(loader())
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/hive_metastore/table_size.py", line 45, in _crawl
    size_in_bytes = self._safe_get_table_size(table.key)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/hive_metastore/table_size.py", line 76, in _safe_get_table_size
    raise RuntimeError(str(e)) from e
RuntimeError: [DELTA_TABLE_NOT_FOUND] Delta table `00_leone_retail`.`bd_test_tab1` doesn't exist.
dmoore247 commented 10 months ago

@FastLee @mwojtyczka ^^

nfx commented 10 months ago

just happened during the demo:

(screenshot attached)
mwojtyczka commented 10 months ago

We are not crawling shares as part of the TableSizeCrawler task. DELTA_TABLE_NOT_FOUND seems to be another type of exception we need to catch, apart from TABLE_OR_VIEW_NOT_FOUND.
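
A rough sketch of how both error classes could be treated as "skip this table". The class names come from the logs in this thread; matching on the exception message is just one possible approach, not necessarily how ucx implements it.

from pyspark.errors import AnalysisException

# Error classes seen in this thread that both mean the table cannot be analyzed.
MISSING_TABLE_ERROR_CLASSES = ("TABLE_OR_VIEW_NOT_FOUND", "DELTA_TABLE_NOT_FOUND")

def is_missing_table_error(e: AnalysisException) -> bool:
    # The exception message starts with the error class in brackets,
    # e.g. "[DELTA_TABLE_NOT_FOUND] Delta table ... doesn't exist."
    return any(error_class in str(e) for error_class in MISSING_TABLE_ERROR_CLASSES)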

dmoore247 commented 10 months ago

@mwojtyczka users can create tables with USING deltaSharing:

CREATE TABLE hive_metastore.default.index_reports
USING deltaSharing
LOCATION 'dbfs:/tmp/pt_config.share%23price-transparency-workshop.pt_stage.index_reports'

We must assume any table can be 'broken' and will throw an exception when DESCRIBE is used.

dmoore247 commented 10 months ago

While DESCRIBE-type commands throw exceptions on malformed tables, SHOW CREATE TABLE does not... (screenshot attached)
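
A hypothetical probe built on that observation: per the thread, DESCRIBE raises on a table whose backing share or location is broken while SHOW CREATE TABLE still returns the definition, so DESCRIBE can serve as a health check. The function name and behaviour are illustrative assumptions, not ucx code.

from pyspark.sql import SparkSession

def table_is_describable(spark: SparkSession, table_full_name: str) -> bool:
    # DESCRIBE fails on a broken table (expired share, missing Delta log),
    # so any exception here is treated as "skip this table".
    try:
        spark.sql(f"DESCRIBE TABLE {table_full_name}").collect()
        return True
    except Exception:
        return False

A crawler could skip size estimation for tables where this returns False and still record their definitions via SHOW CREATE TABLE.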