databrickslabs / ucx

Automated migrations to Unity Catalog
Other
235 stars 80 forks source link

[BUG]: Do not fail on `InternalError` in `Listing` #2582

Closed lotsahelp closed 1 month ago

lotsahelp commented 2 months ago

Is there an existing issue for this?

Current Behavior

crawl_permissions fails while running.

Expected Behavior

crawl_permissions should run to completion

Steps To Reproduce

No response

Cloud

Azure

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

UCX v0.35.0

22:37:34  INFO [d.labs.ucx] UCX v0.35.0 After job finishes, see debug logs at /Workspace/Applications/ucx/logs/assessment/run-1028440331975495-0/crawl_permissions.log
22:37:38  INFO [d.l.u.workspace_access.manager] Cleaning up inventory table hive_metastore.ucx.permissions
22:37:39  INFO [d.l.u.workspace_access.manager] Inventory table cleanup complete
22:41:54  INFO [d.l.u.workspace_access.generic] Listed clusters in 0:04:14.580991
22:41:54  INFO [d.l.u.workspace_access.generic] Listed cluster-policies in 0:00:00.123482
22:41:54  INFO [d.l.u.workspace_access.generic] Listed instance-pools in 0:00:00.148827
22:41:54  INFO [d.l.u.workspace_access.generic] Listed sql/warehouses in 0:00:00.149331
22:41:55  INFO [d.l.u.workspace_access.generic] Listed jobs in 0:00:00.847645
22:41:55  INFO [d.l.u.workspace_access.generic] Listed pipelines in 0:00:00.317527
22:41:56  INFO [d.l.u.workspace_access.generic] Listed serving-endpoints in 0:00:00.289363
22:41:56  INFO [d.l.u.workspace_access.generic] Listed experiments in 0:00:00.619155
22:42:02 ERROR [d.labs.ucx] Execute `databricks workspace export //Applications/ucx/logs/assessment/run-1028440331975495-0/crawl_permissions.log` locally to troubleshoot with more details. None
InternalError: None
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
File ~/.ipykernel/4414/command--1-1663341579:18
     15 entry = [ep for ep in metadata.distribution("databricks_labs_ucx").entry_points if ep.name == "runtime"]
     16 if entry:
     17   # Load and execute the entrypoint, assumes no parameters
---> 18   entry[0].load()()
     19 else:
     20   import importlib

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/runtime.py:107, in main(*argv)
    105 if len(argv) == 0:
    106     argv = sys.argv
--> 107 Workflows.all().trigger(*argv)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/runtime.py:100, in Workflows.trigger(self, *argv)
     98 ucx_logger.info(f"UCX v{__version__} After job finishes, see debug logs at {task_logger}")
     99 current_task = getattr(workflow, task_name)
--> 100 current_task(ctx)
    101 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/assessment/workflows.py:172, in Assessment.crawl_permissions(self, ctx)
    170 permission_manager = ctx.permission_manager
    171 permission_manager.cleanup()
--> 172 permission_manager.inventorize_permissions()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/workspace_access/manager.py:31, in PermissionManager.inventorize_permissions(self)
     28 def inventorize_permissions(self):
     29     # TODO: rename into snapshot()
     30     logger.debug("Crawling permissions")
---> 31     crawler_tasks = list(self._get_crawler_tasks())
     32     logger.info(f"Starting to crawl permissions. Total tasks: {len(crawler_tasks)}")
     33     items, errors = Threads.gather("crawl permissions", crawler_tasks)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/workspace_access/manager.py:164, in PermissionManager._get_crawler_tasks(self)
    162 def _get_crawler_tasks(self) -> Iterator[Callable[..., Permissions | None]]:
    163     for support in self._acl_support:
--> 164         yield from support.get_crawler_tasks()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/workspace_access/generic.py:92, in GenericPermissionsSupport.get_crawler_tasks(self)
     90     return
     91 for listing in self._listings:
---> 92     for info in listing:
     93         yield partial(self._crawler_task, info.request_type, info.object_id)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/workspace_access/generic.py:60, in Listing.__iter__(self)
     58 started = datetime.datetime.now()
     59 try:
---> 60     for item in self._func():
     61         yield GenericPermissionsInfo(getattr(item, self._id_attribute), self._object_type)
     62 except NotFound as e:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/workspace_access/generic.py:406, in models_listing.<locals>.inner()
    404 def inner() -> Iterator[ml.ModelDatabricks]:
    405     tasks = []
--> 406     for model in ws.model_registry.list_models():
    407         tasks.append(partial(ws.model_registry.get_model, name=model.name))
    408     models, errors = Threads.gather("listing model ids", tasks, num_threads)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/sdk/service/ml.py:5300, in ModelRegistryAPI.list_models(self, max_results, page_token)
   5297 headers = {'Accept': 'application/json', }
   5299 while True:
-> 5300     json = self._api.do('GET', '/api/2.0/mlflow/registered-models/list', query=query, headers=headers)
   5301     if 'registered_models' in json:
   5302         for v in json['registered_models']:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/sdk/core.py:155, in ApiClient.do(self, method, path, url, query, headers, body, raw, files, data, auth, response_headers)
    151 headers['User-Agent'] = self._user_agent_base
    152 retryable = retried(timeout=timedelta(seconds=self._retry_timeout_seconds),
    153                     is_retryable=self._is_retryable,
    154                     clock=self._cfg.clock)
--> 155 response = retryable(self._perform)(method,
    156                                     url,
    157                                     query=query,
    158                                     headers=headers,
    159                                     body=body,
    160                                     raw=raw,
    161                                     files=files,
    162                                     data=data,
    163                                     auth=auth)
    165 resp = dict()
    166 for header in response_headers if response_headers else []:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/sdk/retries.py:54, in retried.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     50         retry_reason = f'{type(err).__name__} is allowed to retry'
     52 if retry_reason is None:
     53     # raise if exception is not retryable
---> 54     raise err
     56 logger.debug(f'Retrying: {retry_reason} (sleeping ~{sleep}s)')
     57 clock.sleep(sleep + random())

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/sdk/retries.py:33, in retried.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
     31 while clock.time() < deadline:
     32     try:
---> 33         return func(*args, **kwargs)
     34     except Exception as err:
     35         last_err = err

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/sdk/core.py:273, in ApiClient._perform(self, method, url, query, headers, body, raw, files, data, auth)
    271     if is_too_many_requests_or_unavailable:
    272         error.retry_after_secs = self._parse_retry_after(response)
--> 273     raise error from None
    274 return response

InternalError: None
22:41:56 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed experiments in 0:00:00.619155
22:42:02 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/mlflow/registered-models/list
< 500 Internal Server Error
< {
<   "error_code": "INTERNAL_ERROR",
<   "message": ""
< }
JCZuurmond commented 2 months ago

@lotsahelp : Thank you for opening the issue. Unfortunately the error message is not useful 😒 . It looks like it fails on the registered-models endpoint. Could you verify if that endpoint does not work for your workspace? It might be disabled. See docs here: https://docs.databricks.com/api/workspace/registeredmodels/list

My first hunch is that ucx should handle an InternalError for a listing like we do for NotFound: https://github.com/databrickslabs/ucx/blob/de701123f4e091e0212f0151934969481b6d30c6/src/databricks/labs/ucx/workspace_access/generic.py#L57C1-L65C62

lotsahelp commented 1 month ago

@JCZuurmond after I called them directly via another method, I was able to repair the Assessment run and it seemed to work.

JCZuurmond commented 1 month ago

@lotsahelp : Thank you for reporting back. It is interesting that solved the issue, maybe the service was down exactly at the time that the crawl_permissions ran. I'll keep this issue open to make our code more robust and handle it with logging a warning instead of failing half way with an error