databrickslabs / ucx

Automated migrations to Unity Catalog

UCX Assessment Workflow breaks when executing Crawl permissions if Model Serving is not available in the Workspace Region #1228

Closed: jarnawer closed this 5 months ago

jarnawer commented 5 months ago

Is there an existing issue for this?

Current Behavior

I have a series of Workspaces deployed in the Switzerland North Azure region. Due to regulatory compliance requirements, they must stay in that exact region.

When executing the Assessment Workflow, it breaks at the "Crawl Permissions" step. The failure occurs because the step tries to list Model Serving endpoints in order to crawl their permissions, but according to the Databricks documentation, Model Serving is not available in Switzerland North (https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/model-serving-limits#--region-availability).
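
For illustration only (not part of the original report): a minimal sketch of how the underlying API call behaves with the Databricks Python SDK. In a region without Model Serving, listing serving endpoints raises NotFound (error_code FEATURE_DISABLED) instead of returning an empty list.

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound

w = WorkspaceClient()
try:
    # UCX performs the equivalent of this call while crawling permissions;
    # it maps to GET /api/2.0/serving-endpoints.
    endpoints = list(w.serving_endpoints.list())
    print(f"Found {len(endpoints)} serving endpoints")
except NotFound as err:
    # In regions where Model Serving is unavailable, the API returns
    # 404 FEATURE_DISABLED, which the SDK surfaces as NotFound.
    print(f"Model Serving unavailable in this region: {err}")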

Expected Behavior

The Crawl Permissions step should execute regardless of whether Model Serving is available in the region, and crawling Model Serving permissions should be configurable.

Steps To Reproduce

No response

Cloud

Azure

Operating System

Linux

Version

latest via Databricks CLI

Relevant log output

UCX v0.17.0
Scans the workspace-local groups and all their permissions. The list is stored in the `$inventory.permissions`
Delta table.

This is the first step for the _group migration_ process, which is continued in the `migrate-groups` workflow.
This step includes preparing Legacy Table ACLs for local group migration.

15:19:46  INFO [d.labs.ucx] UCX v0.17.0 After job finishes, see debug logs at /Workspace/Applications/ucx/logs/assessment/run-413746228294294/crawl_permissions.log
15:19:46  INFO [d.l.u.workspace_access.manager] Cleaning up inventory table hive_metastore.ucx.permissions
15:19:46  INFO [d.l.u.workspace_access.manager] Inventory table cleanup complete
15:19:46  INFO [d.l.u.workspace_access.generic] Listed clusters in 0:00:00.047167
15:19:46  INFO [d.l.u.workspace_access.generic] Listed cluster-policies in 0:00:00.038094
15:19:46  INFO [d.l.u.workspace_access.generic] Listed instance-pools in 0:00:00.049407
15:19:46  INFO [d.l.u.workspace_access.generic] Listed sql/warehouses in 0:00:00.037511
15:19:46  INFO [d.l.u.workspace_access.generic] Listed jobs in 0:00:00.162886
15:19:46  INFO [d.l.u.workspace_access.generic] Listed pipelines in 0:00:00.027808
15:19:46 ERROR [d.labs.ucx] Execute `databricks workspace export //Applications/ucx/logs/assessment/run-413746228294294/crawl_permissions.log` locally to troubleshoot with more details. Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.
nfx commented 5 months ago

@jarnawer Please include the exact stack trace.

jarnawer commented 5 months ago

Hi @nfx, what exact stack trace do you mean? I can provide the standard error output, but it has the same information.

nfx commented 5 months ago

@jarnawer we need to know the exact exception type it fails with and the exact methods and lines. It has to be in that log ;)

Execute `databricks workspace export //Applications/ucx/logs/assessment/run-413746228294294/crawl_permissions.log` locally to troubleshoot with more details. Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.

jarnawer commented 5 months ago

Wow, apologies for not reading that part. I missed it completely, sorry. Here is the export of that log:

15:19:46 INFO [databricks.labs.ucx] {MainThread} UCX v0.17.0 After job finishes, see debug logs at /Workspace/Applications/ucx/logs/assessment/run-413746228294294/crawl_permissions.log
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/preview/scim/v2/Groups?attributes=id,displayName,meta,roles,entitlements&startIndex=1&count=100
< 200 OK
< {
<   "Resources": [
<     {
<       "displayName": "G-dspA0916001-chn-test-Contributor",
<       "entitlements": [
<         {
<           "value": "**REDACTED**"
<         },
<         {
<           "value": "**REDACTED**"
<         },
<         {
<           "value": "**REDACTED**"
<         },
<         "... (1 additional elements)"
<       ],
<       "id": "133503787841503",
<       "meta": {
<         "resourceType": "Group"
<       }
<     },
<     "... (7 additional elements)"
<   ],
<   "itemsPerPage": 8,
<   "schemas": [
<     "urn:ietf:params:scim:api:messages:2.0:ListResponse"
<   ],
<   "startIndex": 1,
<   "totalResults": 8
< }
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/preview/scim/v2/Groups?attributes=id,displayName,meta,roles,entitlements&startIndex=9&count=100
< 200 OK
< {
<   "itemsPerPage": 0,
<   "schemas": [
<     "urn:ietf:params:scim:api:messages:2.0:ListResponse"
<   ],
<   "startIndex": 9,
<   "totalResults": 8
< }
15:19:46 INFO [databricks.labs.ucx.workspace_access.manager] {MainThread} Cleaning up inventory table hive_metastore.ucx.permissions
15:19:46 DEBUG [databricks.labs.lsql.backends] {MainThread} [spark][execute] DROP TABLE IF EXISTS hive_metastore.ucx.permissions
15:19:46 INFO [databricks.labs.ucx.workspace_access.manager] {MainThread} Inventory table cleanup complete
15:19:46 DEBUG [databricks.labs.ucx.workspace_access.manager] {MainThread} Crawling permissions
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/clusters/list
< 200 OK
< {
<   "clusters": [
<     {
<       "autotermination_minutes": 0,
<       "azure_attributes": {
<         "availability": "ON_DEMAND_AZURE",
<         "first_on_demand": 1,
<         "spot_bid_max_price": -1.0
<       },
<       "cluster_cores": 4.0,
<       "cluster_id": "0321-151452-bfwajp40",
<       "cluster_memory_mb": 8192,
<       "cluster_name": "job-261840359341744-run-413746228294294-main",
<       "cluster_source": "JOB",
<       "creator_user_name": "***",
<       "custom_tags": {
<         "ResourceClass": "SingleNode",
<         "version": "v0.17.0"
<       },
<       "data_security_mode": "LEGACY_SINGLE_USER",
<       "default_tags": {
<         "ClusterId": "0321-151452-bfwajp40",
<         "ClusterName": "job-261840359341744-run-413746228294294-main",
<         "Creator": "***",
<         "JobId": "261840359341744",
<         "RunName": "[UCX] assessment",
<         "Vendor": "Databricks",
<         "applicationId": "A0916",
<         "applicationName": "CIT-O DSP",
<         "environment": "test",
<         "expirationDate": "2022-12-31",
<         "owner": "***",
<         "platformId": "PLF0070",
<         "requester": "***",
<         "serviceCode": "MRCS"
<       },
<       "disk_spec": {},
<       "driver": {
<         "host_private_ip": "10.44.15.199",
<         "instance_id": "2baab114e64041c2a62856d405663f62",
<         "node_attributes": {
<           "is_spot": false
<         },
<         "node_id": "d66c5c9136794e11b7c8fc857ba3eeac",
<         "private_ip": "10.44.15.136",
<         "public_dns": "",
<         "start_timestamp": 1711034093240
<       },
<       "driver_healthy": true,
<       "driver_instance_source": {
<         "node_type_id": "Standard_F4s"
<       },
<       "driver_node_type_id": "Standard_F4s",
<       "effective_spark_version": "14.3.x-scala2.12",
<       "enable_elastic_disk": true,
<       "enable_local_disk_encryption": false,
<       "init_scripts_safe_mode": false,
<       "instance_source": {
<         "node_type_id": "Standard_F4s"
<       },
<       "jdbc_port": 10000,
<       "last_activity_time": 1711034138911,
<       "last_restarted_time": 1711034180128,
<       "last_state_loss_time": 0,
<       "node_type_id": "Standard_F4s",
<       "num_workers": 0,
<       "policy_id": "00103CF50C32348D",
<       "single_user_name": "***",
<       "spark_conf": {
<         "spark.databricks.cluster.profile": "singleNode",
<         "spark.master": "local[*]"
<       },
<       "spark_context_id": 2439339821249193680,
<       "spark_version": "14.3.x-scala2.12",
<       "start_time": 1711034092311,
<       "state": "RUNNING",
<       "state_message": ""
<     },
<     "... (27 additional elements)"
<   ]
< }
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed clusters in 0:00:00.047167
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/policies/clusters/list
< 200 OK
< {
<   "policies": [
<     {
<       "created_at_timestamp": 1709814294000,
<       "definition": "{\"access_mode\":{\"hidden\":true,\"type\":\"fixed\",\"value\":\"SINGLE_USER\"},\"autotermination_minutes\":{\"... (876 more bytes)",
<       "is_default": false,
<       "name": "Data Engineer Cluster Policy",
<       "policy_id": "0004B0557261BC4E"
<     },
<     "... (9 additional elements)"
<   ],
<   "total_count": 10
< }
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed cluster-policies in 0:00:00.038094
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/instance-pools/list
< 200 OK
< {}
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed instance-pools in 0:00:00.049407
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/sql/warehouses
< 200 OK
< {
<   "warehouses": [
<     {
<       "auto_resume": true,
<       "auto_stop_mins": 120,
<       "cluster_size": "X-Small",
<       "creator_id": 8956159007198419,
<       "creator_name": "9b6c040b-3cc8-4280-be90-f6e705f5c25d",
<       "enable_photon": true,
<       "enable_serverless_compute": false,
<       "health": {
<         "status": "HEALTHY"
<       },
<       "id": "267906863f8dfbbc",
<       "jdbc_url": "jdbc:spark://adb-2923397360314017.17.azuredatabricks.net:443/default;transportMode=http;ssl=1;Au... (55 more bytes)",
<       "max_num_clusters": 5,
<       "min_num_clusters": 1,
<       "name": "Data Engineer SQL Warehouse",
<       "num_active_sessions": 0,
<       "num_clusters": 1,
<       "odbc_params": {
<         "hostname": "adb-2923397360314017.17.azuredatabricks.net",
<         "path": "/sql/1.0/warehouses/267906863f8dfbbc",
<         "port": 443,
<         "protocol": "https"
<       },
<       "size": "XSMALL",
<       "spot_instance_policy": "COST_OPTIMIZED",
<       "state": "RUNNING",
<       "tags": {
<         "custom_tags": [
<           {
<             "key": "user_group",
<             "value": "**REDACTED**"
<           }
<         ]
<       },
<       "warehouse_type": "PRO"
<     },
<     "... (2 additional elements)"
<   ]
< }
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed sql/warehouses in 0:00:00.037511
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.1/jobs/list
< 200 OK
< {
<   "has_more": true,
<   "jobs": [
<     {
<       "created_time": 1710846542457,
<       "creator_user_name": "***",
<       "job_id": 261840359341744,
<       "settings": {
<         "email_notifications": {
<           "no_alert_for_skipped_runs": false,
<           "on_failure": [
<             "***"
<           ],
<           "on_success": [
<             "***"
<           ]
<         },
<         "format": "MULTI_TASK",
<         "max_concurrent_runs": 1,
<         "name": "[UCX] assessment",
<         "tags": {
<           "version": "v0.17.0"
<         },
<         "timeout_seconds": 0
<       }
<     },
<     "... (19 additional elements)"
<   ],
<   "next_page_token": "CAEo7rm7080xMJvl09WB65YB"
< }
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.1/jobs/list?page_token=CAEo7rm7080xMJvl09WB65YB
< 200 OK
< {
<   "has_more": false,
<   "jobs": [
<     {
<       "created_time": 1699007989618,
<       "creator_user_name": "***",
<       "job_id": 354194666446075,
<       "settings": {
<         "email_notifications": {
<           "no_alert_for_skipped_runs": false
<         },
<         "format": "MULTI_TASK",
<         "max_concurrent_runs": 10,
<         "name": "create_table",
<         "timeout_seconds": 0
<       }
<     },
<     "... (1 additional elements)"
<   ],
<   "prev_page_token": "CAAo8o6SprkxMPu5mfq1xFA="
< }
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed jobs in 0:00:00.162886
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/pipelines
< 200 OK
< {}
15:19:46 INFO [databricks.labs.ucx.workspace_access.generic] {MainThread} Listed pipelines in 0:00:00.027808
15:19:46 DEBUG [databricks.sdk] {MainThread} GET /api/2.0/serving-endpoints
< 404 Not Found
< {
<   "error_code": "FEATURE_DISABLED",
<   "message": "Model serving is not enabled for your shard. Please contact your organization admin or Databrick... (10 more bytes)"
< }
15:19:46 ERROR [databricks.labs.ucx] {MainThread} Execute `databricks workspace export //Applications/ucx/logs/assessment/run-413746228294294/crawl_permissions.log` locally to troubleshoot with more details. Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.
15:19:46 DEBUG [databricks] {MainThread} Task crash details
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/framework/tasks.py", line 255, in run_task
    current_task.fn(cfg, workspace_client, sql_backend, installation)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/runtime.py", line 240, in crawl_permissions
    permission_manager.inventorize_permissions()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/workspace_access/manager.py", line 94, in inventorize_permissions
    crawler_tasks = list(self._get_crawler_tasks())
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/workspace_access/manager.py", line 221, in _get_crawler_tasks
    yield from support.get_crawler_tasks()
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/workspace_access/generic.py", line 74, in get_crawler_tasks
    for info in listing:
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/labs/ucx/workspace_access/generic.py", line 58, in __iter__
    for item in self._func():
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/service/serving.py", line 2596, in list
    json = self._api.do('GET', '/api/2.0/serving-endpoints', headers=headers)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/core.py", line 130, in do
    response = retryable(self._perform)(method,
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/retries.py", line 54, in wrapper
    raise err
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/retries.py", line 33, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/databricks/sdk/core.py", line 238, in _perform
    raise self._make_nicer_error(response=response, **payload) from None
databricks.sdk.errors.platform.NotFound: Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.
nfx commented 5 months ago

@jarnawer okay, the fix is simple: surround these lines with `try: ... except NotFound: pass` or something like that.

https://github.com/databrickslabs/ucx/blob/a2a939d9dcb257e8131bc4634075e126c51e94c2/src/databricks/labs/ucx/workspace_access/generic.py#L58-L59
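
For reference, a minimal sketch of what that guard could look like. The names below are modeled on the Listing.__iter__ shape visible in the traceback above; the real class in workspace_access/generic.py differs in detail.

from collections.abc import Callable, Iterable
from typing import Any

from databricks.sdk.errors import NotFound


class Listing:
    def __init__(self, func: Callable[[], Iterable[Any]]):
        self._func = func

    def __iter__(self):
        try:
            # A 404 FEATURE_DISABLED response (e.g. from
            # GET /api/2.0/serving-endpoints) surfaces here as NotFound.
            yield from self._func()
        except NotFound:
            # Feature unavailable in this workspace region: end the
            # listing instead of failing the whole crawl_permissions task.
            return

With a guard like this, an unavailable feature produces an empty listing and the assessment continues instead of aborting.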

RobMandersBJSS commented 4 months ago

I'm sorry to say it does not appear this issue has been resolved.

I've attempted to re-run the assessment pipeline today following the release of v0.22.0, which includes the fix in PR #1275; however, the crawl_permissions job is still failing on the same 'FEATURE_DISABLED' error I was experiencing last week (v0.21.0).

This was a fresh installation of UCX and I have confirmed I am running the latest version:


// version.json
{
  "version": "0.22.0",
  "wheel": "/Applications/ucx/wheels/databricks_labs_ucx-0.22.0-py3-none-any.whl",
  "date": "2024-04-29T08:43:20.038438+00:00"
}

I've attached the full logs in crawl_permissions.log, here:

crawl_permissions.log

Please let me know if there is anything you need or if I've missed some configuration.

Thank you.

rafa-arana commented 2 months ago

Just hit this in UCX v0.27.1

InternalError: Listing serving-endpoints failed: Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.
---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
File ~/.ipykernel/4656/command--1-3817622542:18
     15 entry = [ep for ep in metadata.distribution("databricks_labs_ucx").entry_points if ep.name == "runtime"]
     16 if entry:
     17   # Load and execute the entrypoint, assumes no parameters
---> 18   entry[0].load()()
     19 else:
     20   import importlib

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/runtime.py:103, in main(*argv)
    101 if len(argv) == 0:
    102     argv = sys.argv
--> 103 Workflows.all().trigger(*argv)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/runtime.py:80, in Workflows.trigger(self, *argv)
     78 workflow = self._workflows[workflow_name]
     79 if task_name == "parse_logs":
---> 80     return ctx.task_run_warning_recorder.snapshot()
     81 # `{{parent_run_id}}` is the run of entire workflow, whereas `{{run_id}}` is the run of a task
     82 workflow_run_id = named_parameters.get("parent_run_id", "unknown_run_id")

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/databricks/labs/ucx/installer/logs.py:203, in TaskRunWarningRecorder.snapshot(self)
    201     error_messages.append(message)
    202 if len(error_messages) > 0:
--> 203     raise InternalError("\n".join(error_messages))
    204 return log_records

InternalError: Listing serving-endpoints failed: Model serving is not enabled for your shard. Please contact your organization admin or Databricks support.