Added pytesseract to known list (#3235). The known.json file, which tracks packages with native code, now includes pytesseract, an Optical Character Recognition (OCR) tool for Python. This change improves the handling of pytesseract and its native dependencies within the codebase and addresses part of issue #1931.
Added hyperlink to database names in database summary dashboard (#3310). Database names in the Database Summary dashboard are now clickable and open the corresponding database page in a new tab. This has been accomplished by adding a linkUrlTemplate property to the database field in the encodings object within the overrides property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue #3258. Furthermore, other statistics, such as the number of tables, views, and grants, have been converted to links, improving navigation of the dashboard.
Bump codecov/codecov-action from 4 to 5 (#3316). In this release, the version of the codecov/codecov-action dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including binary, gcov_args, gcov_executable, gcov_ignore, gcov_include, report_type, skip_validation, and swift_project. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking.
Depend on a Databricks SDK release compatible with 0.31.0 (#3273). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new InvalidState error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the pyproject.toml file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project.
Eliminate redundant migration-index refresh and loads during view migration (#3223). In this pull request, we have optimized the view migration process in the databricks/labs/ucx/hive_metastore/table_metastore.py file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class TableMigrationIndex and imported the TableMigrationStatusRefresher class. The _migrate_views method now takes an additional argument migration_index, which is used in the ViewsMigrationSequencer and in the _migrate_view method. The _view_can_be_migrated and _sql_migrate_view methods now also take migration_index as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly.
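The idea of loading the migration index once and passing it down can be sketched as follows. This is a minimal illustration, not the UCX implementation: `View`, `CountingIndexLoader`, and the function names are hypothetical stand-ins, and the index is modelled as a plain set of migrated table names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Hypothetical view description: a name plus the tables it reads from."""
    name: str
    depends_on: tuple[str, ...]


class CountingIndexLoader:
    """Hypothetical stand-in for an expensive migration-status lookup."""

    def __init__(self, migrated):
        self._migrated = set(migrated)
        self.loads = 0  # counts how often the snapshot is loaded

    def load(self):
        self.loads += 1
        return frozenset(self._migrated)


def view_can_be_migrated(view, migration_index):
    # A view is ready when every table it depends on is already migrated.
    return all(dep in migration_index for dep in view.depends_on)


def migrate_views(views, loader):
    # Load the index once and reuse it for every view,
    # instead of reloading it inside the loop.
    migration_index = loader.load()
    migrated = []
    for view in views:
        if view_can_be_migrated(view, migration_index):
            migrated.append(view.name)
    return migrated
```

With two views and a single migrated table, the loader in this sketch is called exactly once regardless of how many views are checked.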
Fixed backwards compatibility breakage from Databricks SDK (#3324). In this release, we have addressed a backwards compatibility issue (Issue #3324) that was caused by an update to the Databricks SDK. This was done by adding new methods to the databricks.sdk.service module to interact with dashboards. Additionally, we have fixed bug #3322 and updated the create function in the conftest.py file to utilize the new dashboards module and its Dashboard class. The function now returns the dashboard object as a dictionary and calls the publish method on this object to publish the dashboard. These changes also include an update to the pyproject.toml file, which affects the test and coverage scripts used in the default environment. The minimum coverage threshold has been lowered from 90% to 89%: the test command now includes the --cov-fail-under=89 flag, so the test run fails if coverage drops below 89%.
Fixed issue with cleanup of failed create-missing-principals command (#3243). In this update, we have improved the create_uc_roles method within the access.py file of the databricks/labs/ucx/aws directory to handle failures during role creation caused by permission issues. If a failure occurs, the method now deletes any created roles before raising the exception, restoring the system to its initial state. This ensures that the system remains consistent and prevents the accumulation of partially created roles. The update includes a try-except block around the code that creates the role and adds a policy to it, and it logs an error message, deletes any previously created roles, and raises the exception again if a PermissionDenied or NotFound exception is raised during this process. We have also added unit tests to verify the behavior of the updated method, covering the scenario where a failure occurs and the roles are successfully deleted. These changes aim to improve the robustness of the databricks labs ucx create-missing-principals command by handling permission errors and restoring the system to its initial state.
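The cleanup-on-failure pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: `FakeRoleClient`, `create_roles`, and the two exception classes are hypothetical stand-ins defined here, not the actual UCX or Databricks SDK classes.

```python
import logging

logger = logging.getLogger(__name__)


class PermissionDenied(Exception):
    """Hypothetical stand-in for a permission error."""


class NotFound(Exception):
    """Hypothetical stand-in for a not-found error."""


class FakeRoleClient:
    """Hypothetical in-memory role API; creation fails for names in `deny`."""

    def __init__(self, deny=()):
        self.roles = set()
        self._deny = set(deny)

    def create_role(self, name):
        if name in self._deny:
            raise PermissionDenied(f"cannot create {name}")
        self.roles.add(name)

    def delete_role(self, name):
        self.roles.discard(name)


def create_roles(client, names):
    created = []
    for name in names:
        try:
            client.create_role(name)
            created.append(name)
        except (PermissionDenied, NotFound):
            logger.error("Failed to create role %s; deleting roles created so far", name)
            # Roll back: remove everything created before the failure.
            for done in created:
                client.delete_role(done)
            raise  # re-raise so the caller still sees the failure
    return created
```

If the second of two roles fails, the first is deleted again and the exception propagates, so no partially created roles remain.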
Improve error handling for assess_workflows task (#3255). This pull request introduces improvements to the assess_workflows task in the databricks/labs/ucx module, focusing on error handling and logging. A new error type, DatabricksError, has been added to handle Databricks-specific exceptions in the _temporary_copy method, ensuring proper handling and re-raising of Databricks-related errors as InvalidPath exceptions. Additionally, log levels for various errors have been updated to better reflect their severity. Recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors now have their log levels changed from error to warning. These adjustments provide more fine-grained control over error messages' severity and help avoid unnecessary alarm when these issues occur. These changes improve the robustness, error handling, and logging of the assess_workflows task, ensuring appropriate handling and logging of any errors that may occur during execution.
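The two behaviours described above, translating a Databricks-specific error into InvalidPath and logging recoverable problems at warning level, can be sketched as follows. The exception classes and helper functions here are hypothetical stand-ins written for illustration; they are not the UCX code.

```python
import logging

logger = logging.getLogger("assess_workflows_sketch")


class DatabricksError(Exception):
    """Hypothetical stand-in for the SDK's base error."""


class InvalidPath(ValueError):
    """Hypothetical stand-in for the error raised for unusable paths."""


def temporary_copy(download, path):
    # `download` is any callable that fetches the file; assumed for this sketch.
    try:
        return download(path)
    except DatabricksError as e:
        # Re-raise as InvalidPath, keeping the original error as the cause.
        raise InvalidPath(f"Cannot load file {path}") from e


def decode_source(raw, path):
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Recoverable: log at warning level instead of error, then skip.
        logger.warning("Skipping %s: not valid UTF-8", path)
        return None
```

Chaining with `from e` keeps the original Databricks error visible in the traceback while callers only need to handle InvalidPath.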
Require at least 4 cores for UCX VMs (#3229). In this release, the selection of node_type_id in the policy.py file has been updated to require a minimum of 4 cores for UCX VMs, in addition to requiring local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the node_type_id parameter, so that only Virtual Machines (VMs) with at least 4 cores are used for UCX, improving performance and reliability.
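The selection criteria above (at least 4 cores, at least 32 GB of memory, local disk) amount to a filter over the available node types. The sketch below is illustrative only: `NodeType`, `pick_node_type`, and the node type ids are hypothetical, and the tie-breaking rule (smallest qualifying node) is an assumption rather than something stated in the release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """Hypothetical description of a VM type."""
    node_type_id: str
    num_cores: int
    memory_gb: int
    has_local_disk: bool


def pick_node_type(node_types, *, min_cores=4, min_memory_gb=32, local_disk=True):
    candidates = [
        nt for nt in node_types
        if nt.num_cores >= min_cores
        and nt.memory_gb >= min_memory_gb
        and (nt.has_local_disk or not local_disk)
    ]
    if not candidates:
        raise ValueError("no node type satisfies the requirements")
    # Assumption for this sketch: prefer the smallest qualifying node.
    smallest = min(candidates, key=lambda nt: (nt.num_cores, nt.memory_gb))
    return smallest.node_type_id
```

A 2-core node with 32 GB of memory would have passed a memory-only check, but is rejected once the core minimum is applied.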
Skip test_feature_tables integration test (#3326). This change skips the test_feature_tables integration test. It resolves issues #3304 and #3.
Speed up update_migration_status jobs by eliminating lots of redundant SQL queries (#3200). In this release, the _retrieve_acls method in the grants.py file has been updated to remove the _is_migrated method and inline its functionality, resulting in improved performance for update_migration_status jobs. The _is_migrated method previously queried the migration status index for each table, but the updated method now refreshes the index once and then uses it for all checks, eliminating redundant SQL queries. Affected workflows include migrate-tables, migrate-external-hiveserde-tables-in-place-experimental, migrate-external-tables-ctas, scan-tables-in-mounts-experimental, and migrate-tables-in-mounts-experimental, all of which have been updated to utilize the refreshed migration status index and remove dead code. This release also includes updates to existing unit tests and integration tests to ensure the changes' correctness.
Tech Debt: Fixed issue with incorrect unit test practice (#3244). In this release, we have improved the test suite for our AWS module. Specifically, the test case for test_get_uc_compatible_roles in tests/unit/aws/test_access.py has been updated to remove mocking code and directly call the save_uc_compatible_roles method, improving the accuracy and reliability of the test. Additionally, the MagicMock for the load method in the mock_installation object has been removed, further simplifying the test code and making it easier to understand.
Updated migration-progress-experimental workflow to crawl tables from the main cluster (#3269). In this release, we have updated the migration-progress-experimental workflow to crawl tables from the main cluster instead of the tacl one. This change resolves issue #3268 and addresses the problem of the Py4j bridge required for crawling not being available in the tacl cluster, leading to failures. The setup_tacl job task has been removed, and the crawl_tables task has been updated to no longer rely on the TACL cluster, instead refreshing the inventory directly. A new dependency has been added to ensure that the crawl_tables task runs after the verify_prerequisites task. The refresh_table_migration_status task and update_tables_history_log task have also been updated to assume that the inventory and migration status have been refreshed in the previous step. A TODO has been added to avoid triggering an implicit refresh if either the table or migration-status inventory is empty.
Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 (#3241). In this pull request, we have updated the databricks-labs-lsql requirement in the pyproject.toml file to allow versions from 0.5 up to, but not including, 0.14, permitting newer releases of this library. The pull request includes release notes and a changelog from the databricks-labs-lsql GitHub repository, detailing new features, bug fixes, and improvements. Notable changes include the addition of the escape_name and escape_full_name functions, various dependency updates, and modifications to the as_dict() method in the Row class.
Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 (#3321). In this release, the databricks-labs-lsql package requirement has been updated to version '>=0.5,<0.15' in the pyproject.toml file. This update addresses multiple issues and includes several improvements, such as bug fixes, dependency updates, and the addition of go-git libraries. The RuntimeBackend component has been improved with better exception handling, and new escape_name and escape_full_name functions have been added for SQL name escaping. The 'Row.as_dict()' method has been deprecated in favor of 'asDict()'. The SchemaDeployer class now allows overwriting the default hive_metastore catalog, and the MockBackend component has been improved to properly mock the savetable method in append mode. Filter specification files have been converted from JSON to YAML format for improved readability. Additionally, the test suite has been expanded, and various methods have been updated to improve codebase readability, maintainability, and ease of use.
Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 (#3320). In this release, we have updated the project's dependency on sqlglot, keeping the minimum required version at 25.5.0 and raising the upper bound from below 25.30 to below 25.32. This allows newer sqlglot releases, whose fixes and improvements are detailed in its changelog. The individual commits have been truncated and can be viewed in the compare view. Dependabot will resolve any merge conflicts, as long as the pull request is not manually altered. Dependabot can be instructed to perform specific actions, like rebase, recreate, merge, cancel merge, reopen, or close the pull request, by commenting on the PR with the corresponding commands.
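To make the effect of the new range concrete, the following sketch checks a few version strings against the old and new bounds. It is a deliberately simplified check written for illustration: `parse_version` and `satisfies` are hypothetical helpers that handle only dotted numeric versions, whereas real packaging tools follow the full PEP 440 rules.

```python
def parse_version(text, width=3):
    # Split "25.5.0" into (25, 5, 0); pad short versions such as "25.32" with zeros.
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def satisfies(version, lower, upper):
    """True when lower <= version < upper, i.e. a '>=lower,<upper' range."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)
```

For example, `satisfies("25.31.0", "25.5.0", "25.32")` holds under the new range, while the same version is rejected by the old upper bound of 25.30.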
Use internal Permissions Migration API by default (#3230). This pull request introduces support for both legacy and new permission migration workflows in the Databricks UCX project. A new configuration option, use_legacy_permission_migration, has been added to WorkspaceConfig to toggle between the two workflows. When the legacy workflow is not enabled, certain steps in workflows.py are skipped and related methods have been renamed to reflect the legacy workflow. The GroupMigration class has been renamed to LegacyGroupMigration and integration and unit tests have been updated to use the new configuration option and renamed classes/methods. The new workflow no longer queries the hive_metastore.ucx.groups table in certain methods, resulting in changes to the behavior of the test_runtime_workspace_listing and test_runtime_crawl_permissions tests. Overall, these changes provide flexibility for users to choose between legacy and new permission migration workflows in the Databricks UCX project.
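A configuration toggle of this kind can be sketched as follows. Only the option name use_legacy_permission_migration comes from the notes above; the `WorkspaceConfigSketch` class, the default value of False (inferred from the entry title), and the step names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkspaceConfigSketch:
    """Hypothetical, reduced stand-in for the real configuration class."""
    # Assumed default: the new workflow is used unless the legacy one is requested.
    use_legacy_permission_migration: bool = False


def permission_migration_steps(config):
    # Step names are hypothetical; they only show which branch is taken.
    if config.use_legacy_permission_migration:
        return ["crawl_permissions", "apply_permissions_legacy"]
    return ["migrate_permissions_via_api"]
```

With the default configuration the legacy steps are skipped entirely; setting the flag to True restores them.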
Dependency updates:
Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 (#3241).
Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 (#3321).
Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 (#3320).