Your best companion for upgrading to Unity Catalog. UCX will guide you, the Databricks customer, through the process of upgrading your account, groups, workspaces, jobs etc. to Unity Catalog.
Added handling for exceptions with no error_code attribute while crawling permissions (#2079). This release improves error handling in the assessment job's permission crawling. Previously, an exception that lacked an error_code attribute caused an AttributeError. The inventorize_permissions function in manager.py now checks that the error_code attribute exists before accessing it. If the attribute is missing, the error is logged and added to the list of acute errors. A new unit test, test_manager_inventorize_fail_with_error, verifies the permission manager's behavior when errors occur during the inventory process by raising DatabricksError and TimeoutError instances with and without error_code attributes. This update resolves issue #2078 and makes the assessment job's permission crawling more robust.
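The attribute check described above can be sketched as follows. This is a minimal illustration, not the code in manager.py: FakeApiError, IGNORABLE_CODES, and classify_errors are hypothetical names, and the set of ignorable codes is an assumption made only to give the example a second branch.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: some error codes are treated as non-acute. The real list is not
# given in the release notes; this single entry is purely illustrative.
IGNORABLE_CODES = {"RESOURCE_DOES_NOT_EXIST"}


class FakeApiError(Exception):
    """Hypothetical stand-in for an SDK error that carries an error_code."""

    def __init__(self, message, error_code):
        super().__init__(message)
        self.error_code = error_code


def classify_errors(errors):
    """Return the errors that should be treated as acute."""
    acute = []
    for error in errors:
        # getattr with a default avoids AttributeError on plain exceptions
        # such as TimeoutError, which have no error_code attribute.
        error_code = getattr(error, "error_code", None)
        if error_code is None:
            logger.error("Unknown error while crawling permissions: %s", error)
            acute.append(error)
            continue
        if error_code in IGNORABLE_CODES:
            logger.warning("Skipping ignorable error: %s", error)
            continue
        acute.append(error)
    return acute
```

Using getattr with a default keeps the happy path unchanged while ensuring that an exception without the attribute is still reported rather than masking the original failure.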
Added handling for missing permission to read file (#1949). In this release, we've addressed an issue where missing permissions to read a file during linting were not being handled properly. The revised code now checks for NotFound and PermissionError exceptions when attempting to read a file's text content. If a NotFound exception occurs, the function returns None and logs a warning message. If a PermissionError exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue #1942 and partially resolves issue #1952, improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.
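A minimal sketch of the read-with-fallback behavior described above. The release notes mention a NotFound exception. This example uses the standard library's FileNotFoundError as a stand-in so that it runs without the Databricks SDK, and safe_read_text is a hypothetical name.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def safe_read_text(path: Path) -> Optional[str]:
    """Return the file's text, or None if it is missing or unreadable."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Stand-in for the NotFound case: warn and let the caller skip the file.
        logger.warning("File not found: %s", path)
        return None
    except PermissionError:
        # exc_info=True attaches the traceback to the warning.
        logger.warning("Missing read permission for %s", path, exc_info=True)
        return None
```

Returning None instead of raising lets a linter skip one unreadable file and carry on with the rest.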
Added handling for unauthenticated exception while joining collection (#1958). A new exception type, Unauthenticated, has been added to the import statement, and new error messages have been implemented in the _sync_collection and _get_collection_workspace functions to notify users when they do not have admin access to the workspace. A try-except block has been added in the _get_collection_workspace function to handle the Unauthenticated exception, and a warning message is logged indicating that the user needs account admin and workspace admin credentials to enable collection joining and to run the join-collection command with account admin credentials. Additionally, a new CLI command has been added, and the existing databricks labs ucx ... command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.
Added tracking for UCX workflows and as-library usage (#1966). This commit introduces User-Agent tracking for UCX workflows and library usage, adding ucx/<version>, cmd/install, and cmd/<workflow> elements to relevant requests. These changes are implemented within the test_useragent.py file, which includes the new http_fixture_server context manager for testing User-Agent propagation in UCX workflows. The addition of with_user_agent_extra and the inclusion of with_product functions from databricks.sdk.core aim to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.
Analyse altair (#2005). This release whitelists the altair library, addressing issue #1901. The known.json file now includes the altair package with its modules and sub-modules, including altair, altair._magics, altair.expr, altair.utils, altair.utils._dfi_types, altair.utils._importers, and altair.utils._show, among others. No new functionality has been introduced, and the changes have been manually verified. This change was developed by Eric Vergnaud.
Analyse azure (#2016). In this release, we have made updates to the whitelist of several Azure libraries, including 'azure-common', 'azure-core', 'azure-mgmt-core', 'azure-mgmt-digitaltwins', and 'azure-storage-blob'. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. The azure-common library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.
Analyse causal-learn (#2012). In this release, we have added causal-learn to the whitelist in our JSON file, signifying that it is now a supported library. This update includes the addition of various modules, classes, and functions to 'causal-learn'. We would like to emphasize that there are no changes to existing functionality, nor have any new methods been added. This release is thoroughly tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.
Analyse databricks-arc (#2004). This release introduces whitelisting for the databricks-arc library, which is used for data analytics and machine learning. The release updates the known.json file to include databricks-arc and its related modules such as arc.autolinker, arc.sql, arc.sql.enable_arc, arc.utils, and arc.utils.utils. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to the databricks-feature-engineering library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. The databricks.ml_features library has several updates, including changes to the _spark_client and publish_engine. The databricks.ml_features.entities module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.
Analyse dbldatagen (#1985). The dbldatagen package has been whitelisted in the known.json file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects within dbldatagen. This includes enhancements to dbldatagen._version, dbldatagen.column_generation_spec, dbldatagen.column_spec_options, dbldatagen.constraints, dbldatagen.data_analyzer, dbldatagen.data_generator, dbldatagen.datagen_constants, dbldatagen.datasets, and related classes. Additionally, dbldatagen.datasets.basic_geometries, dbldatagen.datasets.basic_process_historian, dbldatagen.datasets.basic_telematics, dbldatagen.datasets.basic_user, dbldatagen.datasets.benchmark_groupby, dbldatagen.datasets.dataset_provider, dbldatagen.datasets.multi_table_telephony_provider, and dbldatagen.datasets_object have been updated. The distribution methods, such as dbldatagen.distributions, dbldatagen.distributions.beta, dbldatagen.distributions.data_distribution, dbldatagen.distributions.exponential_distribution, dbldatagen.distributions.gamma, and dbldatagen.distributions.normal_distribution, have also seen improvements. Furthermore, dbldatagen.function_builder, dbldatagen.html_utils, dbldatagen.nrange, dbldatagen.schema_parser, dbldatagen.spark_singleton, dbldatagen.text_generator_plugins, and dbldatagen.text_generators have been updated. The dbldatagen.data_generator method now includes a warning about the deprecated sparkContext in shared clusters, and dbldatagen.schema_parser includes updates related to the table_name argument in various SQL statements. These changes ensure better compatibility and improved functionality of the dbldatagen package.
Analyse delta-spark (#1987). In this release, the delta-spark component within the delta project has been whitelisted with the inclusion of a new entry in the known.json configuration file. This addition brings in several sub-components, including delta._typing, delta.exceptions, and delta.tables, each with a jvm-access-in-shared-clusters error code and message for unsupported environments. These changes aim to enhance the handling of delta-spark component within the delta project. The changes have been rigorously tested and do not introduce new functionality or modify existing behavior. This update is ensured to provide better stability and compatibility to the project. Co-authored by Eric Vergnaud.
Analyse diffusers (#2010). A new diffusers category has been added to the JSON configuration file, featuring several subcategories and numerous empty arrays as values. This change serves to prepare the configuration for future additions, without altering any existing methods or behaviors. As such, this update does not impact current functionality, but instead, sets the stage for further development. No associated tests or functional changes accompany this modification.
Analyse faker (#2014). In this release, the faker library in the Databricks project has undergone whitelisting, addressing security concerns, improving performance, and reducing the attack surface. No new methods were added, and the existing functionality remains unchanged. Thorough manual verification of the tests has been conducted. This release introduces various modules and submodules related to the faker library, expanding its capabilities in address generation in multiple languages and countries, along with new providers for bank, barcode, color, company, credit_card, currency, date_time, emoji, file, geo, internet, isbn, job, lorem, misc, passport, person, phone_number, profile, python, sbn, ssn, and user_agent generation. Software engineers should find these improvements advantageous for their projects, offering a broader range of options and enhanced performance.
Analyse fastcluster (#1980). In this release, the project's configuration has been updated to include the fastcluster package in the approved libraries whitelist, as part of issue #1901 resolution. This change enables software engineers to utilize the functions and methods provided by fastcluster in the project's codebase. The fastcluster package is now registered in the known.json configuration file, and its integration has been thoroughly tested to ensure seamless functionality. By incorporating fastcluster, the project's capabilities are expanded, allowing software engineers to benefit from its optimized clustering algorithms and performance enhancements.
Analyse glow (#1973). In this release, we have analyzed and added the glow library and its modules, including glow._array, glow._coro, glow._debug, and others, to the known.json file whitelist. This change allows for seamless integration and usage of the glow library in your projects. It is important to note that this update does not modify any existing functionality and has been thoroughly tested to ensure compatibility. Software engineers utilizing the glow library will benefit from this enhancement, as it provides explicit approval for the library and its modules, facilitating a more efficient development process.
Analyse graphframes (#1990). In this release, the graphframes library has been thoroughly analyzed and the whitelist updated accordingly. This includes the addition of several new entries, such as graphframes.examples.belief_propagation, graphframes.examples.graphs, graphframes.graphframe, graphframes.lib.aggregate_messages, and graphframes.tests. These changes may require modifications such as rewriting code to use Spark or accessing the Spark Driver JVM. These updates aim to improve compatibility with UC Shared Clusters, ensuring a more seamless integration. Manual testing has been conducted to ensure the changes are functioning as intended.
Analyse graphviz (#2008). In this release, we have analyzed and whitelisted the graphviz library for use in the project. The library has been added to the known.json file, which is used to manage dependencies. The graphviz package contains several modules and sub-modules, including backend, dot, exceptions, graphs, jupyter_integration, parameters, rendering, and saving. While we do not have detailed information on the functionality provided by these modules at this time, they have been manually tested for correct functioning. This addition enhances the project's graphing and visualization capabilities by incorporating the well-regarded graphviz library.
Analyse hyperopt (#1970). In this release, we have made changes to include the hyperopt library in our project, addressing issue #1901. This integration does not introduce any new methods or modify existing functionality, and has been manually tested. The hyperopt package now includes several new modules, such as hyperopt.algobase, hyperopt.anneal, hyperopt.atpe, and many others, encompassing various components like classes, functions, and tests. Notably, some of these modules support integration with Spark and MongoDB. The known.json file has also been updated to reflect these additions.
Analyse ipywidgets (#1972). A new commit has been added to whitelist the ipywidgets package, enabling its usage within our open-source library. No new functionality or changes have been introduced in this commit. The package has undergone manual testing to ensure proper functionality. The primary modification involves adding ipywidgets to the known.json file whitelist, which includes various modules and sub-modules used for testing, IPython interaction, handling dates and times, and managing widget outputs. This update simply permits the utilization of the ipywidgets package and its related modules and sub-modules.
Analyse johnsnowlabs (#1997). The johnsnowlabs package, used for natural language processing and machine learning tasks, has been added to the whitelist in this release. This package includes various modules and sub-packages, such as auto_install, finance, frameworks, johnsnowlabs, lab, legal, llm, medical, nlp, py_models, serve, settings, utils, and visual, which provide a range of classes and functions for working with data and models in the context of NLP and machine learning. Note that this commit also raises deprecation warnings related to file system paths and access to the Spark Driver JVM in shared clusters, indicating potential compatibility issues or limitations; however, the exact impact or scope of these issues cannot be determined from the provided commit message.
Analyse langchain (#1975). In this release, the langchain module has been added to the JSON file and whitelisted for use. This module encompasses a variety of sub-modules, such as '_api', '_api.deprecation', '_api.interactive_env', and '_api.module_import', among others. Additionally, there are sub-modules related to adapters for various services, including 'openai', 'amadeus', 'azure_cognitive_services', 'conversational_retrieval', and 'clickup'. The conversational_retrieval sub-module contains a toolkit for openai functions and a standalone tool. Specific changes, functionality details, and testing information have not been provided in the commit message, so refer to the documentation and tests for further details.
Analyse lifelines (#2006). In this release, we have whitelisted the lifelines package, a powerful Python library for survival analysis and hazard rate estimation. This addition brings a comprehensive suite of functionalities, such as data sets, exceptions, utilities, version checking, statistical calculations, and plotting tools. The fitters category is particularly noteworthy, providing numerous classes for fitting various survival models, including Aalen's Additive Fitter, Cox proportional hazards models, Exponential Fitter, Generalized Gamma Fitter, Kaplan-Meier Fitter, Log-Logistic Fitter, Log-Normal Fitter, Mixture Cure Fitter, Nelson-Aalen Fitter, Piecewise Exponential Fitter, and Weibull Fitter. By whitelisting this library, users can now leverage its capabilities to enhance their projects with advanced survival analysis features.
Analyse megatron (#1982). In this release, we have made updates to the known.json file to include the whitelisting of the megatron module. While there are no new functional changes or accompanying tests for this update, it is important to note the addition of new keys to the known.json file, which is used to specify approved modules and functions in the codebase. The added keys for megatron include megatron.io, megatron.layers, megatron.nodes, megatron.utils, and megatron.visuals. These additions will ensure that any code referencing these modules or functions will not be flagged as unknown or unapproved, promoting a consistent and manageable codebase. This update is particularly useful in larger projects where keeping track of approved modules and functions can be challenging. For more information, please refer to linked issue #1901.
Analyse numba (#1978). In this release, we have added Numba, a just-in-time compiler for Python, to our project's whitelist. This addition is reflected in the updated JSON file that maps package names to package versions, which now includes various Numba modules such as 'numba.core', 'numba.cuda', and 'numba.np', along with their respective submodules and functions. Numba is now available for import and will be used in the project, enhancing the performance of our Python code. The new entries in the JSON file have been manually verified, and no changes to existing functionality have been made.
Analyse omegaconf (#1992). This commit introduces omegaconf, a configuration library that provides a simple and flexible way to manage application configurations, to the project's whitelist, which was reviewed and approved by Eric Vergnaud. The addition of omegaconf and its various modules, including base, base container, dict config, error handling, grammar, list config, nodes, resolver, opaque container, and versioning modules, as well as plugins for pydevd, enables the project to utilize this library for configuration management. No existing functionality is affected, and no new methods have been added. This change is limited to the addition of omegaconf to the whitelist and the inclusion of its modules, and it has been manually tested. Overall, this change allows the project to leverage the omegaconf library to enhance the management of application configurations.
Analyse patool (#1988). In this release, we have made changes to the src/databricks/labs/ucx/source_code/known.json file by whitelisting patool. This change, related to issue #1901, does not introduce any new functionality but adds an entry for patool along with several new keys corresponding to various utilities and programs associated with it. The whitelisting process has been carried out manually, and the changes have been thoroughly tested to ensure their proper functioning. This update is targeted towards software engineers seeking to enhance their understanding of the library's modifications. Co-authored by Eric Vergnaud.
Analyse peft (#1994). In this release, we've added the peft key and its associated modules to the 'known.json' file located in the 'databricks/labs/ucx/source_code' directory. The peft module includes several sub-modules, such as 'peft.auto', 'peft.config', 'peft.helpers', 'peft.import_utils', 'peft.mapping', 'peft.mixed_model', 'peft.peft_model', and 'peft.tuners', among others. The 'peft.tuners' module implements various tuning strategies for machine learning models and includes sub-modules like 'peft.tuners.adalora', 'peft.tuners.adaption_prompt', 'peft.tuners.boft', 'peft.tuners.ia3', 'peft.tuners.ln_tuning', 'peft.tuners.loha', 'peft.tuners.lokr', 'peft.tuners.lora', 'peft.tuners.multitask_prompt_tuning', 'peft.tuners.oft', 'peft.tuners.p_tuning', 'peft.tuners.poly', 'peft.tuners.prefix_tuning', 'peft.tuners.prompt_tuning', 'peft.tuners.vera', and 'peft.utils', which contains several utility functions. This addition provides new functionalities for machine learning model tuning and utility functions to the project.
Analyse seaborn (#1977). In this release, the open-source library's dependency whitelist has been updated to include 'seaborn'. This enables the library to utilize seaborn in the project. Furthermore, several Azure libraries such as 'azure-cosmos' and 'azure-storage-blob' have been updated to their latest versions. Numerous other libraries, such as 'certifi', 'cffi', 'charset-normalizer', 'idna', 'numpy', 'pandas', 'pycparser', 'pyOpenSSL', 'python-dateutil', 'pytz', 'requests', 'six', and 'urllib3', have also been updated to their latest versions. Issue #1901 is still a work in progress, and this change does not include any specific functional changes or tests.
Analyse shap (#1993). A new commit by Eric Vergnaud has been added to the project, whitelisting the Shap library for use. Shap is an open-source library that provides explanations for the output of machine learning models. This commit integrates several of Shap's modules into our project, enabling their import without any warnings. The inclusion of these modules does not affect existing functionalities, ensuring a smooth and stable user experience. This update enhances our project's capabilities by providing a more comprehensive explanation of machine learning model outputs, thanks to the integration of the Shap library.
Analyse sklearn (#1979). In this release, we have added sklearn to the whitelist in the known.json file as part of issue #190
Analyse sktime (#2007). In this release, we've expanded our machine learning capabilities by adding the sktime library to our whitelist. Sktime is a library specifically designed for machine learning on time series data, and includes components for preprocessing, modeling, and evaluation. This addition includes a variety of directories and modules related to time series analysis, such as distances and kernels, network architectures, parameter estimation, performance metrics, pipelines, probability distributions, and more. Additionally, we've added tests for many of these modules to ensure proper functionality. Furthermore, we've also added the smmap library to our whitelist, providing a drop-in replacement for the built-in python file object, which allows random access to large files that are too large to fit into memory. These additions will enable our software to handle larger datasets and perform advanced time series analysis.
Analyse spark-nlp (#1981). In this release, the open-source spark-nlp library has been added to the whitelist, enhancing compatibility and accessibility for software engineers. The addition of spark-nlp to the whitelist is a non-functional change, but it is expected to improve the overall integration with other libraries. This change has been thoroughly tested to ensure compatibility and reliability, making it a valuable addition for developers working with this library.
Analyse spark-ocr (#2011). A new open-source library, spark-ocr, has been added to the recognized and supported libraries within the system, following the successful whitelisting in the known.json file. This change, tracking issue #1901, does not introduce new functionality or modify existing features but enables all methods and functionality associated with spark-ocr for usage. The software engineering team has manually tested the integration, ensuring the seamless adoption for engineers incorporating this project. Please note that specific details of the spark-ocr methods are not provided in the commit message. This development benefits software engineers seeking to utilize the spark-ocr library within the project.
Analyse tf-quant-finance (#2015). In this release, we are excited to announce the whitelisting of the tf-quant-finance library, a comprehensive and versatile toolkit for financial modeling and analysis. This open-source library brings a wide range of functionalities to our project, including various numerical methods such as finite difference, integration, and interpolation, as well as modules for financial instruments, pricing platforms, stochastic volatility models, and rate curves. The library also includes modules for mathematical functions, optimization, and root search, enhancing our capabilities in these areas. Furthermore, tf-quant-finance provides a variety of finance models, such as Cox-Ingersoll-Ross (CIR), Heston, Hull-White, SABR, and more, expanding our repertoire of financial models. Lastly, the library includes modules for rates, such as constant forward, Hagan-West, and Nelson-Siegel-Svensson models, providing more options for rate modeling. We believe that this addition will significantly enhance our project's capabilities and enable us to tackle more complex financial modeling tasks with ease.
Analyse trl (#1998). In this release, we have integrated the trl library into our project, which is a tool for training, running, and logging AI models. This inclusion is aimed at addressing issue #1901. The trl library has been whitelisted in the known.json file, resulting in extensive changes to the file. While no new functionality has been introduced in this commit, the trl library provides various methods for running and training models, as well as utilities for CLI scripts and environment setup. These changes have been manually tested by our team, including Eric Vergnaud. We encourage software engineers to explore the new library and use it to enhance the project's capabilities.
Analyse unstructured (#2013). This release includes the addition of new test cases for various modules and methods within the unstructured library, such as chunking, cleaners, documents, embed, file_utils, metrics, nlp, partition, staging, and unit_utils. The test cases cover a range of functionalities, including HTML and PDF parsing, text extraction, embedding, file conversion, and encoding detection. The goal is to improve the library's overall robustness and reliability by increasing test coverage for different components.
Dashboard: N/A instead of NULL readiness while assessment job hasn't yet provided any data (#1910). In this release, we have improved the behavior of the readiness counter on the workspace UC readiness dashboard. Previously, if the assessment job did not provide any data, the readiness counter would display a NULL value, which could be confusing for users. With this change, the readiness counter now displays 'N/A' instead of NULL in such cases. This behavior is implemented by modifying the SELECT statement in the 00_0_compatibility.sql file, specifically the calculation of the readiness counter. The COALESCE function is used to return 'N/A' if the result of the calculation is NULL. This enhancement ensures that users are not confused by the presence of a NULL value when there is no data available yet.
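The COALESCE fallback can be illustrated with a small, self-contained sketch. It runs against an in-memory SQLite database rather than Databricks SQL, and the objects table, its ready column, and the percentage formula are assumptions for illustration. The actual expression in 00_0_compatibility.sql is not shown in these notes.

```python
import sqlite3

# Illustrative query: with no rows, SUM(ready) is NULL, so the whole
# percentage expression is NULL and COALESCE falls back to 'N/A'.
READINESS_QUERY = """
SELECT COALESCE(
    CAST(ROUND(100.0 * SUM(ready) / COUNT(*)) AS INTEGER) || '%',
    'N/A'
) AS readiness
FROM objects
"""


def readiness(flags):
    """Compute the readiness label for a list of 0/1 readiness flags."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE objects (ready INTEGER)")
        conn.executemany("INSERT INTO objects VALUES (?)", [(f,) for f in flags])
        return conn.execute(READINESS_QUERY).fetchone()[0]
    finally:
        conn.close()
```

With an empty table the function returns 'N/A', and with three ready objects out of four it returns '75%'.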
Do not migrate READ_METADATA to BROWSE on tables and schemas (#2022). This change affects how the READ_METADATA privilege on tables and schemas is handled during migration from hive_metastore to UC. READ_METADATA is no longer translated to the BROWSE privilege on UC tables and schemas, because UC supports the BROWSE privilege only on catalog objects. Without this change, the migrate tables workflow logs would contain error messages, causing confusion for users. In the uc_grant_sql method in the grants.py file, the lines for TABLE and DATABASE with the READ_METADATA privilege have been removed. Tests in the test_grants.py file have been updated to reflect these changes, so unsupported privileges are no longer granted.
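One way to picture the change is a lookup table from (object type, Hive privilege) to a UC grant template, with no entries for READ_METADATA on tables or databases. The mapping below is a hypothetical sketch, not the contents of uc_grant_sql. In particular, the CATALOG entry is an assumption based only on the statement that UC supports BROWSE on catalog objects.

```python
from typing import Optional

# Hypothetical mapping. Note that ("TABLE", "READ_METADATA") and
# ("DATABASE", "READ_METADATA") are deliberately absent, so no BROWSE
# grant is generated for tables or schemas.
GRANT_TEMPLATES = {
    ("TABLE", "SELECT"): "GRANT SELECT ON TABLE {key} TO `{principal}`",
    ("DATABASE", "USAGE"): "GRANT USE SCHEMA ON DATABASE {key} TO `{principal}`",
    ("CATALOG", "READ_METADATA"): "GRANT BROWSE ON CATALOG {key} TO `{principal}`",
}


def sketch_uc_grant_sql(
    object_type: str, action_type: str, key: str, principal: str
) -> Optional[str]:
    """Return a UC GRANT statement, or None if the privilege is not migrated."""
    template = GRANT_TEMPLATES.get((object_type, action_type))
    if template is None:
        return None
    return template.format(key=key, principal=principal)
```

Returning None for unmapped pairs lets the caller skip the grant quietly instead of issuing a statement that UC would reject.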
Exclude VIEW from "Non-DELTA format: UNKNOWN" findings in assessment summary chart (#2025). This release includes updates to the assessment main dashboard's assessment summary chart, specifically addressing the "Non-DELTA format: UNKNOWN" finding. Previously, views were mistakenly included in this finding, causing confusion for customers who couldn't locate any unknown format tables. The issue has been resolved by modifying a SQL file to filter results based on object type and table format, ensuring that non-DELTA format tables are only included if the object type is not a view. This enhancement prevents views from being erroneously counted in the "Non-DELTA format: UNKNOWN" finding, providing clearer and more accurate assessment results for users.
Explain unused variable (#1946). In this release, the make_dbfs_data_copy fixture in our open-source library has been updated to address an unused variable issue related to the _ variable, which was previously assigned the value of make_cluster but was not utilized in the fixture. This change was implemented on April 16th, and it was only recently identified by make fmt. Additionally, the fixture now includes an if statement that initializes a CommandExecutor object to execute commands on the cluster if the workspace configuration is on AWS. These updates improve the code's readability and maintainability, ensuring that it functions optimally for our software engineer users.
Expose code linters as a LSP plugin (#1921). UCX has added a PyLSP plugin for its code linters, which will be automatically registered when python-lsp-server is installed. This integration allows users to utilize code linters without any additional setup, improving the code linter functionality of UCX by enabling it to be used as an LSP plugin and providing separate linters and fixers for Python and SQL. The changes include a new Failure class, an updated Deprecation class, and a pylsp_lint function implemented using the pylsp library to lint the code. The LinterContext and Diagnostic classes have been imported, and the pylsp_lint function takes in a Workspace and Document object. The associated tests have been updated, including manual testing, unit tests, and tests on the staging environment. The new feature also includes methods to lint code for use in UC Shared Clusters and return diagnostic information about any issues found, which can serve as a guide for users to rewrite their code as needed.
Fixed grant visibility and classification (#1911). This pull request introduces changes to the grants function in the grants.py file, addressing issues with grant visibility and classification in the underlying inventory. The _crawl function has been updated to distinguish between tables and views, and a new dictionary, _grants_reported_as, has been added to map reported object types for grants to their actual types. The grants function now includes a modification to normalize object types using the new dictionary. The assessment workflow and the grant_detail view have also been modified. The changes to the grants function may affect grant classification and display, and it is recommended to review relevant user documentation for accuracy. Additionally, tests have been conducted to ensure functionality, including unit tests, integration tests, and manual testing. No new methods have been added, but existing functionality in the _crawl method in the tables.py file has been changed.
Fixed substituting regex with empty string (#1953). This release includes a fix for issue #1922 where regular expressions were being replaced with empty strings, causing problems in the assessment.crawl_groups and migrate-groups workflows. The groups.py file has been modified to include changes to the GroupMigrationStrategy classes, such as the addition of workspace_group_regex and account_group_regex attributes, and their compiled versions. The __init__ method for RegexSubStrategy and RegexMatchStrategy now takes these regex arguments. The _safe_match method now takes a regex pattern instead of a string, and the _safe_sub method takes a compiled regex pattern and replacement string as arguments. The ConfigureGroups class includes a new _valid_substitute_pattern attribute and an updated _is_valid_substitute_str method to validate the substitution string. The new RegexSubStrategy method replaces the name of the group in the workspace with an empty string when matched by the specified regex. Unit tests and manual testing have been conducted to ensure the correct functionality of these changes.
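The sketch below shows substitution with a compiled pattern where an empty replacement string is a valid input and is kept distinct from a missing replacement. The function safe_sub is a hypothetical illustration, not the _safe_sub method in groups.py, and the group names are invented.

```python
import re
from typing import Optional


def safe_sub(name: str, pattern: re.Pattern, replacement: Optional[str]) -> Optional[str]:
    """Substitute matches of pattern in name, or return None on failure."""
    # Compare against None explicitly: a truthiness check would wrongly
    # reject "", which is a legitimate way to strip a prefix or suffix.
    if replacement is None:
        return None
    try:
        return pattern.sub(replacement, name)
    except re.error:
        # An invalid replacement template (for example a bad group reference).
        return None
```

For example, stripping a "corp_" prefix from "corp_data_eng" with an empty replacement yields "data_eng", while passing None as the replacement yields None.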
Group migration: continue permission migration even if one or more groups fails (#1924). This update introduces changes to the group migration process, specifically the permission migration stage. If an error occurs during the migration of a group's permissions, the migration will continue with the next group, and any errors will be raised as a ManyError exception at the end. The information about successful and failed groups is currently only logged, not persisted. The group-migration workflow now includes a new class, ManyError, and a new method, apply_permissions, in the PermissionsMigrationAPI class, handling the migration of permissions for a group and raising a ManyError exception if necessary. The commit also includes modified unit tests to ensure the proper functioning of the updated workflow. These changes aim to improve the robustness and reliability of the group migration process by allowing it to continue in the face of errors and by providing better error handling and reporting.
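The continue-then-raise behaviour described here follows a common pattern: collect errors per group and raise a single aggregate exception at the end. The sketch below uses a hypothetical ManyErrors class and a hypothetical migrate_all function as stand-ins; it does not reproduce the actual ManyError or PermissionsMigrationAPI code.

```python
import logging

logger = logging.getLogger(__name__)


class ManyErrors(Exception):
    """Hypothetical stand-in for an aggregate exception holding several errors."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} group(s) failed to migrate")
        self.errors = list(errors)


def migrate_all(groups, apply_permissions):
    """Migrate every group; keep going on failure and raise once at the end."""
    succeeded, errors = [], []
    for group in groups:
        try:
            apply_permissions(group)
            succeeded.append(group)
        except Exception as e:  # sketch: real code would catch narrower error types
            logger.error("Migration failed for %s: %s", group, e)
            errors.append(e)
    logger.info("Migrated %d group(s), %d failed", len(succeeded), len(errors))
    if errors:
        raise ManyErrors(errors)
    return succeeded
```

As in the entry above, the outcome per group is only logged here; nothing is persisted.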
Group renaming: wait for consistency before completing task (#1944). In this release, we have made significant updates to the group-migration workflow in databricks/labs/ucx/workspace_access/groups.py to ensure that group renaming is completed before the task is marked as done. This change was made to address the issue of eventual consistency in group renaming, which could cause downstream tasks to encounter problems. We have added unit tests for various scenarios, including the snapshot_with_group_created_in_account_console_should_be_considered, rename_groups_should_patch_eligible_groups, rename_groups_should_wait_for_renames_to_complete, rename_groups_should_retry_on_internal_error, and rename_groups_should_fail_if_unknown_name_observed cases. The rename_groups_should_wait_for_renames_to_complete test uses a mock time.sleep function to simulate the passage of time and verifies that the group renaming operation waits for the rename to be detected. Additionally, the rename_groups_should_retry_on_internal_error test uses a mock WorkspaceClient object to simulate an internal error and verifies that the group renaming operation retries the failed operation. The rename_groups_should_fail_if_unknown_name_observed test simulates a situation where a concurrent process is interfering with the group renaming operation and verifies that the operation fails immediately instead of waiting for a timeout to occur. These updates are crucial for ensuring the reliability and consistency of group renaming operations in our workflow.
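Waiting out eventual consistency usually comes down to polling until the new name is observed or a deadline passes. The sketch below is a minimal version of that idea; the function name, parameters and the injectable sleep and clock arguments are assumptions made for testability, not the UCX implementation.

```python
import time


def wait_for_rename(list_group_names, expected_name, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until expected_name shows up in the listing, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        if expected_name in list_group_names():
            return
        if clock() >= deadline:
            raise TimeoutError(f"Rename to {expected_name!r} not visible after {timeout}s")
        sleep(interval)
```

Passing sleep and clock as parameters lets a test replace them with fakes, in the same spirit as the mocked time.sleep mentioned in the entry.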
Improved support for magic commands in python cells (#1905). This commit enhances support for magic commands in python cells, specifically %pip and !pip, by improving parsing and execution of cells containing magic lines and ensuring proper pip dependency handling. It includes changes to existing commands, workflows, and the addition of new ones, as well as a new table and classes such as DependencyProblem and MagicCommand. The PipCell class has been updated to PythonCell. New methods build_dependency_graph and convert_magic_lines_to_magic_commands have been added, and several tests have been updated and added to ensure functionality. The changes have been unit and integration tested and manually verified on the staging environment.
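To illustrate what separating magic lines from Python code can look like, here is a small standard-library sketch. The function name is hypothetical and the logic is far simpler than the MagicCommand and PythonCell classes mentioned above; it only recognises %pip and !pip lines.

```python
import shlex


def split_pip_magics(cell_source):
    """Split a cell into pip command token lists and the remaining Python source."""
    pip_commands, python_lines = [], []
    for line in cell_source.splitlines():
        stripped = line.strip()
        tokens = []
        if stripped[:1] in ("%", "!"):
            try:
                tokens = shlex.split(stripped[1:])
            except ValueError:
                tokens = []  # unbalanced quotes: leave the line with the Python code
        if tokens[:1] == ["pip"]:
            pip_commands.append(tokens)
        else:
            python_lines.append(line)
    return pip_commands, "\n".join(python_lines)
```

The pip token lists could then feed dependency resolution, while the remaining lines go to the Python parser.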
Include findings on DENY grants during assessment (#1903). This pull request introduces support for flagging DENY permissions on objects that cannot be migrated to Unity Catalog (UC). It includes modifications to the grant_detail view and adds new integration tests for existing grant-scanning, resolving issue #1869 and superseding #1890. A new column, failures, has been added to the grant_detail view to indicate explicit DENY privileges that are not supported in UC. The assessment workflow has been updated to include a new step that identifies incompatible object privileges, while new and existing methods have been updated to support flagging DENY permissions. The changes have been documented for users, and the assessment workflow and related SQL queries have been updated accordingly. The PR also clarifies that no new CLI command has been added, and no existing commands or tables have been modified. Tests have been conducted manually and integration tests have been added to ensure the changes work as expected.
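The failures column can be thought of as the result of a small per-grant check. The sketch below assumes that DENY grants are recorded with an action type prefixed by DENIED_ (for example DENIED_SELECT); that prefix, the function name and the message wording are assumptions, and the real logic lives in the grant_detail view's SQL.

```python
def deny_failures(action_type):
    """Return a list of failure messages for a grant's action type (empty if none).

    Assumption: explicit DENY grants appear as action types starting with "DENIED_".
    """
    if action_type.upper().startswith("DENIED_"):
        return [f"Explicitly DENYing privileges is not supported in UC: {action_type}"]
    return []
```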
Infer linted values that resolve to dbutils.widgets.get (#1891). This change includes several updates to improve handling of linter context and session state in dependency graphs, as well as enhancements to the inference of values for dbutils.widgets.get calls. The linter_context_factory method now includes a new parameter, session_state, which defaults to None. The LocalFileMigrator and LocalCodeLinter classes use a lambda function to call linter_context_factory with the session_state parameter, and the DependencyGraph class includes a new method, CurrentSessionState, to extract SysPathChange from the tree. The get_notebook_paths method now accepts a CurrentSessionState parameter, and the build_local_file_dependency_graph method has been updated to accept this parameter as well. These changes enhance the flexibility of the linter context and improve the accuracy of dbutils.widgets.get value inference.
Infer values across notebook cells (#1968). This commit introduces a new feature to the linter that infers values across notebook cells when linting Python code, resolving 60 out of 891 "cannot be computed" advices. The changes include the addition of new classes PythonLinter and PythonSequentialLinter, as well as the modification of the Fixer class to accept a list of Linter instances as input. The updated linter takes into account not only the code from the current cell but also the code from previous cells, improving value inference and accuracy during linting. The changes have been manually tested and accompanied by added unit tests. This feature progresses issues #1912 and #1205.
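The idea of carrying knowledge from earlier cells into later ones can be sketched with the standard ast module. This toy version only tracks top-level assignments of constants and simple name aliases; the function name is hypothetical, and UCX's PythonSequentialLinter is considerably more capable.

```python
import ast


def infer_across_cells(cells, name):
    """Return the constant bound to `name` after running through all cells, else None."""
    known = {}
    for source in cells:
        for node in ast.parse(source).body:
            if not isinstance(node, ast.Assign) or len(node.targets) != 1:
                continue
            target = node.targets[0]
            if not isinstance(target, ast.Name):
                continue
            value = node.value
            if isinstance(value, ast.Constant):
                known[target.id] = value.value
            elif isinstance(value, ast.Name) and value.id in known:
                known[target.id] = known[value.id]  # alias of an already known value
            else:
                known.pop(target.id, None)  # reassigned to something not inferable
    return known.get(name)
```

A value assigned in the first cell is therefore available when a later cell passes the variable to a call such as spark.table(...).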
Log the right amount of lint problems (#2024). A fix has been implemented to address an issue with the incorrect reporting of lint problems due to a change in #1956. The logger now accurately reports the number of linting problems found during the execution of linting tasks in parallel. The length of job_problems is now calculated after flattening the list, resulting in a more precise count. This improvement enhances the reliability of the linting process, ensuring that users are informed of the correct number of issues present in their code.
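The counting issue is easy to reproduce: with one list of problems per job, the length of the outer list is the number of jobs, not the number of problems. A minimal sketch with made-up data and a hypothetical function name:

```python
from itertools import chain


def count_problems(per_job_problems):
    """Count individual problems by flattening the per-job lists first."""
    job_problems = list(chain.from_iterable(per_job_problems))
    return len(job_problems)


per_job = [["p1", "p2"], [], ["p3", "p4"]]
jobs_linted = len(per_job)               # 3: this is what a pre-flattening count reports
problems_found = count_problems(per_job)  # 4: the figure that should be logged
```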
Normalize python code before parsing (#1918). This commit addresses the issue of copy-pasted Python code failing to parse and lint due to illegal leading spaces. Co-authored by Eric Vergnaud, it introduces normalization of code through the new normalize_and_parse method in the Tree class, which first normalizes the code by removing illegal leading spaces and then parses it. This change improves the code linter's ability to handle previously unparseable code and does not affect functionality. New unit tests have been added to ensure correctness, and modifications to the PythonCell and PipMagic classes enhance processing and handling of multiline code, magic commands, and pip commands. The pull request also includes a new test to check if the normalization process ignores magic markers in multiline comments, improving the reliability of parsing and linting copy-pasted Python code.
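The effect of normalizing before parsing can be shown with the standard library. This sketch swaps in textwrap.dedent and ast.parse as stand-ins for the Tree class described above, and the function name is hypothetical; it only handles uniformly indented snippets.

```python
import ast
import textwrap


def dedent_and_parse(code):
    """Strip common leading whitespace, then parse the result into an AST."""
    return ast.parse(textwrap.dedent(code))


pasted = "    x = 1\n    y = x + 1\n"  # copy-pasted with stray indentation
try:
    ast.parse(pasted)
    parses_raw = True
except IndentationError:
    parses_raw = False

tree = dedent_and_parse(pasted)  # parses once the leading spaces are removed
```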
Prompt about joining a collection of ucx installs early (#1963). The databricks labs install ucx command has been updated to prompt the user early on to join a collection of UCX installs. Users who are not account admins can now enter their workspace ID to join as a collection, or skip joining if they prefer. This change includes modifications to the join_collection method to include a prompt message and handle cases where the user is not an account admin. A PermissionDenied exception has been added for users who do not have account admin permissions and cannot list workspaces. This change was made to streamline the installation process and reduce potential confusion for users. Additionally, tests have been conducted, both manually and through existing unit tests, to ensure the proper functioning of the updated command. This modification was co-authored by Serge Smertin and is intended to improve the overall user experience.
Raise lint errors after persisting workflow problems in the inventory database (#1956). The refresh_report method in jobs.py has been updated to raise lint errors after persisting workflow problems in the inventory database. This change includes adding a new import statement for ManyError and modifying the existing import statement for Threads from databricks.labs.blueprint.parallel. The method signature for Threads.strict has been changed to Threads.gather with a new argument 'linting workflows'. The problems list has been replaced with a job_problems, errors tuple, and the job_problems list is flattened using itertools.chain before writing it to the inventory database. If there are any errors during the execution of tasks, a ManyError exception is raised with the list of errors. This development helps to visualize known workflow problems by raising lint errors after persisting them in the inventory database, addressing issue #1952, and has been manually tested for accuracy.
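The ordering matters here: results are persisted first, and only then are task errors raised. The sketch below reproduces that flow with the standard library; gather is a simplified, hypothetical analogue of a results-and-errors helper (not the blueprint Threads API), and ManyErrors and the persist callback are stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


class ManyErrors(Exception):
    """Hypothetical aggregate exception."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} task(s) failed")
        self.errors = list(errors)


def gather(tasks):
    """Run all tasks in threads; return (results, errors) instead of failing fast."""
    results, errors = [], []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:  # submission order keeps results deterministic
            try:
                results.append(future.result())
            except Exception as e:
                errors.append(e)
    return results, errors


def refresh_report(tasks, persist):
    """Persist every problem that was found, then surface task errors."""
    per_job_problems, errors = gather(tasks)
    job_problems = list(chain.from_iterable(per_job_problems))
    persist(job_problems)  # written before any exception is raised
    if errors:
        raise ManyErrors(errors)
    return job_problems
```

Because persist runs before the raise, problems from successful tasks remain visible even when other tasks fail.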
Removing the workspace network requirement info in README.md (#1948). In this release, we have improved the installation process of UCX, an open-source tool used for deploying assets to selected workspaces. The requirement for the workspace network to have access to pypi.org for downloading certain packages was removed and addressed in a previous issue, so the README no longer mentions it. UCX can now be installed in the /Applications/ucx directory, which is a change from the previous location of /Users/<your user>/.ucx/. This update simplifies the installation process and enhances the user experience. Software engineers who are already familiar with UCX and its installation process will benefit from this update. For advanced installation instructions, please refer to the corresponding section in the documentation.
Use dedicated advice code for uncomputed values (#2019). This commit introduces dedicated advice codes for handling uncomputed values in various scenarios, enhancing error messages and improving the precision of feedback provided during the linting process. Changes include implementing notebook-run-cannot-compute-value to replace dbutils-notebook-run-dynamic in the _raise_advice_if_unresolved function, providing more accurate and specific information when the path for 'dbutils.notebook.run' cannot be computed. A new advice code table-migrate-cannot-compute-value has been added to indicate that a table name argument cannot be computed during linting. Additionally, the new advice code sys-path-cannot-compute-value is used in the dependency resolver, replacing the previous sys-path-cannot-compute code. These updates lead to more precise and informative error messages, aiding in debugging processes. No new methods have been added, and existing functionality remains unchanged. Unit tests have been executed, and they passed. These improvements target software engineers looking to benefit from more accurate error messages and better guidance for debugging.
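Anyone who filters stored advices by code may want to translate the older codes named above into their replacements. A minimal sketch, using only the code pairs mentioned in this entry; the dictionary and function names are hypothetical, and whether older records need translating at all depends on the setup.

```python
# Old advice code -> dedicated replacement, as named in the entry above.
RENAMED_ADVICE_CODES = {
    "dbutils-notebook-run-dynamic": "notebook-run-cannot-compute-value",
    "sys-path-cannot-compute": "sys-path-cannot-compute-value",
}


def current_advice_code(code):
    """Map a possibly outdated advice code to its current name."""
    return RENAMED_ADVICE_CODES.get(code, code)
```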
Use dedicated advice code for unsupported sql (#2018). In the latest commit, Eric Vergnaud introduced a new advice code sql-query-unsupported-sql for unsupported SQL queries in the lint function of the queries.py file. This change is aimed at handling unsupported SQL gracefully, providing a more specific error message compared to the previous generic table-migrate advice code. Additionally, an exception for unsupported SQL has been implemented in the linter for DBFS, utilizing a new code 'dbfs-query-unsupported-sql'. This modification is intended to improve the handling of SQL queries that are not currently supported, potentially aiding in better integration with future SQL parsing tools. However, it should be noted that this change has not been tested.
Catch sqlglot exceptions and convert them to advices (#1915). In this release, SQL parsing errors are now handled using SQLGlot and converted to Failure advices, with the addition of unit tests and refactoring of the affected code block. A new Failure exception class has been introduced in the databricks.labs.ucx.source_code.base module, which is used when a SQL query cannot be parsed by sqlglot. A change in the behavior of the SQL parser now generates a Failure object instead of silently returning an empty list when sqlglot fails to process a query. This change enhances transparency in error handling and helps developers understand when and why a query has failed to parse. The commit progresses issue #1901 and is co-authored by Eric Vergnaud and Andrew Snare.
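Turning a parser exception into an advice instead of an empty result looks roughly like the sketch below. The Failure dataclass, the sql-parse-error code and the injected parse and check callables are all hypothetical stand-ins, so the example runs without sqlglot installed.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical stand-in for an advice describing a lint failure."""
    code: str
    message: str


def lint_sql(sql, parse, check):
    """Yield advices for a query; a parse error becomes a Failure, not silence."""
    try:
        statements = parse(sql)
    except Exception as e:  # sketch: narrow this to the parser's own error type
        yield Failure("sql-parse-error", f"Could not parse SQL: {e}")
        return
    for statement in statements:
        yield from check(statement)
```

With sqlglot available, parse could be a thin wrapper around its parsing entry point, and the except clause would be narrowed to that library's error types.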
Dependency updates:
Updated sqlglot requirement from <25.1,>=23.9 to >=23.9,<25.2 (#1904).
Updated sqlglot requirement from <25.2,>=23.9 to >=23.9,<25.3 (#1917).
Updated databricks-sdk requirement from <0.29,>=0.27 to >=0.27,<0.30 (#1943).
Updated sqlglot requirement from <25.3,>=23.9 to >=25.4.1,<25.5 (#1959).
Updated databricks-labs-lsql requirement from ~=0.4.0 to >=0.4,<0.6 (#2076).
Updated sqlglot requirement from <25.5,>=25.4.1 to >=25.5.0,<25.6 (#2084).
exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue #1942 and partially resolves issue #1952, improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.databricks labs ucx ...
command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.ucx/<version>
,cmd/install
, andcmd/<workflow>
elements to relevant requests. These changes are implemented within thetest_useragent.py
file, which includes the newhttp_fixture_server
context manager for testing User-Agent propagation in UCX workflows. The addition ofwith_user_agent_extra
and the inclusion ofwith_product
functions fromdatabricks.sdk.core
aim to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.altair
(#2005). In this release, the open-source library has undergone a whitelisting of thealtair
library, addressing issue #1901. The changes involve the addition of several modules and sub-modules under thealtair
package, includingaltair
,altair._magics
,altair.expr
, and various others such asaltair.utils
,altair.utils._dfi_types
,altair.utils._importers
, andaltair.utils._show
. Additionally, modifications have been made to theknown.json
file to include thealtair
package. It is important to note that no new functionalities have been introduced, and the changes have been manually verified. This release has been developed by Eric Vergnaud.azure
(#2016). In this release, we have made updates to the whitelist of several Azure libraries, including 'azure-common', 'azure-core', 'azure-mgmt-core', 'azure-mgmt-digitaltwins', and 'azure-storage-blob'. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. Theazure-common
library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.causal-learn
(#2012). In this release, we have addedcausal-learn
to the whitelist in our JSON file, signifying that it is now a supported library. This update includes the addition of various modules, classes, and functions to 'causal-learn'. We would like to emphasize that there are no changes to existing functionality, nor have any new methods been added. This release is thoroughly tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.databricks-arc
(#2004). This release introduces whitelisting for thedatabricks-arc
library, which is used for data analytics and machine learning. The release updates theknown.json
file to includedatabricks-arc
and its related modules such asarc.autolinker
,arc.sql
,arc.sql.enable_arc
,arc.utils
, andarc.utils.utils
. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to thedatabricks-feature-engineering
library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. Thedatabricks.ml_features
library has several updates, including changes to the_spark_client
andpublish_engine
. Thedatabricks.ml_features.entities
module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.dbldatagen
(#1985). Thedbldatagen
package has been whitelisted in theknown.json
file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects withindbldatagen
. This includes enhancements todbldatagen._version
,dbldatagen.column_generation_spec
,dbldatagen.column_spec_options
,dbldatagen.constraints
,dbldatagen.data_analyzer
,dbldatagen.data_generator
,dbldatagen.datagen_constants
,dbldatagen.datasets
, and related classes. Additionally,dbldatagen.datasets.basic_geometries
,dbldatagen.datasets.basic_process_historian
,dbldatagen.datasets.basic_telematics
,dbldatagen.datasets.basic_user
,dbldatagen.datasets.benchmark_groupby
,dbldatagen.datasets.dataset_provider
,dbldatagen.datasets.multi_table_telephony_provider
, anddbldatagen.datasets_object
have been updated. The distribution methods, such asdbldatagen.distributions
,dbldatagen.distributions.beta
,dbldatagen.distributions.data_distribution
,dbldatagen.distributions.exponential_distribution
,dbldatagen.distributions.gamma
, anddbldatagen.distributions.normal_distribution
, have also seen improvements. Furthermore,dbldatagen.function_builder
,dbldatagen.html_utils
,dbldatagen.nrange
,dbldatagen.schema_parser
,dbldatagen.spark_singleton
,dbldatagen.text_generator_plugins
, anddbldatagen.text_generators
have been updated. Thedbldatagen.data_generator
method now includes a warning about the deprecatedsparkContext
in shared clusters, anddbldatagen.schema_parser
includes updates related to thetable_name
argument in various SQL statements. These changes ensure better compatibility and improved functionality of thedbldatagen
package.delta-spark
(#1987). In this release, thedelta-spark
component within thedelta
project has been whitelisted with the inclusion of a new entry in theknown.json
configuration file. This addition brings in several sub-components, includingdelta._typing
,delta.exceptions
, anddelta.tables
, each with ajvm-access-in-shared-clusters
error code and message for unsupported environments. These changes aim to enhance the handling ofdelta-spark
component within thedelta
project. The changes have been rigorously tested and do not introduce new functionality or modify existing behavior. This update is ensured to provide better stability and compatibility to the project. Co-authored by Eric Vergnaud.diffusers
(#2010). A newdiffusers
category has been added to the JSON configuration file, featuring several subcategories and numerous empty arrays as values. This change serves to prepare the configuration for future additions, without altering any existing methods or behaviors. As such, this update does not impact current functionality, but instead, sets the stage for further development. No associated tests or functional changes accompany this modification.faker
(#2014). In this release, thefaker
library in the Databricks project has undergone whitelisting, addressing security concerns, improving performance, and reducing the attack surface. No new methods were added, and the existing functionality remains unchanged. Thorough manual verification of the tests has been conducted. This release introduces various modules and submodules related to thefaker
library, expanding its capabilities in address generation in multiple languages and countries, along with new providers for bank, barcode, color, company, credit_card, currency, date_time, emoji, file, geo, internet, isbn, job, lorem, misc, passport, person, phone_number, profile, python, sbn, ssn, and user_agent generation. Software engineers should find these improvements advantageous for their projects, offering a broader range of options and enhanced performance.fastcluster
(#1980). In this release, the project's configuration has been updated to include thefastcluster
package in the approved libraries whitelist, as part of issue #1901 resolution. This change enables software engineers to utilize the functions and methods provided byfastcluster
in the project's codebase. Thefastcluster
package is now registered in theknown.json
configuration file, and its integration has been thoroughly tested to ensure seamless functionality. By incorporatingfastcluster
, the project's capabilities are expanded, allowing software engineers to benefit from its optimized clustering algorithms and performance enhancements.glow
(#1973). In this release, we have analyzed and added theglow
library and its modules, includingglow._array
,glow._coro
,glow._debug
, and others, to theknown.json
file whitelist. This change allows for seamless integration and usage of theglow
library in your projects. It is important to note that this update does not modify any existing functionality and has been thoroughly tested to ensure compatibility. Software engineers utilizing theglow
library will benefit from this enhancement, as it provides explicit approval for the library and its modules, facilitating a more efficient development process.graphframes
(#1990). In this release, thegraphframes
library has been thoroughly analyzed and the whitelist updated accordingly. This includes the addition of several new entries, such asgraphframes.examples.belief_propagation
,graphframes.examples.graphs
,graphframes.graphframe
,graphframes.lib.aggregate_messages
, andgraphframes.tests
. These changes may require modifications such as rewriting code to use Spark or accessing the Spark Driver JVM. These updates aim to improve compatibility with UC Shared Clusters, ensuring a more seamless integration. Manual testing has been conducted to ensure the changes are functioning as intended.graphviz
(#2008). In this release, we have analyzed and whitelisted thegraphviz
library for use in the project. The library has been added to theknown.json
file, which is used to manage dependencies. Thegraphviz
package contains several modules and sub-modules, includingbackend
,dot
,exceptions
,graphs
,jupyter_integration
,parameters
,rendering
, andsaving
. While we do not have detailed information on the functionality provided by these modules at this time, they have been manually tested for correct functioning. This addition enhances the project's graphing and visualization capabilities by incorporating the well-regardedgraphviz
library.hyperopt
(#1970). In this release, we have made changes to include thehyperopt
library in our project, addressing issue #1901. This integration does not introduce any new methods or modify existing functionality, and has been manually tested. Thehyperopt
package now includes several new modules, such ashyperopt.algobase
,hyperopt.anneal
,hyperopt.atpe
, and many others, encompassing various components like classes, functions, and tests. Notably, some of these modules support integration with Spark and MongoDB. Theknown.json
file has also been updated to reflect these additions.ipywidgets
(#1972). A new commit has been added to whitelist theipywidgets
package, enabling its usage within our open-source library. No new functionality or changes have been introduced in this commit. The package has undergone manual testing to ensure proper functionality. The primary modification involves addingipywidgets
to theknown.json
file whitelist, which includes various modules and sub-modules used for testing, IPython interaction, handling dates and times, and managing widget outputs. This update simply permits the utilization of theipywidgets
package and its related modules and sub-modules.johnsnowlabs
(#1997). Thejohnsnowlabs
package, used for natural language processing and machine learning tasks, has been added to the whitelist in this release. This package includes various modules and sub-packages, such as auto_install, finance, frameworks, johnsnowlabs, lab, legal, llm, medical, nlp, py_models, serve, settings, utils, and visual, which provide a range of classes and functions for working with data and models in the context of NLP and machine learning. Note that this commit also raises deprecation warnings related to file system paths and access to the Spark Driver JVM in shared clusters, indicating potential compatibility issues or limitations; however, the exact impact or scope of these issues cannot be determined from the provided commit message.langchain
(#1975). In this release, thelangchain
module has been added to the JSON file and whitelisted for use. This module encompasses a variety of sub-modules, such as '_api', '_api.deprecation', '_api.interactive_env', and '_api.module_import', among others. Additionally, there are sub-modules related to adapters for various services, including 'openai', 'amadeus', 'azure_cognitive_services', 'conversational_retrieval', and 'clickup'. Theconversational_retrieval
sub-module contains a toolkit for openai functions and a standalone tool. However, specific changes, functionality details, and testing information have not been provided in the commit message. As a software engineer, please refer to the documentation and testing framework for further details.lifelines
(#2006). In this release, we have whitelisted thelifelines
package, a powerful Python library for survival analysis and hazard rate estimation. This addition brings a comprehensive suite of functionalities, such as data sets, exceptions, utilities, version checking, statistical calculations, and plotting tools. Thefitters
category is particularly noteworthy, providing numerous classes for fitting various survival models, including Aalen's Additive Fitter, Cox proportional hazards models, Exponential Fitter, Generalized Gamma Fitter, Kaplan-Meier Fitter, Log-Logistic Fitter, Log-Normal Fitter, Mixture Cure Fitter, Nelson-Aalen Fitter, Piecewise Exponential Fitter, and Weibull Fitter. By whitelisting this library, users can now leverage its capabilities to enhance their projects with advanced survival analysis features.megatron
(#1982). In this release, we have made updates to theknown.json
file to include the whitelisting of themegatron
module. While there are no new functional changes or accompanying tests for this update, it is important to note the addition of new keys to theknown.json
file, which is used to specify approved modules and functions in the codebase. The added keys formegatron
includemegatron.io
,megatron.layers
,megatron.nodes
,megatron.utils
, andmegatron.visuals
. These additions will ensure that any code referencing these modules or functions will not be flagged as unknown or unapproved, promoting a consistent and manageable codebase. This update is particularly useful in larger projects where keeping track of approved modules and functions can be challenging. For more information, please refer to linked issue #1901.numba
(#1978). In this release, we have added Numba, a just-in-time compiler for Python, to our project's whitelist. This addition is reflected in the updated JSON file that maps package names to package versions, which now includes various Numba modules such as 'numba.core', 'numba.cuda', and 'numba.np', along with their respective submodules and functions. Numba is now available for import and will be used in the project, enhancing the performance of our Python code. The new entries in the JSON file have been manually verified, and no changes to existing functionality have been made.omegaconf
(#1992). This commit introducesomegaconf
, a configuration library that provides a simple and flexible way to manage application configurations, to the project's whitelist, which was reviewed and approved by Eric Vergnaud. The addition ofomegaconf
and its various modules, including base, base container, dict config, error handling, grammar, list config, nodes, resolver, opaque container, and versioning modules, as well as plugins forpydevd
, enables the project to utilize this library for configuration management. No existing functionality is affected, and no new methods have been added. This change is limited to the addition ofomegaconf
to the whitelist and the inclusion of its modules, and it has been manually tested. Overall, this change allows the project to leverage theomegaconf
library to enhance the management of application configurations.patool
(#1988). In this release, we have made changes to thesrc/databricks/labs/ucx/source_code/known.json
file by whitelistingpatool
. This change, related to issue #1901, does not introduce any new functionality but adds an entry forpatool
along with several new keys corresponding to various utilities and programs associated with it. The whitelisting process has been carried out manually, and the changes have been thoroughly tested to ensure their proper functioning. This update is targeted towards software engineers seeking to enhance their understanding of the library's modifications. Co-authored by Eric Vergnaud.peft
(#1994). In this release, we've added thepeft
key and its associated modules to the 'known.json' file located in the 'databricks/labs/ucx/source_code' directory. Thepeft
module includes several sub-modules, such as 'peft.auto', 'peft.config', 'peft.helpers', 'peft.import_utils', 'peft.mapping', 'peft.mixed_model', 'peft.peft_model', and 'peft.tuners', among others. The 'peft.tuners' module implements various tuning strategies for machine learning models and includes sub-modules like 'peft.tuners.adalora', 'peft.tuners.adaption_prompt', 'peft.tuners.boft', 'peft.tuners.ia3', 'peft.tuners.ln_tuning', 'peft.tuners.loha', 'peft.tuners.lokr', 'peft.tuners.lora', 'peft.tuners.multitask_prompt_tuning', 'peft.tuners.oft', 'peft.tuners.p_tuning', 'peft.tuners.poly', 'peft.tuners.prefix_tuning', 'peft.tuners.prompt_tuning', 'peft.tuners.vera', and 'peft.utils', which contains several utility functions. This addition provides new functionalities for machine learning model tuning and utility functions to the project.seaborn
(#1977). In this release, the open-source library's dependency whitelist has been updated to include 'seaborn'. This enables the library to utilize seaborn in the project. Furthermore, several Azure libraries such as azure-cosmos and azure-storage-blob have been updated to their latest versions. Additionally, numerous other libraries such as 'certifi', 'cffi', 'charset-normalizer', 'idna', 'numpy', 'pandas', 'pycparser', 'pyOpenSSL', 'python-dateutil', 'pytz', 'requests', 'six', and urllib3 have also been updated to their latest versions. However, issue #1901 is still a work in progress and does not include any specific functional changes or tests in this release.
shap
(#1993). A new commit by Eric Vergnaud has been added to the project, whitelisting the Shap library for use. Shap is an open-source library that provides explanations for the output of machine learning models. This commit integrates several of Shap's modules into our project, enabling their import without any warnings. The inclusion of these modules does not affect existing functionalities, ensuring a smooth and stable user experience. This update enhances our project's capabilities by providing a more comprehensive explanation of machine learning model outputs, thanks to the integration of the Shap library.
sklearn
(#1979). In this release, we have added sklearn to the whitelist in the known.json file as part of issue #190
sktime
(#2007). In this release, we've expanded our machine learning capabilities by adding the sktime library to our whitelist. Sktime is a library specifically designed for machine learning on time series data, and includes components for preprocessing, modeling, and evaluation. This addition includes a variety of directories and modules related to time series analysis, such as distances and kernels, network architectures, parameter estimation, performance metrics, pipelines, probability distributions, and more. Additionally, we've added tests for many of these modules to ensure proper functionality. Furthermore, we've also added the smmap library to our whitelist, providing a drop-in replacement for the built-in Python file object, which allows random access to files that are too large to fit into memory. These additions will enable our software to handle larger datasets and perform advanced time series analysis.
spark-nlp
(#1981). In this release, the open-source spark-nlp library has been added to the whitelist, enhancing compatibility and accessibility for software engineers. The addition of spark-nlp to the whitelist is a non-functional change, but it is expected to improve the overall integration with other libraries. This change has been thoroughly tested to ensure compatibility and reliability, making it a valuable addition for developers working with this library.
spark-ocr
(#2011). A new open-source library, spark-ocr, has been added to the recognized and supported libraries within the system, following the successful whitelisting in the known.json file. This change, tracking issue #1901, does not introduce new functionality or modify existing features but enables all methods and functionality associated with spark-ocr for usage. The software engineering team has manually tested the integration, ensuring the seamless adoption for engineers incorporating this project. Please note that specific details of the spark-ocr methods are not provided in the commit message. This development benefits software engineers seeking to utilize the spark-ocr library within the project.
tf-quant-finance
library within the project.tf-quant-finance
(#2015). In this release, we are excited to announce the whitelisting of thetf-quant-finance
library, a comprehensive and versatile toolkit for financial modeling and analysis. This open-source library brings a wide range of functionalities to our project, including various numerical methods such as finite difference, integration, and interpolation, as well as modules for financial instruments, pricing platforms, stochastic volatility models, and rate curves. The library also includes modules for mathematical functions, optimization, and root search, enhancing our capabilities in these areas. Furthermore,tf-quant-finance
provides a variety of finance models, such as Cox-Ingersoll-Ross (CIR), Heston, Hull-White, SABR, and more, expanding our repertoire of financial models. Lastly, the library includes modules for rates, such as constant forward, Hagan-West, and Nelson-Siegel-Svensson models, providing more options for rate modeling. We believe that this addition will significantly enhance our project's capabilities and enable us to tackle more complex financial modeling tasks with ease.trl
(#1998). In this release, we have integrated the trl library into our project, which is a tool for training, running, and logging AI models. This inclusion is aimed at addressing issue #1901. The trl library has been whitelisted in the known.json file, resulting in extensive changes to the file. While no new functionality has been introduced in this commit, the trl library provides various methods for running and training models, as well as utilities for CLI scripts and environment setup. These changes have been manually tested by our team, including Eric Vergnaud. We encourage software engineers to explore the new library and use it to enhance the project's capabilities.
unstructured
(#2013). This release includes the addition of new test cases for various modules and methods within the unstructured library, such as chunking, cleaners, documents, embed, file_utils, metrics, nlp, partition, staging, and unit_utils. The test cases cover a range of functionalities, including HTML and PDF parsing, text extraction, embedding, file conversion, and encoding detection. The goal is to improve the library's overall robustness and reliability by increasing test coverage for different components.
READ_METADATA
privilege for tables and schemas during migration from hive_metastore to UC. This change omits the translation of the READ_METADATA privilege to the BROWSE privilege on UC tables and schemas, because UC supports the BROWSE privilege only on catalog objects. Without this change, error messages would appear in the migrate tables workflow logs, causing confusion for users. Relevant code modifications have been made in the uc_grant_sql method in the grants.py file, where lines for TABLE and DATABASE with the READ_METADATA privilege have been removed. Additionally, tests have been updated in the test_grants.py file to reflect these changes, avoiding the granting of unsupported privileges and preventing user confusion.
make_dbfs_data_copy
fixture in our open-source library has been updated to address an unused variable issue related to the _ variable, which was previously assigned the value of make_cluster but was not utilized in the fixture. This change was implemented on April 16th, and it was only recently identified by make fmt. Additionally, the fixture now includes an if statement that initializes a CommandExecutor object to execute commands on the cluster if the workspace configuration is on AWS. These updates improve the code's readability and maintainability, ensuring that it functions optimally for our software engineer users.
python-lsp-server
is installed. This integration allows users to utilize code linters without any additional setup, improving the code linter functionality of UCX by enabling it to be used as an LSP plugin and providing separate linters and fixers for Python and SQL. The changes include a new Failure class, an updated Deprecation class, and a pylsp_lint function implemented using the pylsp library to lint the code. The LinterContext and Diagnostic classes have been imported, and the pylsp_lint function takes in a Workspace and Document object. The associated tests have been updated, including manual testing, unit tests, and tests on the staging environment. The new feature also includes methods to lint code for use in UC Shared Clusters and return diagnostic information about any issues found, which can serve as a guide for users to rewrite their code as needed.
grants
function in the grants.py file, addressing issues with grant visibility and classification in the underlying inventory. The _crawl function has been updated to distinguish between tables and views, and a new dictionary, _grants_reported_as, has been added to map reported object types for grants to their actual types. The grants function now includes a modification to normalize object types using the new dictionary. The assessment workflow and the grant_detail view have also been modified. The changes to the grants function may affect grant classification and display, and it is recommended to review relevant user documentation for accuracy. Additionally, tests have been conducted to ensure functionality, including unit tests, integration tests, and manual testing. No new methods have been added, but existing functionality in the _crawl method in the tables.py file has been changed.
assesment.crawl_groups
and migrate-groups workflows. The groups.py file has been modified to include changes to the GroupMigrationStrategy classes, such as the addition of workspace_group_regex and account_group_regex attributes, and their compiled versions. The __init__ method for RegexSubStrategy and RegexMatchStrategy now takes these regex arguments. The _safe_match method now takes a regex pattern instead of a string, and the _safe_sub method takes a compiled regex pattern and replacement string as arguments. The ConfigureGroups class includes a new _valid_substitute_pattern attribute and an updated _is_valid_substitute_str method to validate the substitution string. The new RegexSubStrategy method replaces the name of the group in the workspace with an empty string when matched by the specified regex. Unit tests and manual testing have been conducted to ensure the correct functionality of these changes.
ManyError
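As a rough illustration of the regex-based group matching described in the entry above, the sketch below uses only Python's standard re module. The function names safe_sub and safe_match are hypothetical stand-ins loosely modelled on the _safe_sub and _safe_match methods named in the notes; their bodies are assumptions, not the actual UCX implementation.

```python
import re


def safe_sub(name: str, pattern: re.Pattern, replacement: str) -> str:
    """Replace every match of a compiled pattern in a group name.

    Falls back to the unchanged name if the replacement string is invalid
    (for example, it references a capture group that does not exist).
    """
    try:
        return pattern.sub(replacement, name)
    except (re.error, IndexError):
        return name


def safe_match(name: str, pattern: re.Pattern) -> str:
    """Return the part of a group name selected by a compiled pattern.

    Assumption for this sketch: the first capture group wins if the pattern
    defines one, otherwise the whole match; no match returns the name as-is.
    """
    match = pattern.search(name)
    if match is None:
        return name
    groups = match.groups()
    if groups and groups[0] is not None:
        return groups[0]
    return match.group(0)
```

Substituting a matched prefix with an empty string, as the entry describes, would then look like safe_sub("ws_analysts", re.compile(r"^ws_"), "").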
exception at the end. The information about successful and failed groups is currently only logged, not persisted. The group-migration workflow now includes a new class, ManyError, and a new method, apply_permissions, in the PermissionsMigrationAPI class, handling the migration of permissions for a group and raising a ManyError exception if necessary. The commit also includes modified unit tests to ensure the proper functioning of the updated workflow. These changes aim to improve the robustness and reliability of the group migration process by allowing it to continue in the face of errors and by providing better error handling and reporting.
group-migration
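The continue-on-error behaviour described in the entry above can be sketched as follows. This is a minimal illustration under stated assumptions: the ManyError class here is a simplified local stand-in, and migrate_all and its migrate_one callback are hypothetical names, not the PermissionsMigrationAPI.apply_permissions method itself.

```python
from collections.abc import Callable, Iterable


class ManyError(RuntimeError):
    """Simplified stand-in for an aggregate exception (not the real class)."""

    def __init__(self, errors: list):
        self.errors = errors
        super().__init__(f"{len(errors)} group(s) failed: {errors}")


def migrate_all(groups: Iterable[str], migrate_one: Callable[[str], None]) -> list:
    """Attempt every group, remember failures, and raise once at the end."""
    succeeded = []
    errors = []
    for group in groups:
        try:
            migrate_one(group)
            succeeded.append(group)
        except Exception as err:
            # Keep going so that one failing group does not block the rest.
            errors.append(err)
    if errors:
        raise ManyError(errors)
    return succeeded
```

The design choice is that a failure is recorded rather than re-raised immediately, so every group is attempted before the caller sees a single aggregate exception.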
workflow in databricks/labs/ucx/workspace_access/groups.py to ensure that group renaming is completed before the task is marked as done. This change was made to address the issue of eventual consistency in group renaming, which could cause downstream tasks to encounter problems. We have added unit tests for various scenarios, including the snapshot_with_group_created_in_account_console_should_be_considered, rename_groups_should_patch_eligible_groups, rename_groups_should_wait_for_renames_to_complete, rename_groups_should_retry_on_internal_error, and rename_groups_should_fail_if_unknown_name_observed cases. The rename_groups_should_wait_for_renames_to_complete test uses a mock time.sleep function to simulate the passage of time and verifies that the group renaming operation waits for the rename to be detected. Additionally, the rename_groups_should_retry_on_internal_error test uses a mock WorkspaceClient object to simulate an internal error and verifies that the group renaming operation retries the failed operation. The rename_groups_should_fail_if_unknown_name_observed test simulates a situation where a concurrent process is interfering with the group renaming operation and verifies that the operation fails immediately instead of waiting for a timeout to occur. These updates are crucial for ensuring the reliability and consistency of group renaming operations in our workflow.
%pip
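The wait-for-rename behaviour described in the entry above could look roughly like the polling loop below. It is a sketch under assumptions: wait_for_renames, its parameters, and the fetch_names callback are hypothetical, and the real workflow talks to the workspace API rather than to an injected function. The sleep function is injectable so that a test can replace time.sleep, in the spirit of the mock described in the notes.

```python
import time
from collections.abc import Callable


def wait_for_renames(
    renames: dict,
    fetch_names: Callable[[], dict],
    *,
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until each group id in `renames` reports its new name.

    `renames` maps group id -> (old_name, new_name).
    `fetch_names` returns the currently visible group id -> name mapping.
    """
    deadline = time.monotonic() + timeout
    pending = dict(renames)
    while pending:
        current = fetch_names()
        for group_id, (old_name, new_name) in list(pending.items()):
            observed = current.get(group_id)
            if observed == new_name:
                del pending[group_id]  # rename is now visible
            elif observed != old_name:
                # Neither old nor new name: something else is interfering,
                # so fail immediately instead of waiting for the timeout.
                raise RuntimeError(f"unexpected name for {group_id}: {observed!r}")
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"renames not visible yet: {sorted(pending)}")
        sleep(interval)
```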
and !pip, by improving parsing and execution of cells containing magic lines and ensuring proper pip dependency handling. It includes changes to existing commands, workflows, and the addition of new ones, as well as a new table and classes such as DependencyProblem and MagicCommand. The PipCell class has been updated to PythonCell. New methods build_dependency_graph and convert_magic_lines_to_magic_commands have been added, and several tests have been updated and added to ensure functionality. The changes have been unit and integration tested and manually verified on the staging environment.
DENY
grants during assessment (#1903). This pull request introduces support for flagging DENY permissions on objects that cannot be migrated to Unity Catalog (UC). It includes modifications to the grant_detail view and adds new integration tests for existing grant-scanning, resolving issue #1869 and superseding #1890. A new column, failures, has been added to the grant_detail view to indicate explicit DENY privileges that are not supported in UC. The assessment workflow has been updated to include a new step that identifies incompatible object privileges, while new and existing methods have been updated to support flagging DENY permissions. The changes have been documented for users, and the assessment workflow and related SQL queries have been updated accordingly. The PR also clarifies that no new CLI command has been added, and no existing commands or tables have been modified. Tests have been conducted manually and integration tests have been added to ensure the changes work as expected.
dbutils.widgets.get
calls. The linter_context_factory method now includes a new parameter, session_state, which defaults to None. The LocalFileMigrator and LocalCodeLinter classes use a lambda function to call linter_context_factory with the session_state parameter, and the DependencyGraph class includes a new method, CurrentSessionState, to extract SysPathChange from the tree. The get_notebook_paths method now accepts a CurrentSessionState parameter, and the build_local_file_dependency_graph method has been updated to accept this parameter as well. These changes enhance the flexibility of the linter context and improve the accuracy of dbutils.widgets.get value inference.
cannot be computed
advices. The changes include the addition of new classes PythonLinter and PythonSequentialLinter, as well as the modification of the Fixer class to accept a list of Linter instances as input. The updated linter takes into account not only the code from the current cell but also the code from previous cells, improving value inference and accuracy during linting. The changes have been manually tested and accompanied by added unit tests. This feature progresses issues #1912 and #1205.
job_problems
is now calculated after flattening the list, resulting in a more precise count. This improvement enhances the reliability of the linting process, ensuring that users are informed of the correct number of issues present in their code.
normalize_and_parse
method in the Tree class, which first normalizes the code by removing illegal leading spaces and then parses it. This change improves the code linter's ability to handle previously unparseable code and does not affect functionality. New unit tests have been added to ensure correctness, and modifications to the PythonCell and PipMagic classes enhance processing and handling of multiline code, magic commands, and pip commands. The pull request also includes a new test to check if the normalization process ignores magic markers in multiline comments, improving the reliability of parsing and linting copy-pasted Python code.
databricks labs install ucx
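The normalize-then-parse idea from the entry above can be illustrated with the standard library alone. This sketch assumes textwrap.dedent and ast.parse as stand-ins; the function below is a hypothetical, simplified analogue and does not reproduce the Tree class or its handling of magic commands.

```python
import ast
import textwrap


def normalize_and_parse(code: str) -> ast.Module:
    """Strip the indentation shared by all lines, then parse the result.

    Copy-pasted snippets often carry a uniform leading indent, which
    makes ast.parse fail with an IndentationError if left in place.
    """
    normalized = textwrap.dedent(code)
    return ast.parse(normalized)


# A snippet copied from inside a function body: every line is indented.
pasted = "    x = 1\n    y = x + 1\n"
tree = normalize_and_parse(pasted)
```

Parsing the pasted string directly with ast.parse would raise an IndentationError, which is the class of failure that normalization is meant to avoid.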
command has been updated to prompt the user early on to join a collection of UCX installs. Users who are not account admins can now enter their workspace ID to join as a collection, or skip joining if they prefer. This change includes modifications to the join_collection method to include a prompt message and handle cases where the user is not an account admin. A PermissionDenied exception has been added for users who do not have account admin permissions and cannot list workspaces. This change was made to streamline the installation process and reduce potential confusion for users. Additionally, tests have been conducted, both manually and through existing unit tests, to ensure the proper functioning of the updated command. This modification was co-authored by Serge Smertin and is intended to improve the overall user experience.
refresh_report
method in jobs.py has been updated to raise lint errors after persisting workflow problems in the inventory database. This change includes adding a new import statement for ManyError and modifying the existing import statement for Threads from databricks.labs.blueprint.parallel. The call to Threads.strict has been replaced with a call to Threads.gather with the argument 'linting workflows'. The problems list has been replaced with a job_problems, errors tuple, and the job_problems list is flattened using itertools.chain before writing it to the inventory database. If there are any errors during the execution of tasks, a ManyError exception is raised with the list of errors. This development helps to visualize known workflow problems by raising lint errors after persisting them in the inventory database, addressing issue #1952, and has been manually tested for accuracy.
/Applications/ucx
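The persist-first, raise-afterwards ordering described in the entry above is sketched below with plain Python. The gather function is a hypothetical stand-in that mimics a runner returning a (results, errors) tuple, LintErrors is an illustrative exception, and persist stands for whatever writes to the inventory database; none of these are the actual UCX or blueprint APIs.

```python
import itertools
from collections.abc import Callable


class LintErrors(Exception):
    """Illustrative aggregate exception carrying every task failure."""

    def __init__(self, errors: list):
        self.errors = errors
        super().__init__(f"{len(errors)} task(s) failed")


def gather(tasks: list) -> tuple:
    """Run every task and return (results, errors) instead of raising."""
    results, errors = [], []
    for task in tasks:
        try:
            results.append(task())
        except Exception as err:
            errors.append(err)
    return results, errors


def refresh(tasks: list, persist: Callable[[list], None]) -> int:
    """Flatten per-job problem lists, persist them, then report failures."""
    job_problems, errors = gather(tasks)
    # Each task returns a list of problems, so flatten before counting or saving.
    flat = list(itertools.chain.from_iterable(job_problems))
    persist(flat)
    if errors:
        # Raised only after persisting, so known problems are not lost.
        raise LintErrors(errors)
    return len(flat)
```

Counting after flattening gives the number of problems, whereas counting the unflattened list would give the number of jobs.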
directory, which is a change from the previous location of /Users/<your user>/.ucx/. This update simplifies the installation process and enhances the user experience. Software engineers who are already familiar with UCX and its installation process will benefit from this update. For advanced installation instructions, please refer to the corresponding section in the documentation.
notebook-run-cannot-compute-value
to replace dbutils-notebook-run-dynamic in the _raise_advice_if_unresolved function, providing more accurate and specific information when the path for 'dbutils.notebook.run' cannot be computed. A new advice code table-migrate-cannot-compute-value has been added to indicate that a table name argument cannot be computed during linting. Additionally, the new advice code sys-path-cannot-compute-value is used in the dependency resolver, replacing the previous sys-path-cannot-compute code. These updates lead to more precise and informative error messages, aiding in debugging processes. No new methods have been added, and existing functionality remains unchanged. Unit tests have been executed, and they passed. These improvements target software engineers looking to benefit from more accurate error messages and better guidance for debugging.
sql-query-unsupported-sql
for unsupported SQL queries in the lint function of the queries.py file. This change is aimed at handling unsupported SQL gracefully, providing a more specific error message compared to the previous generic table-migrate advice code. Additionally, an exception for unsupported SQL has been implemented in the linter for DBFS, utilizing a new code 'dbfs-query-unsupported-sql'. This modification is intended to improve the handling of SQL queries that are not currently supported, potentially aiding in better integration with future SQL parsing tools. However, it should be noted that this change has not been tested.
Failure
advices, with the addition of unit tests and refactoring of the affected code block. A new Failure exception class has been introduced in the databricks.labs.ucx.source_code.base module, which is used when a SQL query cannot be parsed by sqlglot. A change in the behavior of the SQL parser now generates a Failure object instead of silently returning an empty list when sqlglot fails to process a query. This change enhances transparency in error handling and helps developers understand when and why a query has failed to parse. The commit progresses issue #1901 and is co-authored by Eric Vergnaud and Andrew Snare.
Dependency updates: