databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

Error logs on Databricks Runtime 11.3 LTS do not display correctly. #590

Open Squaess opened 1 year ago

Squaess commented 1 year ago

Expected Behavior

When looking at the Databricks UI, the error message with the stack trace is displayed clearly.

Current Behavior

When running the job on Databricks Runtime 11.3 LTS, the error message in the UI contains ANSI escape characters:

== SQL ==
this table doesn't exist
-----^^^
---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<command--1> in <cell line: 13>()
     12 
     13 with open(filename, "rb") as f:
---> 14   exec(compile(f.read(), filename, 'exec'))
     15 

/tmp/tmpxxsbdj9b.py in <module>
      7 
      8 if __name__ == "__main__":
----> 9     entrypoint()

/tmp/tmpxxsbdj9b.py in entrypoint()
      4 def entrypoint():
      5     spark = SparkSession.builder.getOrCreate()
----> 6     spark.table("this table doesn't exist")
      7 
      8 if __name__ == "__main__":

/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---> 48                 res = func(*args, **kwargs)
     49                 logger.log_success(
     50                     module_name, class_name, function_name, time.perf_counter() - start, signature

/databricks/spark/python/pyspark/sql/session.py in table(self, tableName)
   1138         True
   1139         """
-> 1140         return DataFrame(self._jsparkSession.table(tableName), self)
   1141 
   1142     @property

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    200                 # Hide where the exception came from that shows a non-Pythonic
    201                 # JVM exception message.
--> 202                 raise converted from None
    203             else:
    204                 raise

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near 'table'(line 1, pos 5)

== SQL ==
this table doesn't exist
-----^^^
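
For reference, here is an illustrative sketch (not taken from the job output above) of what such escape sequences look like in raw form and how they can be stripped:

import re

# Colored tracebacks wrap text in ANSI SGR sequences, e.g. "\x1b[0;31m" (red)
# and "\x1b[0m" (reset); a UI that does not interpret them shows the raw codes.
colored = "\x1b[0;31mParseException\x1b[0m: [PARSE_SYNTAX_ERROR] ..."

# Matches SGR color sequences: ESC, '[', optional numeric parameters, 'm'.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

print(ANSI_SGR.sub("", colored))  # -> ParseException: [PARSE_SYNTAX_ERROR] ...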

Steps to Reproduce (for bugs)

  1. Create a script that fails and configure it to run on two different Databricks Runtime versions: 11.3 LTS and 10.4 LTS.
  2. Run the dbx deploy command.
  3. Execute the workflow using the UI.
  4. Observe how the error message is displayed.

Context

I've noticed a problem with how the error message is shown on Databricks Runtime 11.3 LTS. To verify this, here is an example setup:

Parts of the deployment file:

custom:
  cluster-11-3: &cluster-11-3
    new_cluster:
      spark_version: "11.3.x-scala2.12"
      num_workers: 1
      node_type_id: "i3.xlarge"
      aws_attributes:
        ...[REDACTED]...

  cluster-10-4: &cluster-10-4
    new_cluster:
      spark_version: "10.4.x-scala2.12"
      num_workers: 1
      node_type_id: "i3.xlarge"
      aws_attributes:
        ...[REDACTED]...

build:
  no_build: true

environments:
  default:
    workflows:
      - name: "run-python-task"
        tasks:
          - task_key: "run-11-3"
            <<: *cluster-11-3
            spark_python_task:
              python_file: "file://cicd_sample_project/main.py"
              parameters: []
          - task_key: "run-10-4"
            <<: *cluster-10-4
            spark_python_task:
              python_file: "file://cicd_sample_project/main.py"
              parameters: []

Content of the cicd_sample_project/main.py file:

from pyspark.sql import SparkSession

def entrypoint():
    spark = SparkSession.builder.getOrCreate()
    spark.table("this table doesn't exist")

if __name__ == "__main__":
    entrypoint()
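
For context, the failure itself is expected: spark.table() parses its argument as a table identifier, so the spaces make Spark's SQL parser stop at the second word ('table', position 5), which matches the == SQL == snippet in the traceback above. A minimal local sketch contrasting the two failure modes (not part of the report; assumes a local PySpark installation):

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException, ParseException

spark = SparkSession.builder.master("local[1]").getOrCreate()

try:
    spark.table("this table doesn't exist")  # spaces break identifier parsing
except ParseException as e:
    print("ParseException:", e)

try:
    spark.table("missing_table")  # valid identifier, but no such table
except AnalysisException as e:
    print("AnalysisException:", e)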

setup.py file:

"""
This file configures the Python package with entrypoints used for future runs on Databricks.

Please follow the `entry_points` documentation for more details on how to configure the entrypoint:
* https://setuptools.pypa.io/en/latest/userguide/entry_point.html
"""

from setuptools import find_packages, setup
from cicd_sample_project import __version__

PACKAGE_REQUIREMENTS = ["pyyaml"]

# packages for local development and unit testing
# (these are already provided by the Databricks Runtime and do not need to be installed there)
LOCAL_REQUIREMENTS = [
    "pyspark==3.2.1",
    "delta-spark==1.1.0",
]

TEST_REQUIREMENTS = [
    # development & testing tools
    "dbx>=0.8,<0.9"
]

setup(
    name="cicd_sample_project",
    packages=find_packages(exclude=["tests", "tests.*"]),
    setup_requires=["setuptools","wheel"],
    install_requires=PACKAGE_REQUIREMENTS,
    extras_require={"local": LOCAL_REQUIREMENTS, "test": TEST_REQUIREMENTS},
    entry_points={
        "console_scripts": [],
    },
    version=__version__,
    description="",
    author="",
)

Your Environment

renardeinside commented 1 year ago

hi @Squaess, thanks a lot for opening the issue. I'll try to repro it and see what causes it.

goldstein0101 commented 1 year ago

Same thing happens on DBR 11.0 and 11.1 ML.
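
The <cell line: 13> frame in the traceback suggests the file is executed through an IPython-style runner, which colors tracebacks with ANSI escape codes by default. A possible workaround sketch, assuming that is where the codes come from (a guess, not a fix confirmed in this thread), is to disable IPython's coloring at the start of the job:

try:
    from IPython import get_ipython
except ImportError:
    get_ipython = lambda: None  # plain CPython: nothing to disable

ip = get_ipython()
if ip is not None:
    # Equivalent to running the %colors NoColor magic in a notebook cell.
    ip.run_line_magic("colors", "NoColor")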