Nike-Inc / brickflow

Pythonic Programming Framework to orchestrate jobs in Databricks Workflow
https://engineering.nike.com/brickflow/
Apache License 2.0

[BUG] Cluster `policy_id` converted to scientific notation #106

Closed maxim-mityutko closed 4 months ago

maxim-mityutko commented 4 months ago

Describe the bug Brickflow generates a bundle.yaml file that is used to create Terraform manifests. A YAML loader derives each data type from the content of the value, unless the value is wrapped in quotation marks, which forces it to be interpreted as a string.

My team was assigned a cluster policy_id that can be interpreted as a number in scientific notation: 0016792308389E31. When we try to deploy a workflow with this policy to Databricks, the following error is returned:

Error: cannot create job: '1.6792308389e+41' is not a valid cluster policy ID.

Technically, policy_id is not the only field susceptible to this error: any string value dumped into YAML that can be parsed as a number in scientific notation is affected.
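To make the criterion concrete, here is a minimal sketch that flags strings a loader could read as a number. The source does not say which YAML loader the deployment tooling uses, so the pattern below (the float pattern from the YAML 1.2 core schema) is an assumption, and `looks_numeric` is a hypothetical helper name, not part of Brickflow.

```python
import re

# Assumption: the consuming loader resolves plain scalars roughly like the
# YAML 1.2 core schema, whose float pattern also accepts bare integers.
_NUMERIC_RE = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?")


def looks_numeric(value: str) -> bool:
    """Return True if an unquoted `value` could be parsed as a number."""
    return _NUMERIC_RE.fullmatch(value) is not None


policy_id = "0016792308389E31"

print(looks_numeric(policy_id))     # digits, then 'E', then digits
print(looks_numeric("policy-abc"))  # contains non-numeric characters
print(repr(float(policy_id)))       # the float such a loader would produce
```

The last line prints the same number that appears in the error message, which is consistent with the ID having been parsed as a float somewhere between bundle.yaml and the Databricks API.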

To Reproduce Steps to reproduce the behavior:

  1. Create a workflow whose cluster policy_id can be read as a number in scientific notation.
  2. Try local deployment

Expected behavior String values in bundle.yaml are wrapped in quotation marks so that they are always read as strings.

Additional context The culprit of the problem is: https://github.com/Nike-Inc/brickflow/blob/b07ebfb88517d03dc0191cf587a84b7e6e54b82d/brickflow/codegen/databricks_bundle.py#L669-L670

We tried adding leading / trailing spaces to the policy_id, and extra quotation marks with / without escape characters, to enforce string behaviour, but it did not help, because DBX then interprets the value as a completely different policy.

The solution is to force yaml.dump to add quotation marks while exporting the data.

Consider the example:

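import yaml
from pydantic import BaseModel


class TestModel(BaseModel):
    policy_id: str
    number: int


t = TestModel(
    policy_id="0016792308389E31",
    number=10
)

# Default dump: the string value is written without quotes
print(yaml.dump(t.dict()))


def quoted_presenter(dumper, data):
    # Emit every str scalar in double-quoted style
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')


# Register the representer for all str values (mapping keys included)
yaml.add_representer(str, quoted_presenter)

# Same data, now dumped with quoted strings
print(yaml.dump(t.dict()))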

And the output before and after applying the custom representer:

number: 10
policy_id: 0016792308389E31

"number": 10
"policy_id": "0016792308389E31"
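One thing to weigh with the snippet above: `yaml.add_representer` changes the default dumper for the whole process. A possible alternative, sketched here as an assumption and not as the fix that was actually merged, is to register the same representer on a dedicated dumper class and pass it explicitly. `QuotedStrDumper` is a hypothetical name.

```python
import yaml


class QuotedStrDumper(yaml.SafeDumper):
    """Hypothetical dumper that double-quotes every string scalar."""


def _quoted_str(dumper, data):
    # Same representer as in the example above
    return dumper.represent_scalar("tag:yaml.org,2002:str", data, style='"')


# Registered on the subclass only, so the default dumper is left untouched
QuotedStrDumper.add_representer(str, _quoted_str)

payload = {"policy_id": "0016792308389E31", "number": 10}

quoted = yaml.dump(payload, Dumper=QuotedStrDumper)
print(quoted)

# Reading the quoted document back yields the original string, not a float
print(yaml.safe_load(quoted))
```

With this variant only the calls that pass `Dumper=QuotedStrDumper` produce quoted strings, which limits the change to the code path that writes bundle.yaml.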