JenspederM / kedro-databricks

A Databricks Plugin for Kedro
MIT License

`kedro databricks init` will break an existing `conf/base/databricks.yml` file #2

Closed by astrojuanlu 4 months ago

astrojuanlu commented 4 months ago

I was following the steps, and after `kedro databricks init`, `kedro databricks bundle` failed:

$ kedro new --starter=databricks-iris && cd ...
$ kedro databricks init
$ kedro databricks bundle
...
ConstructorError: while constructing a mapping
  in "<file>", line 1, column 1
found duplicate key default
  in "<file>", line 13, column 1

It turns out that PyYAML can read the databricks.yml file just fine:

In [9]: import yaml

In [10]: with open("conf/base/databricks.yml") as fh:
    ...:     conf = yaml.load(fh, Loader=yaml.SafeLoader)
    ...: 

In [11]: print(conf)
{'default': {'job_clusters': [{'job_cluster_key': 'default', 'new_cluster': {'spark_version': '14.3.x-scala2.12', 'node_type_id': 'Standard_D4ds_v4', 'num_workers': 1, 'spark_env_vars': {'KEDRO_LOGGING_CONFIG': '/dbfs/FileStore/test_databricks_iris/conf/logging.yml'}}}], 'tasks': [{'task_key': 'default', 'job_cluster_key': 'default'}]}}
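
Note that the parsed result contains only one `default` entry even though the file defines it twice: PyYAML's SafeLoader silently keeps the last occurrence of a duplicate key instead of raising an error. A minimal check, independent of this project:

>>> import yaml
>>> yaml.safe_load("key: 1\nkey: 2")
{'key': 2}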

but OmegaConf cannot:

In [12]: from omegaconf import OmegaConf

In [13]: OmegaConf.load("conf/base/databricks.yml")
---------------------------------------------------------------------------
ConstructorError                          Traceback (most recent call last)
Cell In[13], line 1
----> 1 OmegaConf.load("conf/base/databricks.yml")

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/omegaconf/omegaconf.py:190, in OmegaConf.load(file_)
    188 if isinstance(file_, (str, pathlib.Path)):
    189     with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
--> 190         obj = yaml.load(f, Loader=get_yaml_loader())
    191 elif getattr(file_, "read", None):
    192     obj = yaml.load(file_, Loader=get_yaml_loader())

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/yaml/__init__.py:81, in load(stream, Loader)
     79 loader = Loader(stream)
     80 try:
---> 81     return loader.get_single_data()
     82 finally:
     83     loader.dispose()

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/yaml/constructor.py:51, in BaseConstructor.get_single_data(self)
     49 node = self.get_single_node()
     50 if node is not None:
---> 51     return self.construct_document(node)
     52 return None

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/yaml/constructor.py:60, in BaseConstructor.construct_document(self, node)
     58     self.state_generators = []
     59     for generator in state_generators:
---> 60         for dummy in generator:
     61             pass
     62 self.constructed_objects = {}

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/yaml/constructor.py:413, in SafeConstructor.construct_yaml_map(self, node)
    411 data = {}
    412 yield data
--> 413 value = self.construct_mapping(node)
    414 data.update(value)

File ~/.micromamba/envs/pyspark/lib/python3.11/site-packages/omegaconf/_utils.py:144, in get_yaml_loader.<locals>.OmegaConfLoader.construct_mapping(self, node, deep)
    142         continue
    143     if key_node.value in keys:
--> 144         raise yaml.constructor.ConstructorError(
    145             "while constructing a mapping",
    146             node.start_mark,
    147             f"found duplicate key {key_node.value}",
    148             key_node.start_mark,
    149         )
    150     keys.add(key_node.value)
    151 return super().construct_mapping(node, deep=deep)

ConstructorError: while constructing a mapping
  in "/Users/juan_cano/Projects/QuantumBlackLabs/tmp/test-databricks-iris/conf/base/databricks.yml", line 1, column 1
found duplicate key default
  in "/Users/juan_cano/Projects/QuantumBlackLabs/tmp/test-databricks-iris/conf/base/databricks.yml", line 13, column 1

Here's the resulting file:

default:
    job_clusters:
    -   job_cluster_key: default
        new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_D4ds_v4
            num_workers: 1
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/test_databricks_iris/conf/logging.yml
    tasks:
    -   task_key: default
        job_cluster_key: default
default:
    job_clusters:
    -   job_cluster_key: default
        new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_D4ds_v4
            num_workers: 1
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/test_databricks_iris/conf/logging.yml
    tasks:
    -   task_key: default
        job_cluster_key: default
astrojuanlu commented 4 months ago
Environment ``` $ micromamba env export --from-history name: pyspark channels: - conda-forge dependencies: - findspark - ipython - kedro - openjdk - pyspark - python=3.11 $ micromamba env export name: pyspark channels: - conda-forge dependencies: - antlr-python-runtime=4.9.3=pyhd8ed1ab_1 - arrow=1.3.0=pyhd8ed1ab_0 - asttokens=2.4.1=pyhd8ed1ab_0 - attrs=23.2.0=pyh71513ae_0 - aws-c-auth=0.7.22=haa5a189_7 - aws-c-cal=0.7.0=h94d0942_0 - aws-c-common=0.9.23=h99b78c6_0 - aws-c-compression=0.2.18=h94d0942_7 - aws-c-event-stream=0.4.2=he43e89f_14 - aws-c-http=0.8.2=h638bd1a_4 - aws-c-io=0.14.9=hf0e3e08_5 - aws-c-mqtt=0.10.4=h80c1ce3_7 - aws-c-s3=0.6.0=h11f64d3_0 - aws-c-sdkutils=0.1.16=h94d0942_3 - aws-checksums=0.1.18=h94d0942_7 - aws-crt-cpp=0.27.2=hd9c8ee4_0 - aws-sdk-cpp=1.11.329=haf867cf_8 - azure-core-cpp=1.12.0=hd01fc5c_0 - azure-identity-cpp=1.8.0=h0a11218_1 - azure-storage-blobs-cpp=12.11.0=h77cc766_1 - azure-storage-common-cpp=12.6.0=h7024f69_1 - azure-storage-files-datalake-cpp=12.10.0=h64d02d0_1 - binaryornot=0.4.4=py_1 - brotli-python=1.1.0=py311ha891d26_1 - bzip2=1.0.8=h93a5062_5 - c-ares=1.28.1=h93a5062_0 - ca-certificates=2024.7.4=hf0a4a13_0 - cachetools=5.3.3=pyhd8ed1ab_0 - certifi=2024.7.4=pyhd8ed1ab_0 - cffi=1.16.0=py311h4a08483_0 - chardet=5.2.0=py311h267d04e_1 - charset-normalizer=3.3.2=pyhd8ed1ab_0 - click=8.1.7=unix_pyh707e725_0 - colorama=0.4.6=pyhd8ed1ab_0 - cookiecutter=2.6.0=pyhca7485f_0 - decorator=5.1.1=pyhd8ed1ab_0 - dynaconf=3.2.5=pyhd8ed1ab_0 - exceptiongroup=1.2.0=pyhd8ed1ab_2 - executing=2.0.1=pyhd8ed1ab_0 - findspark=2.0.1=pyhd8ed1ab_0 - fsspec=2024.6.1=pyhff2d567_0 - gflags=2.2.2=hc88da5d_1004 - gitdb=4.0.11=pyhd8ed1ab_0 - gitpython=3.1.43=pyhd8ed1ab_0 - glog=0.7.1=heb240a5_0 - graphlib-backport=1.0.3=pyhd8ed1ab_0 - h2=4.1.0=pyhd8ed1ab_0 - hpack=4.0.0=pyh9f0ad1d_0 - hyperframe=6.0.1=pyhd8ed1ab_0 - icu=73.2=hc8870d7_0 - idna=3.7=pyhd8ed1ab_0 - importlib-metadata=7.2.1=pyha770c72_0 - importlib-resources=6.4.0=pyhd8ed1ab_0 - importlib_resources=6.4.0=pyhd8ed1ab_0 - ipython=8.26.0=pyh707e725_0 - jedi=0.19.1=pyhd8ed1ab_0 - jinja2=3.1.4=pyhd8ed1ab_0 - kedro=0.19.6=pyhd8ed1ab_0 - kedro-datasets=4.0.0=pyhd8ed1ab_0 - krb5=1.21.3=h237132a_0 - lazy_loader=0.4=pyhd8ed1ab_0 - libabseil=20240116.2=cxx17_hebf3989_0 - libarrow=16.1.0=h71e69af_13_cpu - libarrow-acero=16.1.0=h00cdb27_13_cpu - libarrow-dataset=16.1.0=h00cdb27_13_cpu - libarrow-substrait=16.1.0=hc68f6b8_13_cpu - libblas=3.9.0=22_osxarm64_openblas - libbrotlicommon=1.1.0=hb547adb_1 - libbrotlidec=1.1.0=hb547adb_1 - libbrotlienc=1.1.0=hb547adb_1 - libcblas=3.9.0=22_osxarm64_openblas - libcrc32c=1.1.2=hbdafb3b_0 - libcurl=8.8.0=h7b6f9a7_1 - libcxx=17.0.6=h0812c0d_3 - libedit=3.1.20191231=hc8eb9b7_2 - libev=4.33=h93a5062_2 - libevent=2.1.12=h2757513_1 - libexpat=2.6.2=hebf3989_0 - libffi=3.4.2=h3422bc3_5 - libgfortran=5.0.0=13_2_0_hd922786_3 - libgfortran5=13.2.0=hf226fd6_3 - libgoogle-cloud=2.26.0=hfe08963_0 - libgoogle-cloud-storage=2.26.0=h1466eeb_0 - libgrpc=1.62.2=h9c18a4f_0 - libiconv=1.17=h0d3ecfb_2 - liblapack=3.9.0=22_osxarm64_openblas - libnghttp2=1.58.0=ha4dd798_1 - libopenblas=0.3.27=openmp_h517c56d_1 - libparquet=16.1.0=hcf52c46_13_cpu - libprotobuf=4.25.3=hbfab5d5_0 - libre2-11=2023.09.01=h7b2c953_2 - libsqlite=3.46.0=hfb93653_0 - libssh2=1.11.0=h7a5bd25_0 - libthrift=0.19.0=h026a170_1 - libutf8proc=2.8.0=h1a8c8d9_0 - libxml2=2.12.7=ha661575_1 - libzlib=1.3.1=hfb2fe0b_1 - llvm-openmp=18.1.8=hde57baf_0 - lz4-c=1.9.4=hb7217d7_0 - markdown-it-py=3.0.0=pyhd8ed1ab_0 - markupsafe=2.1.5=py311h05b510d_0 - 
matplotlib-inline=0.1.7=pyhd8ed1ab_0 - mdurl=0.1.2=pyhd8ed1ab_0 - more-itertools=10.3.0=pyhd8ed1ab_0 - ncurses=6.5=hb89a1cb_0 - numpy=2.0.0=py311h4268184_0 - omegaconf=2.3.0=pyhd8ed1ab_0 - openjdk=22.0.1=hbeb2e11_0 - openssl=3.3.1=hfb2fe0b_1 - orc=2.0.1=h47ade37_1 - packaging=24.1=pyhd8ed1ab_0 - pandas=2.2.2=py311h4b4568b_1 - parse=1.20.2=pyhd8ed1ab_0 - parso=0.8.4=pyhd8ed1ab_0 - pexpect=4.9.0=pyhd8ed1ab_0 - pickleshare=0.7.5=py_1003 - pip=24.0=pyhd8ed1ab_0 - pluggy=1.5.0=pyhd8ed1ab_0 - pre-commit-hooks=4.6.0=pyhd8ed1ab_0 - prompt-toolkit=3.0.47=pyha770c72_0 - ptyprocess=0.7.0=pyhd3deb0d_0 - pure_eval=0.2.2=pyhd8ed1ab_0 - py4j=0.10.9.7=pyhd8ed1ab_0 - pyarrow=16.1.0=py311h35c05fe_4 - pyarrow-core=16.1.0=py311hf5072a7_4_cpu - pycparser=2.22=pyhd8ed1ab_0 - pygments=2.18.0=pyhd8ed1ab_0 - pyproject_hooks=1.1.0=pyhd8ed1ab_0 - pysocks=1.7.1=pyha2e5f31_6 - pyspark=3.5.1=pyhd8ed1ab_0 - python=3.11.9=h932a869_0_cpython - python-build=1.2.1=pyhd8ed1ab_0 - python-dateutil=2.9.0=pyhd8ed1ab_0 - python-slugify=8.0.4=pyhd8ed1ab_0 - python-tzdata=2024.1=pyhd8ed1ab_0 - python_abi=3.11=4_cp311 - pytoolconfig=1.2.5=pyhd8ed1ab_0 - pytz=2024.1=pyhd8ed1ab_0 - pyyaml=6.0.1=py311heffc1b2_1 - re2=2023.09.01=h4cba328_2 - readline=8.2=h92ec313_1 - requests=2.32.3=pyhd8ed1ab_0 - rich=13.7.1=pyhd8ed1ab_0 - rope=1.13.0=pyhd8ed1ab_0 - ruamel.yaml=0.18.6=py311h05b510d_0 - ruamel.yaml.clib=0.2.8=py311h05b510d_0 - setuptools=70.1.1=pyhd8ed1ab_0 - six=1.16.0=pyh6c4a22f_0 - smmap=5.0.0=pyhd8ed1ab_0 - snappy=1.2.1=hd02b534_0 - stack_data=0.6.2=pyhd8ed1ab_0 - text-unidecode=1.3=pyhd8ed1ab_1 - tk=8.6.13=h5083fa2_1 - toml=0.10.2=pyhd8ed1ab_0 - tomli=2.0.1=pyhd8ed1ab_0 - traitlets=5.14.3=pyhd8ed1ab_0 - types-python-dateutil=2.9.0.20240316=pyhd8ed1ab_0 - typing-extensions=4.12.2=hd8ed1ab_0 - typing_extensions=4.12.2=pyha770c72_0 - tzdata=2024a=h0c530f3_0 - urllib3=2.2.2=pyhd8ed1ab_1 - wcwidth=0.2.13=pyhd8ed1ab_0 - wheel=0.43.0=pyhd8ed1ab_1 - xz=5.2.6=h57fd34a_0 - yaml=0.2.5=h3422bc3_2 - zipp=3.19.2=pyhd8ed1ab_0 - zstandard=0.22.0=py311h4a6b76e_1 - zstd=1.5.6=hb46c0d2_0 $ uv pip freeze antlr4-python3-runtime @ file:///home/conda/feedstock_root/build_artifacts/antlr-python-runtime-meta_1638309185939/work arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1696128962909/work asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1704011227531/work binaryornot==0.4.4 brotli @ file:///Users/runner/miniforge3/conda-bld/brotli-split_1695989934239/work build @ file:///home/conda/feedstock_root/build_artifacts/python-build_1711647311753/work cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1708987703938/work certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1720457958366/work/certifi cffi @ file:///Users/runner/miniforge3/conda-bld/cffi_1696001836291/work chardet @ file:///Users/runner/miniforge3/conda-bld/chardet_1695468775373/work charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work cookiecutter @ file:///home/conda/feedstock_root/build_artifacts/cookiecutter_1708608886262/work decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work dynaconf @ file:///home/conda/feedstock_root/build_artifacts/dynaconf_1711462314416/work 
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work findspark @ file:///home/conda/feedstock_root/build_artifacts/findspark_1644599740637/work fsspec @ file:///home/conda/feedstock_root/build_artifacts/fsspec_1719514913127/work gitdb @ file:///home/conda/feedstock_root/build_artifacts/gitdb_1697791558612/work gitpython @ file:///home/conda/feedstock_root/build_artifacts/gitpython_1711991025291/work graphlib-backport @ file:///home/conda/feedstock_root/build_artifacts/graphlib-backport_1635566048409/work h2 @ file:///home/conda/feedstock_root/build_artifacts/h2_1634280454336/work hpack==4.0.0 hyperframe @ file:///home/conda/feedstock_root/build_artifacts/hyperframe_1619110129307/work idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1713279365350/work importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1719171233697/work importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1711040877059/work ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1719582526268/work jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work kedro @ file:///home/conda/feedstock_root/build_artifacts/kedro_1717070510770/work kedro-databricks-dev==0.1.2 kedro-datasets @ file:///home/conda/feedstock_root/build_artifacts/kedro-datasets_1720352503469/work lazy-loader @ file:///home/conda/feedstock_root/build_artifacts/lazy_loader_1712342969017/work markdown-it-py @ file:///home/conda/feedstock_root/build_artifacts/markdown-it-py_1686175045316/work markupsafe @ file:///Users/runner/miniforge3/conda-bld/markupsafe_1706900210234/work matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work mdurl @ file:///home/conda/feedstock_root/build_artifacts/mdurl_1704317613764/work mergedeep==1.3.4 more-itertools @ file:///home/conda/feedstock_root/build_artifacts/more-itertools_1718048476694/work numpy @ file:///Users/runner/miniforge3/conda-bld/numpy_1718615086963/work/dist/numpy-2.0.0-cp311-cp311-macosx_11_0_arm64.whl omegaconf @ file:///home/conda/feedstock_root/build_artifacts/omegaconf_1670575376789/work packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1718189413536/work pandas @ file:///Users/runner/miniforge3/conda-bld/pandas_1715897646986/work parse @ file:///home/conda/feedstock_root/build_artifacts/parse_1718111026786/work parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work pip==24.0 platformdirs==4.2.2 pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1713667077545/work pre-commit-hooks @ file:///home/conda/feedstock_root/build_artifacts/pre-commit-hooks_1712432387324/work prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1718047967974/work ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work py4j @ file:///home/conda/feedstock_root/build_artifacts/py4j_1660381574436/work 
pyarrow==16.1.0 pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1711811537435/work pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work pyproject-hooks @ file:///home/conda/feedstock_root/build_artifacts/pyproject_hooks_1714415182721/work pysocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work pyspark @ file:///home/conda/feedstock_root/build_artifacts/pyspark_1709470827857/work python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work python-slugify @ file:///home/conda/feedstock_root/build_artifacts/python-slugify-split_1707425621764/work pytoolconfig @ file:///home/conda/feedstock_root/build_artifacts/pytoolconfig_1675124745143/work pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1706886791323/work pyyaml @ file:///Users/runner/miniforge3/conda-bld/pyyaml_1695373486380/work requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1717057054362/work rich @ file:///home/conda/feedstock_root/build_artifacts/rich-split_1709150387247/work/dist rope @ file:///home/conda/feedstock_root/build_artifacts/rope_1711296293824/work ruamel-yaml @ file:///Users/runner/miniforge3/conda-bld/ruamel.yaml_1707298177423/work ruamel-yaml-clib @ file:///Users/runner/miniforge3/conda-bld/ruamel.yaml.clib_1707314793681/work setuptools==70.1.1 six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work smmap @ file:///home/conda/feedstock_root/build_artifacts/smmap_1634310307496/work stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work text-unidecode @ file:///home/conda/feedstock_root/build_artifacts/text-unidecode_1694707102786/work toml @ file:///home/conda/feedstock_root/build_artifacts/toml_1604308577558/work tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1644342247877/work traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work types-python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/types-python-dateutil_1710589910274/work typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1717802530399/work tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1707747584337/work urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1719391292974/work wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work wheel==0.43.0 zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1718013267051/work zstandard==0.22.0 ```
astrojuanlu commented 4 months ago

Oh, of course: there are two `default` keys. I have no idea whether that's valid YAML, or why PyYAML reads it just fine without complaining...
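(For reference, the YAML spec requires mapping keys to be unique, so the generated file is technically invalid; PyYAML is simply lenient and keeps the last value, as shown above.)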

astrojuanlu commented 4 months ago

Well, the reason is that every time I run `kedro databricks init`, the same contents are appended to the file.

Renaming the issue.

JenspederM commented 4 months ago

You're completely right! I forgot about this bug; it's easily solved though.

Thanks for the description 😊

JenspederM commented 4 months ago

Fixed in #5. We now check whether the files exist and skip initialisation if they do.
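
For anyone curious, a minimal sketch of such a guard (the names are illustrative, not the actual kedro-databricks implementation):

from pathlib import Path

def init_config(path: Path, default_content: str) -> None:
    # Skip initialisation if the file already exists, so repeated
    # `kedro databricks init` runs can no longer append duplicate keys.
    if path.exists():
        print(f"{path} already exists, skipping initialisation.")
        return
    path.write_text(default_content)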