datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com

Add pre-commit hook configuration #280

Closed burakince closed 5 days ago

burakince commented 5 days ago

This pull request implements issue #279.

Currently, the main branch (commit hash: 0c139521a680b22f1bcbb08bab5250d4197a1265) and tag v0.10.8 fail with the following error:

Traceback (most recent call last):
  File "/var/folders/pb/3h4kqhbs01g8qqhj9vpp8kvc0000gn/T/tmplr11kfy7/repo36bh8g_4/py_env-python3.12/bin/datacontract", line 5, in <module>
    from datacontract.cli import app
  File "/private/var/folders/pb/3h4kqhbs01g8qqhj9vpp8kvc0000gn/T/tmplr11kfy7/repo36bh8g_4/py_env-python3.12/lib/python3.12/site-packages/datacontract/cli.py", line 16, in <module>
    from datacontract import web
  File "/private/var/folders/pb/3h4kqhbs01g8qqhj9vpp8kvc0000gn/T/tmplr11kfy7/repo36bh8g_4/py_env-python3.12/lib/python3.12/site-packages/datacontract/web.py", line 7, in <module>
    from datacontract.data_contract import DataContract, ExportFormat
  File "/private/var/folders/pb/3h4kqhbs01g8qqhj9vpp8kvc0000gn/T/tmplr11kfy7/repo36bh8g_4/py_env-python3.12/lib/python3.12/site-packages/datacontract/data_contract.py", line 7, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

Tag v0.10.7, on the other hand, works well. You can find an example here.

To try the idea out, use the following configuration in your .pre-commit-config.yaml:

repos:
  - repo: https://github.com/burakince/datacontract-cli
    rev: "v0.10.7-rc1"
    hooks:
      - id: datacontract-linting

Additionally, you can test local changes to the pre-commit hooks against an example test repository. From that repository, run:

pre-commit try-repo ../datacontract-cli datacontract-linting --verbose --all-files

jochenchrist commented 5 days ago

> Currently, main branch (Commit hash: https://github.com/datacontract/datacontract-cli/commit/0c139521a680b22f1bcbb08bab5250d4197a1265) and tag v0.10.8 are getting the following error.

As of v0.10.8, the library uses optional extras. You can fix the error with pip install datacontract-cli[all].

burakince commented 5 days ago

Hi @jochenchrist ,

I'd like to explain how pre-commit works. Generally, you don't need to install anything except pre-commit itself for pre-commit hooks. It automatically installs all the required packages as needed. In our case, it attempts to install the datacontract-cli Python package, but it seems that additional packages (e.g., pyspark) are causing installation issues. These extra packages shouldn't be required for the default installation, so there might be a misconfiguration somewhere. Notably, version v0.10.7 works perfectly fine.

In short, users never need to run the pip install datacontract-cli command for the pre-commit system. Pre-commit installs it in the background.

Best regards, Burak
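
To illustrate the mechanism described above, here is a minimal sketch of what a Python hook definition in .pre-commit-hooks.yaml could look like. Only the id datacontract-linting appears in this thread; the name, description, entry, and types values are assumptions for illustration and may differ from what the pull request actually contains.

```yaml
# .pre-commit-hooks.yaml (illustrative sketch, not the actual file from this PR)
- id: datacontract-linting            # id used in the example configuration above
  name: Data Contract Linter          # assumed display name
  description: Lints data contract files with the datacontract CLI  # assumed
  entry: datacontract lint            # assumed entry; pre-commit appends matched file names
  language: python                    # pre-commit builds an isolated virtualenv and installs the hook repository into it
  types: [yaml]                       # assumed file filter
```

With language: python, pre-commit installs the hook repository itself into its own environment, which is why a missing dependency such as pyspark shows up as an import error the first time the hook runs rather than at a manual pip install step.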

burakince commented 5 days ago

These definitions cover only Python usage. What do you think, @jochenchrist: should we also define hook ids for Docker image usage?

The configuration could be as below:

-   id: datacontract-lint-docker
    name: Data Contract Linter with Docker Image
    description: This hook lints the data contract with the official Docker image
    entry: datacontract/cli lint
    # pre-commit interprets `files` as a Python regular expression, not a glob
    files: 'datacontract.*\.yaml$'
    language: docker_image
    types: [yaml]
    minimum_pre_commit_version: 0.15.0

-   id: datacontract-test-docker
    name: Data Contract Tester with Docker Image
    description: This hook tests the data contract with the official Docker image
    entry: datacontract/cli test
    # pre-commit interprets `files` as a Python regular expression, not a glob
    files: 'datacontract.*\.yaml$'
    language: docker_image
    types: [yaml]
    minimum_pre_commit_version: 0.15.0

jochenchrist commented 5 days ago

> Hi @jochenchrist ,
>
> I'd like to explain how pre-commit works. Generally, you don't need to install anything except pre-commit itself for pre-commit hooks. It automatically installs all the required packages as needed. In our case, it attempts to install the datacontract-cli Python package, but it seems that additional packages (e.g., pyspark) are causing installation issues. These extra packages shouldn't be required for the default installation, so there might be a misconfiguration somewhere. Notably, version v0.10.7 works perfectly fine.
>
> In short, users never need to run the pip install datacontract-cli command for the pre-commit system. Pre-commit installs it in the background.
>
> Best regards, Burak

I understand; however, that clashes with the goal of using optional dependencies to reduce the size of the dependency tree. A workaround could be to use additional_dependencies:

additional_dependencies:
  - datacontract-cli[all]

burakince commented 5 days ago

Hi @jochenchrist ,

I have added all extras as additional dependencies and tested them locally. Everything is working fine now. We should consider finding a better solution in the future.

Best regards, Burak
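
For anyone who needs the same workaround on the consumer side, additional_dependencies can also be set per hook in .pre-commit-config.yaml. The snippet below is a sketch under assumptions: the rev value is a placeholder, and it presumes the chosen revision ships the datacontract-linting hook.

```yaml
# .pre-commit-config.yaml (illustrative sketch)
repos:
  - repo: https://github.com/datacontract/datacontract-cli
    rev: "<tag-or-commit>"            # placeholder: use a revision that contains .pre-commit-hooks.yaml
    hooks:
      - id: datacontract-linting
        additional_dependencies:
          - datacontract-cli[all]     # pulls in the optional extras (e.g. pyspark) alongside the hook
```

This keeps the default installation small for regular users, at the cost of a larger environment for the pre-commit hook.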

jochenchrist commented 5 days ago

OK, everything's fine now, thanks for your contribution!