NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
331 stars 33 forks source link

Add style check #3

Closed ayushdg closed 3 months ago

ayushdg commented 3 months ago

The aim of the PR is to add pre-commit configs and a GHA for style checks. The PR does not attempt to fix all the style failures present in the repo. That will be done in a followup PR.

The config ends up using isort, black & flake8 (some of it included in the contributing guide).

Happy to also explore ruff which should be a much faster drop in replacement for some of the tools mentioned above and has been adopted by many large python projects.

ryantwolf commented 3 months ago

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

ayushdg commented 3 months ago

Looks like the .pre-commit-config.yaml of both NeMo Aligner and NeMo are slightly different than ours. Can we change ours to match theirs exactly, or is there a reason ours should be different?

There's no reason for them to match exactly, some of it depends on the practices we want to follow for the repo so it's good to go over the differences:

  1. pre-commit-hooks: It makes sense to include the hooks included in these libraries, but probably makes sense to keep the additional Trailing-whitespace, end-of-file-fixer unless we find specific examples in the repo where it doesn't make sense. Full list of hooks here.
  2. Black: Pretty much identical
  3. Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.
  4. Flake8: This is personal preference, a lot of projects prefer some level of linting flake8 is one option that combines rules from PyCodeStyle and PyFlakes. black formatting arguably fixes some of these rules, but happy to omit this, don't have a strong preference.
  5. ci: I'm happy to explore using pre-commit.ci over the pre-commit action directly provided it doesn't require any additional enterprise approvals.
ryantwolf commented 3 months ago

Ok cool. Thanks for the clarification.

  1. Yup let's add the ones they use and keep the ones you already have.
  2. Good.
  3. Just looked into this tool. I like it so let's keep it too haha. Do we need to exclude any directories like aligner does?
  4. I'm fine with keeping it so long as we don't run into any issues.
  5. Yeah if you could look into that that could be good. Just to keep things in line with the rest of the repos. If it ends up being a hassle though let me know and we can stick with this.
ayushdg commented 3 months ago

Isort: I'd argue that adding the black profile should be the case since we're using black as recommended by isort.

Actually I stand corrected, other repos do use this but specify the config in their pyproject.toml file. I'll probably do the same.

ayushdg commented 3 months ago

This also has the side effect of pre-commit.ci auto formatting all the files. @ryantwolf Are you okay updating all the file formats in this PR?

ryantwolf commented 3 months ago

It had to happen at some point haha. Let's reformat the code 🫡

ayushdg commented 3 months ago

Had some weirdness during the rebasing with signoff. Will open a new PR