pre-commit numpydoc validation

ghost commented 6 months ago

Goal

To enforce the usage of numpydoc for documenting functions, we can leverage a pre-commit hook designed for this: https://numpydoc.readthedocs.io/en/latest/validation.html. This could be a heavy lift!

Context

All issues tagged with documentation.

Specifications

[ ] Setup the pre-commit hook to validate numpydoc
[ ] Update the model package once the pre-commit hook is running.

dylanhmorris commented 6 months ago

@AFg6K7h4fhy2 are you up for doing this?

AFg6K7h4fhy2 commented 6 months ago

@AFg6K7h4fhy2 are you up for doing this?

Yes, sounds good.

AFg6K7h4fhy2 commented 6 months ago

Situation is the following:

Created new branch called numpydoc
Added numpydoc to the ./model poetry environment
Updated the .pre-commit-config.yaml in ./

-   repo: https://github.com/numpy/numpydoc
    rev: v1.6.0
    hooks:
    -   id: numpydoc-validation

which works:

You can observe there are things we don't want touched (e.g., __init__.py, but the exclude/ignore system is what I am having trouble with. The numpydoc pre-commit link affords this starter for the pyproject.toml file

[tool.numpydoc_validation]
checks = [
    "EX01",
    "SA01",
    "ES01",
]
exclude = [  # don't report on objects that match any of these regex
    '\.undocumented_method$',
    '\.__repr__$',
    '\.__init__$',
]
override_SS05 = [  # override SS05 to allow docstrings starting with these words
    '^Process ',
    '^Assess ',
    '^Access ',
]

I added this to the pyproject.toml files in both pipeline and model. I then added '\.__init__$', to the exclude list in both .toml files. I then activated the shell environments and ran pre-commit run --all-files to see if the added directives worked. They did not seem to work in the sense I got the same numpydoc-validation pre-commit flags.

I will keep exploring getting the exclusion directives correctly before attempting to merge the branches. I will also try to address the numpydoc suggested changes. Alternatively, I can comment out the numpydoc section in the .pre-commit-config.yaml file and then merge, so that @gvegayoncdc can manipulate which flags to exclude (this seems less likely to produce wasted effort).

@gvegayoncdc @dylanhmorris

ghost commented 6 months ago

This will be a nice tool, but as I said earlier, it is heavy lifting! I suggest giving priority to the sphinx website before working more in this.

AFg6K7h4fhy2 commented 5 months ago

While using numpydocs in MSR is bottlenecked by 47, I tried playing around with numpydocs in my sandbox, but was unable to properly ignore some checks, despite trying for some time. Of course, in a mature pipeline, I expect to configure numpydocs deliberately, instead of catch-all ignoring, but I still want to understand how to do this, without using a toml configuration file. For visibility, I will detail the process here.

See the following links for more details:

The last link provides an answer, though I expect there is still a way to do what I want without having to use configuration in my pyproject.toml file.

The correct pre-commit specification (as of 2024-04-12) is this:

-   repo: https://github.com/numpy/numpydoc
    rev: v1.7.0
    hooks:
    -   id: numpydoc-validation

which returns the following in my sandbox (the last few lines)

Note also that for help one can run

python3 -m numpydoc.hooks.validate_docstrings --help

I am interested in the ignoring clause of this documentation:

I have tried the following (each args constituted a different attempt) under id in the pre-commit specification, yet only the first pattern works (element 1 and 2).

args = ['--ignore=GL08']
args = ['--ignore=EX01']
args = ['--ignore=GL08,EX01']
args = ['--ignore=GL08', '--ignore=EX01']
args = ['--ignore', 'GL08', '--ignore', 'EX01']

This was the pre-commit specification:

In

-   repo: https://github.com/numpy/numpydoc
    rev: v1.7.0
    hooks:
    -   id: numpydoc-validation
        args: ['--ignore=GL08']

So, how does one ignore multiple checks in the above specification?

Anyhow, this is what one can do within their toml file to exclude, EX01, GL08, SA01.

[tool.numpydoc_validation]
checks = [
    "all",  
    "EX01",
    "GL08",
    "SA01"
]

There is no urgency or real need with answering how to ignore checks within the pre-commit specification, I am simply curious, especially since it seems so easy with other pre-commit hooks such as with ruff, e.g.

-   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.3.6
    hooks:
      - id: ruff
        args: ['--ignore=E741', '--ignore=E731']

gvegayon commented 5 months ago

Have you asked AI to help you with that?

AFg6K7h4fhy2 commented 5 months ago

Have you asked AI to help you with that?

Yes, after trying the above, I did use the CDC AI tool, but it kept wanting me to create my own pre-commit hook instead. Given numpydocs low version and (relatively) lower star count, I think this is a situation similar to polars, i.e. one where there is a package representation dearth in the training data. Thank you for the prompting though.

Regardless, MSR can stick with toml configuration for numpydocs.

AFg6K7h4fhy2 commented 4 months ago

Pasting this here so we don't have to look at the site as often

ERROR_MSGS = {
    "GL01": "Docstring text (summary) should start in the line immediately "
    "after the opening quotes (not in the same line, or leaving a "
    "blank line in between)",
    "GL02": "Closing quotes should be placed in the line after the last text "
    "in the docstring (do not close the quotes in the same line as "
    "the text, or leave a blank line between the last text and the "
    "quotes)",
    "GL03": "Double line break found; please use only one blank line to "
    "separate sections or paragraphs, and do not leave blank lines "
    "at the end of docstrings",
    "GL05": 'Tabs found at the start of line "{line_with_tabs}", please use '
    "whitespace only",
    "GL06": 'Found unknown section "{section}". Allowed sections are: '
    "{allowed_sections}",
    "GL07": "Sections are in the wrong order. Correct order is: {correct_sections}",
    "GL08": "The object does not have a docstring",
    "GL09": "Deprecation warning should precede extended summary",
    "GL10": "reST directives {directives} must be followed by two colons",
    "SS01": "No summary found (a short summary in a single line should be "
    "present at the beginning of the docstring)",
    "SS02": "Summary does not start with a capital letter",
    "SS03": "Summary does not end with a period",
    "SS04": "Summary contains heading whitespaces",
    "SS05": "Summary must start with infinitive verb, not third person "
    '(e.g. use "Generate" instead of "Generates")',
    "SS06": "Summary should fit in a single line",
    "ES01": "No extended summary found",
    "PR01": "Parameters {missing_params} not documented",
    "PR02": "Unknown parameters {unknown_params}",
    "PR03": "Wrong parameters order. Actual: {actual_params}. "
    "Documented: {documented_params}",
    "PR04": 'Parameter "{param_name}" has no type',
    "PR05": 'Parameter "{param_name}" type should not finish with "."',
    "PR06": 'Parameter "{param_name}" type should use "{right_type}" instead '
    'of "{wrong_type}"',
    "PR07": 'Parameter "{param_name}" has no description',
    "PR08": 'Parameter "{param_name}" description should start with a '
    "capital letter",
    "PR09": 'Parameter "{param_name}" description should finish with "."',
    "PR10": 'Parameter "{param_name}" requires a space before the colon '
    "separating the parameter name and type",
    "RT01": "No Returns section found",
    "RT02": "The first line of the Returns section should contain only the "
    "type, unless multiple values are being returned",
    "RT03": "Return value has no description",
    "RT04": "Return value description should start with a capital letter",
    "RT05": 'Return value description should finish with "."',
    "YD01": "No Yields section found",
    "SA01": "See Also section not found",
    "SA02": "Missing period at end of description for See Also "
    '"{reference_name}" reference',
    "SA03": "Description should be capitalized for See Also "
    '"{reference_name}" reference',
    "SA04": 'Missing description for See Also "{reference_name}" reference',
    "EX01": "No examples section found",
}

AFg6K7h4fhy2 commented 4 months ago

For which checks to retain, I've decided to include, as a baseline, the following:

ERROR_MSGS = {
    "GL03": "Double line break found; please use only one blank line to "
    "separate sections or paragraphs, and do not leave blank lines "
    "at the end of docstrings",
    "GL08": "The object does not have a docstring",
    "SS01": "No summary found (a short summary in a single line should be "
    "present at the beginning of the docstring)",
    "PR01": "Parameters {missing_params} not documented",
    "PR02": "Unknown parameters {unknown_params}",
    "PR03": "Wrong parameters order. Actual: {actual_params}. "
    "Documented: {documented_params}",
    "PR04": 'Parameter "{param_name}" has no type',
    "PR07": 'Parameter "{param_name}" has no description',
    "RT01": "No Returns section found",
}

For functions in which there is no need for such checks (e.g., some validate() methods), I have manually flagged (e.g. via # numpydoc ignore=EX01,SA01,ES01) not to have the checks performed.

CDCgov / PyRenew