Add `pre-commit`

This PR adds pre-commit to this repository.

Git hook scripts are useful for identifying simple issues before submission to code review. We run our hooks on every commit to automatically point out issues in code such as missing semicolons, trailing whitespace, and debug statements. Pointing these issues out before code review lets the reviewer focus on the architecture of a change instead of wasting time on trivial style nitpicks.

As we created more libraries and projects we recognized that sharing our pre-commit hooks across projects is painful. We copied and pasted unwieldy bash scripts from project to project and had to manually change the hooks to work for different project structures.

We believe that you should always use the best industry standard linters. Some of the best linters are written in languages that you do not use in your project or have installed on your machine. For example, scss-lint is a linter for SCSS written in Ruby. If you’re writing a project in node, you should be able to use scss-lint as a pre-commit hook without adding a Gemfile to your project or understanding how to get scss-lint installed.

We built pre-commit to solve our hook issues. It is a multi-language package manager for pre-commit hooks. You specify a list of hooks you want and pre-commit manages the installation and execution of any hook written in any language before every commit. pre-commit is specifically designed to not require root access. If one of your developers doesn’t have node installed but modifies a JavaScript file, pre-commit automatically handles downloading and building node to run eslint without root.

In particular, the following hooks are used (@CodexVeritas please verify which ones you want or do not want):

check-added-large-files: Blocks commits that add files above a configurable size limit (set to 10000 kB via --maxkb in the config below).
check-yaml: Checks that YAML files parse (multi-document files are allowed).
check-toml: Checks that TOML files parse.
end-of-file-fixer: Ensures that each file ends with a single newline character.
mixed-line-ending: Detects and standardizes line endings to avoid mixing CRLF (\r\n) and LF (\n) within the same file.
trailing-whitespace: Removes whitespace characters at the end of each line.
ruff & black: Lint and format Python code for adherence to PEP 8. Two ruff rules are ignored:
ignore E741: the rule against ambiguous variable names such as l, O, and I (so ambiguous names are allowed).
ignore E731: the rule against assigning a lambda expression to a variable (so such assignments are allowed).
isort: Sorts and organizes imports according to PEP 8.
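
For reference, the two ignored ruff rules target patterns like the ones in this small sketch. The function and variable names are made up for illustration and do not come from this repository. With `--ignore=E741` and `--ignore=E731` in the hook arguments, neither pattern should be reported.

```python
# Illustrative only: the names below are hypothetical, not from this repository.


def total_length(lines):
    # E741 (ambiguous variable name): single-letter names such as `l`, `O`,
    # or `I` are normally flagged because they are easy to confuse with 1 and 0.
    l = 0
    for line in lines:
        l += len(line)
    return l


# E731 (lambda assignment): binding a lambda to a name is normally flagged
# in favour of a regular `def` statement.
square = lambda x: x * x

print(total_length(["ab", "cde"]))  # 5
print(square(4))  # 16
```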

repos:
################################################################################
# GENERAL HOOKS
################################################################################
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
    -   id: check-added-large-files
        args: ['--maxkb=10000']
    -   id: check-yaml
        args: [--allow-multiple-documents]
    -   id: check-toml
    -   id: end-of-file-fixer
    -   id: mixed-line-ending
    -   id: trailing-whitespace
################################################################################
# PYTHON SPECIFIC
################################################################################
-   repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
    -   id: black
        args: ['--line-length', '79']
-   repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
    -   id: isort
        args: ['--profile', 'black',
               '--line-length', '79']
-   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.7
    hooks:
    -   id: ruff
        args: ['--ignore=E741', '--ignore=E731', '--fix']
################################################################################
# TYPOS
################################################################################
-   repo: https://github.com/crate-ci/typos
    rev: v1.24.6
    hooks:
    -   id: typos
        args: ["--force-exclude"]
################################################################################
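
As a rough illustration of what three of the general hooks in this config do to a file, the following sketch applies similar fixes to a string. It is an approximation written for this description, not the hooks' actual implementation. In particular, always normalizing to LF is an assumption, since mixed-line-ending can also settle on whichever ending is most common in the file.

```python
def normalize_text(text: str) -> str:
    """Approximate the combined effect of three general-purpose fixers."""
    # mixed-line-ending: settle on a single line ending (LF assumed here).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # trailing-whitespace: strip spaces and tabs at the end of each line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # end-of-file-fixer: finish with exactly one newline; a file that holds
    # nothing but newlines becomes empty.
    body = "\n".join(lines).rstrip("\n")
    return body + "\n" if body else ""


print(repr(normalize_text("a  \r\nb\t\n\n\n")))  # 'a\nb\n'
```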

More hooks can be added. This PR also adds the typos hook, which detects spelling mistakes:

# Typos
-   repo: https://github.com/crate-ci/typos
    rev: v1.21.0
    hooks:
    -   id: typos
        args: ['--force-exclude']

Sometimes the reported errors are false positives. To handle these, define a file called _typos.toml that excludes certain patterns. For example, running pre-commit run --all-files with the typos hook enabled reports "PN" and "OPF" as errors, although both are acronyms we want to keep.

We do not want to fix "PN" or "OPF", so the _typos.toml file contains:

[default.extend-words]
# words that should not be corrected
PN = "PN"
OPF = "OPF"
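
The identity mapping works because an extend-words entry pairs a word with its correction, and a word mapped to itself is treated as valid. The toy lookup below mimics that behaviour in plain Python. It is not how typos is implemented, and the built-in suggestions for "PN" and "OPF" are placeholders, since the actual suggestions are not shown in this description.

```python
# Toy stand-in for the spell checker's built-in dictionary. The entries for
# "PN" and "OPF" are placeholders: the real suggestions are not known here.
BUILTIN_CORRECTIONS = {
    "caculate": "calculate",
    "PN": "<some suggestion>",
    "OPF": "<some suggestion>",
}

# Mirrors the [default.extend-words] table from _typos.toml.
EXTEND_WORDS = {"PN": "PN", "OPF": "OPF"}


def suggest(word):
    """Return a suggested correction, or None if the word is accepted."""
    merged = {**BUILTIN_CORRECTIONS, **EXTEND_WORDS}  # user entries win
    correction = merged.get(word)
    if correction is None or correction == word:
        return None
    return correction


print(suggest("caculate"))  # calculate
print(suggest("PN"))  # None
```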

Now "PN" and "OPF" are ignored when re-running pre-commit run --all-files. To ignore an entire file type (e.g. .bib files), use the [files] section of the _typos.toml file:

[files]
# file types that should not be evaluated
extend-exclude = [
    ".gitignore",
    ".pre-commit-config.yaml",
    "*.bib",
    "*.html",
    "*.js",
    "*.csl",
    "*.css",
    "*.ipynb",
    "*.json"
]
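
To get a feel for which paths such a list covers, the sketch below checks file names against the same patterns with Python's `fnmatch`. This is only an approximation: typos applies its own gitignore-style glob matching, and matching on the file name alone is a simplifying assumption. The example paths other than base_rate_responder.py are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Same patterns as the extend-exclude list in _typos.toml.
EXTEND_EXCLUDE = [
    ".gitignore",
    ".pre-commit-config.yaml",
    "*.bib",
    "*.html",
    "*.js",
    "*.csl",
    "*.css",
    "*.ipynb",
    "*.json",
]


def is_excluded(path: str) -> bool:
    """Return True if the file name matches any exclude pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in EXTEND_EXCLUDE)


print(is_excluded("docs/references.bib"))  # True
print(is_excluded("src/forecasting/sub_question_responders/base_rate_responder.py"))  # False
```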

More information can be found in the PR.

Examples of existing errors that typos found in this repository. Because the hook's args are overridden with only --force-exclude (leaving out --write-changes), the errors are reported rather than fixed automatically.

error: `caculate` should be `calculate`
  --> tests/cheap/test_ai_models/test_models_tracking_token_cost.py:68:38
   |
68 |     predicted_prompt_cost_v1 = model.caculate_cost_from_tokens(
   |                                      ^^^^^^^^
   |
error: `caculate` should be `calculate`
  --> tests/cheap/test_ai_models/test_models_tracking_token_cost.py:86:42
   |
86 |     predicted_completion_cost_v1 = model.caculate_cost_from_tokens(
   |                                          ^^^^^^^^
   |
error: `higlight` should be `highlight`
  --> front_end/mokoresearch_site/app_pages/forecaster_page.py:51:140
   |
51 |             "Enter the information for your question. Exa.ai is used to gather up to date information. Each citation attempts to link to a higlight of the a ~4 sentence quote found with Exa.ai. This project is in beta some inaccuracies are expected."
   |                                                                                                                                            ^^^^^^^^
   |
error: `occurances` should be `occurrences`
  --> src/forecasting/sub_question_responders/base_rate_responder.py:42:123
   |
42 |     DESCRIPTION_OF_WHEN_TO_USE = "Use this responder when online information is needed about historical rates, historical occurances, and future probabilities"
   |                                                                                                                           ^^^^^^^^^^
   |
error: `succesfully` should be `successfully`
  --> src/forecasting/sub_question_responders/base_rate_responder.py:323:150
    |
323 |             For instance when predicting whether Apple will get sued related to a recent lawsuit, it is more useful to know how often Apple has been succesfully sued for patent violations per time they are sued (event) than per day.
    |                                                                                                                                                      ^^^^^^^^^^^
    |
error: `defintion` should be `definition`
  --> src/forecasting/sub_question_responders/base_rate_responder.py:419:140
    |
419 |             A valid question is one that is about base rates or how often something has happened in the past. Remember, be loose with your defintion. We are just trying to remove clearly off topic questions, or prompt leaking.
    |                                                                                                                                            ^^^^^^^^^

CodexVeritas / forecasting-tools

Add `pre-commit` #2