StatCan / datascience-cookiecutter

A Cookiecutter template for Data Science Projects in Python
MIT License
7 stars 1 forks source link

detect-secrets hook fails against Jupyter Notebook #50

Closed vedantthapa closed 1 year ago

vedantthapa commented 1 year ago

A .ipynb file created using VSCode IDE contains a hash of the interpreter which is flagged as sensitive by the detect-secrets hook thereby causing it to fail. Here's a snippet using the cat <notebook-name>.ipynb command:

{
  [...]
  "metadata": {
    "interpreter": {
      "hash": "STRING-HASH"
    }, 
   [...]
  }
  [...]
}

This has already been reported at Yelp/detect-secrets#549. However, a workaround using line extensions has been suggested, would a PR implementing this be acceptable? / should we use a different strategy to handle this?

ToucheSir commented 1 year ago

I reported something similar in https://github.com/StatCan/datascience-cookiecutter/issues/8. It was handled in #35 by stripping out the relevant keys. I have to say I don't quite understand the proposed fix in https://github.com/Yelp/detect-secrets/issues/549#issuecomment-1151448733: is it excluding .ipynb files from secret scanning altogether?

vedantthapa commented 1 year ago

@ToucheSir I think the JSON hierarchy has changed and now metadata has an additional vscode key (see example.ipynb below) that is not included in: https://github.com/StatCan/datascience-cookiecutter/blob/1772e3362f09561034b7731e59556390531708fd/%7B%7B%20cookiecutter.repo_name%20%7D%7D/.pre-commit-config.yaml#L8-L11

example.ipynb ``` { "cells": [...], "metadata": { "language_info": { "name": "python", "version": "3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:28:38) [MSC v.1929 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "STRING-HASH" } } }, "nbformat": 4, "nbformat_minor": 2 } ```

This might be the reason that detect-secret hook fails because nbstripout is not stripping the hash value. Adding metadata.vscode to the args fixes the issue.

Regarding https://github.com/Yelp/detect-secrets/issues/549#issuecomment-1151448733, there are two separate hooks: the first one named detect-secrets excludes .ipynb files altogether and the second one named detect-secrets-jupyter excludes lines that contain the hash value. Its unclear to me as well that why --exclude-files arg has been used again in the second one.

In any case, my plan was to only add the --exclude-lines arg and its value at: https://github.com/StatCan/datascience-cookiecutter/blob/1772e3362f09561034b7731e59556390531708fd/%7B%7B%20cookiecutter.repo_name%20%7D%7D/.pre-commit-config.yaml#L67

I tried both of these fixes and they seem to resolve the original issue. Let me know what you think.

ToucheSir commented 1 year ago

It's possible nobody was using this template with VSCode-created (or edited?) notebooks, so this was never caught. Given that --exclude-lines does not appear to discriminate by file type, maybe it's safer to amend the nbstripout config to strip metadata.vscode.interpreter or metadata.vscode.interpreter.hash instead?

vedantthapa commented 1 year ago

@ToucheSir I was testing the fix on my personal laptop (because the work one won't boot up) and noticed that the raw contents of a Jupyter Notebook created using VSCode no longer have the vscode key inside of metadata as mentioned in my previous comment. Therefore, the detect-secrets hook never complains about the hash.

My guess is that it's probably due to the VSCode installation on my machine. Here are the details:

Version: 1.76.2 (Universal)
Commit: ee2b180d582a7f601fa6ecfdad8d9fd269ab1884
Date: 2023-03-14T17:54:09.061Z (3 days ago)
Electron: 19.1.11
Chromium: 102.0.5005.196
Node.js: 16.14.2
V8: 10.2.154.26-electron.0
OS: Darwin arm64 22.3.0
Sandboxed: No

I'll still submit a PR to resolve the issue for internal environments but just wanted to point this out.

ToucheSir commented 1 year ago

I have a theory about why that might be the case (shoot me a message if you'd like to discuss), but sounds good!