XAMPPRocky / tokei

Count your code, quickly.

Jupyter notebook (`.ipynb`) is blowing the number of lines out of proportion #1072

Open nfx opened 3 months ago

nfx commented 3 months ago

Jupyter notebooks store code as JSON-serialized lines, and this otherwise great tool produces incorrect code size estimates for them. See this example: tokei counts it as 496 lines of JSON, but in fact it contains 60 lines of Python code and 19 lines of Markdown.

import requests

# Fetch the notebook JSON at a pinned commit.
nb = requests.get("https://raw.githubusercontent.com/databrickslabs/mosaic/2ec5d9da032db0d8209e910d4378c959c8fc7ddc/docs/source/usage/grid-indexes.ipynb").json()
# For each cell type, add up len(line.split("\n")) over every entry in the cell's source.
markdown_lines = sum(sum(len(line.split("\n")) for line in cell['source']) for cell in nb['cells'] if cell['cell_type'] == 'markdown')
code_lines = sum(sum(len(line.split("\n")) for line in cell['source']) for cell in nb['cells'] if cell['cell_type'] == 'code')
print(markdown_lines + code_lines)
# Output: 76 (about 15% of 496)
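The same kind of count can be done offline on an already-parsed notebook. This is only a minimal sketch, assuming the nbformat 4 layout in which a cell's `source` is either a single string or a list of strings. The helper name `count_source_lines` and the sample notebook are hypothetical, made up for illustration. It uses `splitlines()` rather than `split("\n")`, because an entry ending in a newline would otherwise yield an extra empty piece.

```python
from collections import Counter


def count_source_lines(nb):
    """Count source lines per cell type in a notebook dict (hypothetical helper)."""
    counts = Counter()
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # Assumption: `source` is either one string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        counts[cell.get("cell_type", "unknown")] += len(text.splitlines())
    return counts


# Made-up sample notebook, for illustration only.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "source": ["import os\n", "print(os.getcwd())\n"], "outputs": []},
        {"cell_type": "code", "source": "x = 1\ny = 2"},
    ]
}

counts = count_source_lines(sample)
print(counts["code"], counts["markdown"])  # 4 2
```

Counting per cell type keeps Python and Markdown lines separate, which is the split the issue asks for.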

Why should we care? There are 5M+ Jupyter notebook files on GitHub.

XAMPPRocky commented 3 months ago

Thank you for your issue! Are you using an old version of tokei? Tokei has support for reading notebooks.

nfx commented 3 months ago

@XAMPPRocky then that number comes from the GitHub badge. For example, a couple of projects misreport millions of lines, and most of those are notebook output.
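One rough way to see how much of a notebook's serialized size is output rather than source is to compare the JSON line count with and without cell outputs. This is a sketch only, assuming nbformat 4 code cells keep their results under `outputs` and `execution_count`. `strip_outputs`, `json_line_count`, and the sample notebook are hypothetical, and `indent=1` is an assumption about how the file was written to disk.

```python
import copy
import json


def strip_outputs(nb):
    """Return a copy of the notebook dict with code-cell outputs removed."""
    stripped = copy.deepcopy(nb)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def json_line_count(nb):
    """Number of lines when the notebook is serialized as indented JSON."""
    return len(json.dumps(nb, indent=1).splitlines())


# Made-up notebook with one short code cell and a long stream output.
sample = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": ["print('hi')\n"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"] * 50}
            ],
        }
    ]
}

with_outputs = json_line_count(sample)
without_outputs = json_line_count(strip_outputs(sample))
print(with_outputs, without_outputs)
```

The gap between the two numbers approximates how many counted lines are output rather than code or Markdown.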

P.S. Blazing fast tool for counting XX million lines of code 🎊