CoreyMSchafer / code_snippets

MIT License

Repo is unnecessarily very large (504 MB) #190

Open bstrand opened 2 years ago

bstrand commented 2 years ago

Problem

This repo's size is currently 504 MB. This is an accessibility concern, in particular for users with slower or cost-metered internet. It also requires users to give up half a GB of disk to keep the repo in sync (or to be savvier about Git than your audience is likely to be). Finally, it sets a questionable example for people new to programming and version control.

Description

The repo's total size is 504 MB. About 290 MB of that (roughly 58% of the total, and about 90% of ./Python) comes from the 15 image files present in Python/Threading and Python/MultiProcessing; each of those directories holds 145 MB of .jpg files.

In Python/Threading, these image files are downloaded from Unsplash by the tutorial script, so there seems to be little value in having them checked in with the code.

For Python/MultiProcessing, the image files are used as input. Having them checked in to the repo is convenient, but not strictly necessary. Instead, the user could be asked to download them from Unsplash with Threading/download-images.py as a prerequisite. (Or allow them to use their own set of images by revising the tutorial script to target an arbitrary set of jpg's, e.g., *.jpg in a subdirectory.)

At the very least, these image files could be much smaller (~10x) with heavier compression. (NB the repo would need to be filtered afterwards to remove the large files from the commit history.)

code_snippets on master✔ » du -h -d 1 . | sort -rh | head
504M    .
318M    ./Python
161M    ./.git
 24M    ./Django_Blog
 28K    ./Terminal
 …
code_snippets on master✔ » du -h -d 1 ./Python | sort -rh | head
318M    ./Python
145M    ./Python/Threading
145M    ./Python/MultiProcessing
 20M    ./Python/Flask_Blog
4.8M    ./Python/Matplotlib
…
code_snippets on master✔ » du -hsc ./Python/Threading/*.jpg | tail -n1
145M    total
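
For anyone without `du` available (e.g. on Windows), the same per-directory breakdown can be approximated in Python. This is only a rough sketch: the function names `tree_size` and `sizes_by_child` are made up for illustration, and it sums apparent file sizes rather than disk blocks, so the numbers will not match the `du` output above exactly.

```python
import os
from pathlib import Path


def tree_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so a link's target is not counted as repo content.
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def sizes_by_child(root: Path) -> list[tuple[str, int]]:
    """Return (name, bytes) for each immediate subdirectory, largest first."""
    rows = [
        (child.name, tree_size(child))
        for child in root.iterdir()
        if child.is_dir() and not child.is_symlink()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Run from the repo root; prints sizes in decimal megabytes.
    for name, size in sizes_by_child(Path(".")):
        print(f"{size / 1_000_000:8.1f} MB  {name}")
```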

Suggestions

Minimally:

  1. Recompress all large jpg's in the repo to reduce their file size
  2. Filter the repo to remove the larger versions (≥2 MB) of the files from the git history. (Could make for a good tutorial.)
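
To decide which files to recompress, it may help to list the candidates first. The following is a minimal sketch, not anything that exists in the repo: `find_large_files` is a hypothetical helper, and the 2 MB cutoff is read as decimal megabytes, which is an assumption. It only scans the working tree, so files that exist solely in the commit history would still need to be found with a history-aware tool.

```python
import os
from pathlib import Path

# The ">= 2 MB" cutoff from the suggestion, assumed to mean decimal megabytes.
THRESHOLD_BYTES = 2 * 1000 * 1000


def find_large_files(root: Path, threshold: int = THRESHOLD_BYTES,
                     suffix: str = ".jpg") -> list[tuple[Path, int]]:
    """Return (path, bytes) for files with the given suffix at or above threshold."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into the object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() != suffix:
                continue
            size = path.stat().st_size
            if size >= threshold:
                hits.append((path, size))
    # Largest files first, since they are the best recompression targets.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files(Path(".")):
        print(f"{size / 1_000_000:8.1f} MB  {path}")
```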

Alternatively:

  1. Delete the .jpg files in Threading and MultiProcessing from the repo entirely, and have the user download the images with the Threading/download-images.py script.
  2. Replace the hard-coded file names in MultiProcessing with loading *.jpg from a subdirectory so users can use their own / an arbitrary set of images.
  3. Add input/output directories and exclude them in .gitignore
  4. Filter the repo to remove the larger versions of the files from the git history
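
Items 2 and 3 could look something like the sketch below. It is not taken from the tutorial script: the directory names `input_images` and `output_images` and both helper functions are placeholders chosen for illustration, and the actual image processing is left out.

```python
from pathlib import Path

# Hypothetical directory names; both would be listed in .gitignore.
INPUT_DIR = Path("input_images")
OUTPUT_DIR = Path("output_images")


def find_input_images(input_dir: Path) -> list[Path]:
    """Return every .jpg file directly inside input_dir, sorted by name."""
    return sorted(p for p in input_dir.glob("*.jpg") if p.is_file())


def output_path_for(image: Path, output_dir: Path) -> Path:
    """Map an input image to a file of the same name in output_dir."""
    return output_dir / image.name


if __name__ == "__main__":
    OUTPUT_DIR.mkdir(exist_ok=True)
    images = find_input_images(INPUT_DIR) if INPUT_DIR.is_dir() else []
    if not images:
        print(f"No .jpg files found in {INPUT_DIR}/; download or copy some in first.")
    for image in images:
        # The tutorial's processing step would go here.
        print(f"{image} -> {output_path_for(image, OUTPUT_DIR)}")
```

On case-sensitive filesystems the `*.jpg` pattern will not match files ending in `.JPG`, so the pattern may need widening if users bring their own images.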