CamDavidsonPilon / Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
MIT License

Repo too large #19

Open CamDavidsonPilon opened 11 years ago

CamDavidsonPilon commented 11 years ago

Currently the repo is 25+ MB. That's ridiculous. Most of it is the data files associated with the chapters. Either

ghost commented 11 years ago

I suggest that binary data be stripped from notebooks under revision control, and that full versions be available directly via S3 or similar.

If there were a way to strip binary data from notebooks at the command line, you could automate this with a git commit hook (roughly along the lines of the sketch below). /cc @Carreau

That could also be used to strip binary data from the git history using git filter-branch, at the cost of having every existing fork diverge from git master.
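
For concreteness, a strip filter might look something like this (an untested sketch, assuming the current nbformat-3 JSON layout with a "worksheets" list):

```python
# Rough sketch of a strip filter: blank all code-cell outputs and prompt
# numbers in a notebook, in place. Assumes the nbformat-3 JSON layout
# ("worksheets" containing "cells"); adapt if the format changes.
import json
import sys

def strip_output(path):
    with open(path) as f:
        nb = json.load(f)
    for ws in nb.get("worksheets", []):
        for cell in ws.get("cells", []):
            if cell.get("cell_type") == "code":
                cell["outputs"] = []
                cell["prompt_number"] = None
    with open(path, "w") as f:
        json.dump(nb, f, indent=1)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        strip_output(path)
```

Wired into a pre-commit hook, something like that would keep outputs out of every commit.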

Carreau commented 11 years ago

I guess you are looking for this: https://gist.github.com/minrk/3719849 (found in the cookbook)

I personally don't like removing output, as the point of ipynb is to store all the computation. Without the output you can't actually read anything without executing the notebook, and it also becomes useless on nbviewer. The other thing is that images are base64-encoded, which adds ~33% overhead.

If you want smaller ipynb files you can try converting to msgpack. I suppose one could also write a backend where each ipynb is a directory with subdirectories and each cell is a file; git would then only track the cells that changed. We don't have enough manpower to bake this into IPython core, but we do our best to have a "pluggable" backend.
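
As a rough size experiment (nothing IPython itself reads; the msgpack-python package and the notebook name here are just for illustration):

```python
# Size experiment only: re-encode a notebook's JSON as msgpack and compare.
# Requires the msgpack-python package; "Chapter1.ipynb" is a placeholder name.
import json
import os
import msgpack

with open("Chapter1.ipynb") as f:
    nb = json.load(f)

with open("Chapter1.msgpack", "wb") as f:
    f.write(msgpack.packb(nb))

print(os.path.getsize("Chapter1.ipynb"), os.path.getsize("Chapter1.msgpack"))
```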

CamDavidsonPilon commented 11 years ago

An interesting idea @y-p, I'll look into this. I'm not worried about stripping data from git history.

ghost commented 11 years ago

I ran into something similar in the past, so I dug up my script. git gc has some issues you need to work around, but after stripping binaries (png, pdf, and csv), the end result is a 1.5 MB repo. You can add the csv files back zipped and read them using the zipfile module, but if they ever change, you'll eventually get bloat again.
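
Reading a zipped csv back without unpacking it on disk is straightforward; file names here are placeholders:

```python
# Read a csv straight out of a zip archive, keeping it in memory.
# "data.zip" and "train.csv" are placeholder names.
import zipfile
from io import BytesIO
import numpy as np

with zipfile.ZipFile("data.zip") as zf:
    raw = zf.read("train.csv")          # raw bytes of the csv

data = np.genfromtxt(BytesIO(raw), delimiter=",")
```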

I updated the above gist with the image-only stripping version, which also renumbers prompts to reduce diff noise. That will make it easier to merge PRs from contributors.

elofgren commented 11 years ago

Given this is something of a teaching and learning tool, is making the repo smaller actually a problem? I'd rather have everything be accessible and easily runnable than save what is, in all honesty, a pretty minor amount of space and bandwidth.

CamDavidsonPilon commented 11 years ago

@elofgren my main gripe is that there are a bunch of .csv files (e.g. Chapter5_LossFunctions\data\Train_Skies\Train_Skies) that most users will never look at, but that should be there for completeness. I just feel this won't scale if I want to add more (potentially larger) datasets. Users might not want to download/pull 100 MB+.

martijnvermaat commented 11 years ago

Just a wild suggestion, and I see some obvious downsides, but this would be a great use case for git-annex (similar setup).

It's too bad that GitHub cannot be used to actually store git-annex tracked files, but you could host them elsewhere and add them to git-annex by URL.

This way you'd still track revisions, have a small initial repository clone, and a (pretty) easy way of fetching some of the larger files (i.e. installing git-annex and using git annex get on the file you need).

xcthulhu commented 11 years ago

Have you tried using npz files?

Documentation: http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html#numpy.savez
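
For example (file and array names here are made up):

```python
# Bundle arrays into a single compressed .npz instead of shipping loose csvs.
# "Train_Skies.csv", "train_skies.npz", and the array name are illustrative.
import numpy as np

sky = np.loadtxt("Train_Skies.csv", delimiter=",")   # hypothetical csv
np.savez_compressed("train_skies.npz", sky=sky)

# Later, inside the notebook:
data = np.load("train_skies.npz")
sky = data["sky"]
```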

Another idea: you could make another git repo for the data, and then in your notebooks replace open() calls with urllib2.urlopen() calls. Then you could effectively download data on a per-notebook basis (and keep it in RAM until quitting) -- see the sketch after the docs link.

Documentation: http://docs.python.org/2/library/urllib2.html#urllib2.urlopen
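
Concretely, something like this (Python 2 urllib2, matching the docs above; the URL is made up):

```python
# Fetch a csv over HTTP into memory instead of reading it from the repo.
# Python 2 / urllib2 per the linked docs; the URL is a made-up example.
import urllib2
from io import BytesIO
import numpy as np

url = "https://raw.github.com/someuser/somedata-repo/master/train.csv"
raw = urllib2.urlopen(url).read()                 # bytes, kept in RAM
data = np.genfromtxt(BytesIO(raw), delimiter=",")
```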

cebe commented 10 years ago

Haven't read the whole discussion, but I think the main problem with the repo size comes from the type of files you have here. You are using one file per chapter, including all text and images at once, which makes each file about 1 MB. We are doing very minor edits to these files, and git stores a new version of a file each time its SHA-1 hash changes, so fixing a typo will grow the repo by about 1 MB, or a bit less depending on which files you edit and how well git compresses the old versions. So unless you change the storage format to something different, there is not much you can do about the repo size.

rsvp commented 7 years ago

There's nbstripout (https://github.com/kynan/nbstripout), which addresses shrinking git commits, including removal of notebook metadata. A lot of space is taken up by the graphics, so there's a focused issue for that: https://github.com/kynan/nbstripout/issues/58

However, the downside is that the user would have to run a notebook to regenerate the (beautiful) images -- that will hamper someone merely reading a rendered chapter for their education.

So the issue, apart from segregating the data into another repo, is the trade-off between size and pedagogy.