Closed. EiffL closed this issue 4 years ago.
Hum.... also, looks like the pdfs are committed every time... that's about 500 KB each time, which quickly adds up.
So @EiffL do you have a suggestion about what to do instead of committing the pdfs? I agree it's dumb (@rmjarvis just pointed this out too and I opened an issue #768 ) - he said that his policy is to just make the pdf himself in master pre-releases, however we have always historically had a total embargo on anybody pushing anything to master. We could make an exception for this, but I'm curious if you have another good workaround.
I would argue that no one other than ourselves is looking at the note, and that we could just remove its pdf from the repo (and ask users to compile it themselves)
We "could" set up some Travis magic that executes a "make" in the doc folders upon, for instance, a new release. Not sure what the best trigger for this would be.
Was the paper branch squash merged? Else there might be a lot of figures etc in the git history there.
I think there is not much utility in having the pdf in the git repo. Perhaps you could periodically put a current version on the wiki. Say every time you release a new version, update the note on the wiki.
(just need to make sure the paper compiles, through travis I guess)
You could certainly add that as a "unit" test. Run make in the paper directories and ensure that the pdf was generated.
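The build-the-pdf-in-CI idea could look roughly like this. This is only a sketch: the doc/0000-ccl_note path, the make target, and the tex package list are all guesses about the CCL layout and would need adjusting.

```yaml
# .travis.yml fragment (sketch): build the note in CI so tex errors fail the
# build, without ever committing the resulting pdf.
matrix:
  include:
    - name: "build note pdf"
      language: generic
      addons:
        apt:
          packages:
            - texlive-latex-extra        # whatever tex packages the note actually needs
      script:
        - make -C doc/0000-ccl_note      # path is a guess; point at the real note folder
        - ls doc/0000-ccl_note/*.pdf     # fail the job if no pdf was produced
```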
I like the idea of having it on the wiki. @elisachisari do you have thoughts on this from your experience with the paper and note?
FYI about the wiki, it has its own git repo, which in my experience is often easier to deal with than the web interface. cf. https://help.github.com/en/github/building-a-strong-community/adding-or-editing-wiki-pages#adding-or-editing-wiki-pages-locally
So everyone, maybe also @beckermr: I have used the BFG repo-cleaner (https://rtyley.github.io/bfg-repo-cleaner/) a few times to purge large files from git history.
It worked fine for me, but you have to be careful not to push from an old clone of the repo after cleaning because the git histories will be incompatible. I don't actually know what happens if you try to push, but probably nothing good.
The problem with a project like CCL is that a lot of people have a lot of cloned versions, and I'm afraid that someone might try to force a push of an old clone and break everything.
Does anyone have more experience with purging files from an active git repo with lots of contributors?
I have been down this road before. To be honest, this is something we are going to have to live with. Repo cleaning tools can help, but they are not failsafe. Aggressive cleaning of repos can make them unusable too.
I had to do this once for Piff when someone pushed up some huge files intended for testing. After getting rid of them from history, the recommendation for everyone who had pulled that particular branch in the meantime was
git fetch
git rebase
git reflog expire --expire=now --all
git gc --aggressive --prune=now
For your case, I guess you'd want to do that on master and then rebase any extant branches to the new master. So best if you can manage to merge most things into master before doing this, so people only have a simple set of commands they need to do on their own home versions.
Also, I did this by hand with git filter-branch, which felt a bit scary, but worked out ok. I bet a tool like bfg would be easier. (Is that a reference to Doom or Dahl, you think? Or neither?)
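To make the filter-branch route less scary, here is a self-contained rehearsal in a throwaway repo. The file name note.pdf is just a stand-in; on the real repo you would run the same filter against the actual pdf paths, after backing everything up.

```shell
# Throwaway demo of purging a file from history with git filter-branch.
# Everything happens in a temp repo, so nothing real is at risk.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.name demo
git config user.email demo@example.com
echo "fake pdf bytes" > note.pdf
echo "real code" > code.py
git add note.pdf code.py
git commit -q -m "add files"

# Rewrite every commit, dropping note.pdf from the index as we go.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  'git rm --cached -q --ignore-unmatch note.pdf' -- --all

# Drop the backup refs filter-branch leaves behind, then expire the reflog
# and gc so the old objects are actually reclaimed.
git for-each-ref --format='%(refname)' refs/original |
  xargs -n1 git update-ref -d
git reflog expire --expire=now --all
git gc --aggressive --prune=now -q
```

After this, note.pdf no longer appears anywhere in `git log --all --name-only`, while code.py survives.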
@rmjarvis @EiffL @beckermr : throwing out a random idea. To avoid some of the issues with cleaning the repo's history, what if we just renamed and archived the current repo and started using a fresh one with the large files no longer tracked? Anyone developing from the current repo (with work not merged before we do this break) would need to cherry pick individual files into a new PR. But there aren't that many developers, and this would ensure that that future pushes don't conflict with the changed history. Maybe too extreme?
I’m ok with the repo size. I’d just leave it.
I think if we are careful we can trim the fat from this repo and nothing bad will happen; we can make backups and restore things afterwards if need be, just to be safe.
I could do it, but I would want one or two extra pairs of eyes on it, just to make sure I don't wreck everything by accident.
Alternatively, since apparently not many people have complained besides me, maybe it's not such a big deal at this point, so we could just adopt some good practices for now, like not committing the PDFs and data files, and revisit when/if it becomes an issue for more people.
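One concrete version of "not committing the PDFs" is a .gitignore next to the tex sources that keeps latex build products out of the repo. The location and extension list are assumptions about how the note is built:

```
# doc/.gitignore (sketch): keep latex build products out of the repo
*.pdf
*.aux
*.log
*.out
*.bbl
*.blg
```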
I'll leave that choice to the admins :-) I know in theory how to technically do both back up/restore and cleaning, but have never attempted on an active repo, so I admit that I'm afraid of wrecking things.
Yep. We should let this one die. I am going to close for now.
On the other hand, this would probably be easier to pull off now, rather than in 5 years.
For some perspective on this, it currently takes 38 seconds to clone the repo on my laptop. That's not completely insane, but it is already comparable to some much older git repos that also include some data files for testing and have far more commit history than CCL. GalSim is 40 sec and TreeCorr is 18 sec.
So I'd at least think about ways to staunch the bleeding. Stop including the pdf in the history going forward. And probably move some of the data files to a wiki or git-lfs repo so when you need to add more or update the existing ones, that will be the obvious place to do so, and it won't add to the CCL repo size.
> Stop including the pdf in the history going forward. And probably move some of the data files to a wiki or git-lfs repo so when you need to add more or update the existing ones, that will be the obvious place to do so, and it won't add to the CCL repo size.
100% agreed!
So, I'm looking at doing the suggested staunching-of-bleeding now. @rmjarvis - I'm a little confused about your suggestion to move data files out of the repo, since they are used when running tests? How would that work?
For context, I checked yesterday and the class repo is >500 MB. Pretty crazy stuff.
The main "staunching the bleeding" suggestion was to stop hosting the note pdf in the repo. Just have one of your unit tests build the pdf to make sure there aren't any tex errors, but don't commit the pdf.
As for how to use data files not in the repo, you can have them hosted somewhere else, so people who simply clone ccl to use it don't necessarily need to download them as well. For instance, I keep a couple of larger files used in TreeCorr's tests on the TreeCorr wiki's repo rather than the main TreeCorr code repo.
Here is a test that uses one of these: https://github.com/rmjarvis/TreeCorr/blob/releases/4.1/tests/test_gg.py#L969 And here is the get_from_wiki function I use to easily download it: https://github.com/rmjarvis/TreeCorr/blob/releases/4.1/tests/test_helper.py#L20
And if you're not aware, the wiki page associated with your main repo is also a git repo, so you can commit things there in the normal way. For CCL, use git clone https://github.com/LSSTDESC/CCL.wiki.git. cf. https://gist.github.com/subfuzion/0d3f19c4f780a7d75ba2
BTW, feel free to copy the get_from_wiki function from TreeCorr wholesale if you find it useful. Just give the TreeCorr URL you got it from in an inline comment for attribution.
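For anyone who wants the same trick without the Python, here is a shell analogue of get_from_wiki: a helper that downloads a test data file into a local cache directory only if it isn't already there. The fetch_test_data name and the tests/data cache path are made up for the example.

```shell
# Sketch of a get_from_wiki-style download helper (TreeCorr's original is Python).
fetch_test_data() {
  # $1 = file name, $2 = base url of the hosted data.
  # Downloads into tests/data/ only if the file is not already cached.
  local name=$1 base=$2
  local dest="tests/data/$name"
  mkdir -p tests/data
  if [ ! -f "$dest" ]; then
    curl -fsSL -o "$dest" "$base/$name"
  fi
  echo "$dest"
}
```

With GitHub wikis, raw files are typically served from a URL like https://raw.githubusercontent.com/wiki/LSSTDESC/CCL/, so a call might look like `fetch_test_data bench.dat https://raw.githubusercontent.com/wiki/LSSTDESC/CCL` (untested; assumes the file actually exists on the wiki).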
A fresh clone of CCL as of 4/29/2020 weighs in at a whopping 230 MB \o/ The current files on master account for 63 MB, and the rest is hidden in the git history.
So I suspect some of the benchmark data is to blame. I haven't kept a close eye on that, but I would say we should probably purge the data folder from the git history, and instead use git-lfs for all data or massive files, to keep the repo light. Someone with more experience with current CCL should chime in though :-)
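If the data folder does move to git-lfs, the tracking itself is just a .gitattributes entry; this is roughly what `git lfs track "benchmarks/data/**"` would write, with the path being a guess at where CCL keeps its benchmark data:

```
# .gitattributes (sketch): route benchmark data through git-lfs
benchmarks/data/** filter=lfs diff=lfs merge=lfs -text
```

Note everyone cloning would then need git-lfs installed to get the actual file contents rather than lfs pointer stubs.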