LSSTDESC / CCL

DESC Core Cosmology Library: cosmology routines with validated numerical accuracy
BSD 3-Clause "New" or "Revised" License

CCL git repository is insanely large #770

Closed EiffL closed 4 years ago

EiffL commented 4 years ago

A fresh clone of CCL as of 4/29/2020 weighs in at a whopping 230 MB \o/ The current files on master account for 63 MB; the rest is hidden in the git history.

So, I'm suspecting some of the benchmark data is to blame. I haven't kept a close eye on that, but I would say we should probably purge the data folder from the git history and instead track all large data files with git-lfs, to keep the repo light.
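To check that suspicion, there's a standard plain-git recipe for listing the largest blobs anywhere in the history. The demo setup below (throwaway repo, made-up file name `big.dat`) is only there so the pipeline is runnable as-is; in practice you'd run just the pipeline from inside a CCL clone:

```shell
# Demo setup: a throwaway repo with one 64 KB file, standing in for a real clone.
demo=$(mktemp -d) && cd "$demo" && git init -q
dd if=/dev/zero of=big.dat bs=1024 count=64 2>/dev/null
git add big.dat
git -c user.email=ci@example.com -c user.name=ci commit -qm "add data"

# The actual recipe: list every blob in history with its size, biggest last.
# (Note: the simple awk print truncates paths containing spaces.)
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $2, $3}' |
  sort -n |
  tail -20
```

On a real clone this immediately shows whether the benchmark data or the committed PDFs dominate the pack.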

Someone with more experience with current CCL should chime in though :-)

EiffL commented 4 years ago

Hum.... also, it looks like the pdfs are committed every time... that's about 500 KB each time, which quickly adds up.

c-d-leonard commented 4 years ago

So @EiffL, do you have a suggestion about what to do instead of committing the pdfs? I agree it's dumb (@rmjarvis just pointed this out too, and I opened issue #768). He said that his policy is to build the pdf himself in master pre-release, but we have always historically had a total embargo on anybody pushing anything to master. We could make an exception for this, but I'm curious whether you have another good workaround.

damonge commented 4 years ago

I would argue that no one other than ourselves is looking at the note, and that we could just remove its pdf from the repo (and ask users to compile it themselves)

EiffL commented 4 years ago

We "could" set up some Travis magic that executes a "make" in the doc folders upon, for instance, a new release. Not sure what the best trigger for this would be.

tilmantroester commented 4 years ago

Was the paper branch squash merged? Else there might be a lot of figures etc in the git history there.

rmjarvis commented 4 years ago

I think there is not much utility in having the pdf in the git repo. Perhaps you could periodically put a current version on the wiki. Say every time you release a new version, update the note on the wiki.

EiffL commented 4 years ago

(just need to make sure the paper compiles, through Travis I guess)

rmjarvis commented 4 years ago

You could certainly add that as a "unit" test. Run make in the paper directories and ensure that the pdf was generated.
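A minimal version of such a check could look like the sketch below. The `doc/note` path and target are stand-ins (I don't know CCL's actual paper layout), and the stub Makefile just simulates a LaTeX build so the sketch is runnable as-is; in CI the real `make` would invoke pdflatex:

```shell
# Hypothetical layout: doc/note/Makefile builds the note PDF.
# The stub Makefile here stands in for the real pdflatex invocation.
tmp=$(mktemp -d)
mkdir -p "$tmp/doc/note"
printf 'all:\n\ttouch note.pdf\n' > "$tmp/doc/note/Makefile"

# The actual test: run make, then fail loudly if no PDF came out.
make -C "$tmp/doc/note" >/dev/null
test -f "$tmp/doc/note/note.pdf" && echo "PDF build OK" || { echo "PDF build FAILED"; exit 1; }
```

This catches TeX errors at PR time without ever committing the generated PDF.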

c-d-leonard commented 4 years ago

I like the idea of having it on the wiki. @elisachisari do you have thoughts on this from your experience with the paper and note?

rmjarvis commented 4 years ago

FYI about the wiki, it has its own git repo, which in my experience is often easier to deal with than the web interface. cf. https://help.github.com/en/github/building-a-strong-community/adding-or-editing-wiki-pages#adding-or-editing-wiki-pages-locally

EiffL commented 4 years ago

So everyone, maybe also @beckermr: I have only used the BFG repo-cleaner (https://rtyley.github.io/bfg-repo-cleaner/) a few times to purge large files from git history.

It worked fine for me, but you have to be careful not to push from an old clone of the repo after cleaning because the git histories will be incompatible. I don't actually know what happens if you try to push, but probably nothing good.

The problem with a project like CCL is that a lot of people have a lot of cloned versions, and I'm afraid that someone might try to force-push an old clone and break everything.

Does anyone have more experience with purging files from an active git repo with lots of contributors?

beckermr commented 4 years ago

I have been down this road before. To be honest, this is something we are going to have to live with. Repo cleaning tools can help, but they are not failsafe. Aggressive cleaning of repos can make them unusable too.

rmjarvis commented 4 years ago

I had to do this once for Piff when someone pushed up some huge files intended for testing. After getting rid of them from history, the recommendation for everyone who had pulled this particular branch in the meanwhile was

git fetch
git rebase
git reflog expire --expire=now --all
git gc --aggressive --prune=now

For your case, I guess you'd want to do that on master and then rebase any extant branches to the new master. So best if you can manage to merge most things into master before doing this, so people only have a simple set of commands they need to do on their own home versions.

Also, I did this by hand with git filter-branch, which felt a bit scary but worked out ok. I bet a tool like BFG would be easier. (Is that a reference to Doom or Dahl, you think? Or neither?)

jablazek commented 4 years ago

@rmjarvis @EiffL @beckermr: throwing out a random idea. To avoid some of the issues with cleaning the repo's history, what if we just renamed and archived the current repo and started using a fresh one with the large files no longer tracked? Anyone developing from the current repo (with work not merged before we make this break) would need to cherry-pick individual files into a new PR. But there aren't that many developers, and this would ensure that future pushes don't conflict with the changed history. Maybe too extreme?

beckermr commented 4 years ago

I’m ok with the repo size. I’d just leave it.

EiffL commented 4 years ago

I think if we are careful we can trim the fat from this repo and nothing bad will happen; we can make backups and restore things afterwards if need be, just to be safe.

I could do it, but I would want one or two extra pairs of eyes on it, just to make sure I don't wreck everything by accident.

Alternatively, since apparently not many people have complained besides me, maybe it's not such a big deal at this point, so we could just adopt some good practices for now, like not committing the PDFs and data files, and revisit when/if it becomes an issue for more people.

I'll leave that choice to the admins :-) I know in theory how to technically do both back up/restore and cleaning, but have never attempted on an active repo, so I admit that I'm afraid of wrecking things.
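For what it's worth, the backup half is cheap: a mirror clone captures every ref and object, and restoring after a botched cleanup is just a `git push --mirror` back. A runnable sketch with a throwaway local repo standing in for CCL:

```shell
# Throwaway demo: create a repo, then back it up with a mirror clone.
work=$(mktemp -d) && cd "$work"
git init -q src
git -C src -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "init"

git clone -q --mirror src backup.git        # bare copy of ALL refs and objects
git -C backup.git rev-parse --verify HEAD >/dev/null && echo "backup OK"

# Restoring after a botched cleanup would be, from inside backup.git:
#   git push --mirror <remote-url>
```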

beckermr commented 4 years ago

> I'll leave that choice to the admins :-) I know in theory how to technically do both back up/restore and cleaning, but have never attempted on an active repo, so I admit that I'm afraid of wrecking things.

Yep. We should let this one die. I am going to close for now.

tilmantroester commented 4 years ago

On the other hand, this would probably be easier to pull off now, rather than in 5 years.

rmjarvis commented 4 years ago

For some perspective on this, it currently takes 38 seconds to clone the repo on my laptop. That's not completely insane, but it is already comparable to some much older git repos that also include some data files for testing and have far more commit history than CCL. GalSim is 40 sec and TreeCorr is 18 sec.

So I'd at least think about ways to staunch the bleeding. Stop including the pdf in the history going forward. And probably move some of the data files to a wiki or git-lfs repo, so that when you need to add more or update the existing ones, that will be the obvious place to do so, and it won't add to the CCL repo size.

beckermr commented 4 years ago

> Stop including the pdf in the history going forward. And probably move some of the data files to a wiki or git-lfs repo, so that when you need to add more or update the existing ones, that will be the obvious place to do so, and it won't add to the CCL repo size.

100% agreed!

c-d-leonard commented 4 years ago

So, I'm looking at doing the suggested staunching-of-bleeding now. @rmjarvis, I'm a little confused about your suggestion to move data files out of the repo, since they are used when running tests. How would that work?

damonge commented 4 years ago

For context, I checked yesterday and the class repo is >500 MB. Pretty crazy stuff.

rmjarvis commented 4 years ago

The main "stanching the bleeding" suggestion was to stop hosting the note pdf in the repo. Just have one of your unit tests build the pdf to make sure there aren't any tex errors, but don't commit the pdf.

As for how to use data files that are not in the repo: you can host them somewhere else, so people who simply clone ccl to use it don't necessarily need to download them as well. For instance, I keep a couple of larger files used in TreeCorr's tests on the repo of the TreeCorr wiki rather than the main TreeCorr code repo.

Here is a test that uses one of these: https://github.com/rmjarvis/TreeCorr/blob/releases/4.1/tests/test_gg.py#L969 And here is the get_from_wiki function I use to easily download it: https://github.com/rmjarvis/TreeCorr/blob/releases/4.1/tests/test_helper.py#L20

And in case you're not aware, the wiki associated with your main repo is also a git repo, so you can commit things there in the normal way. For CCL, use git clone https://github.com/LSSTDESC/CCL.wiki.git. cf. https://gist.github.com/subfuzion/0d3f19c4f780a7d75ba2

rmjarvis commented 4 years ago

BTW, feel free to copy wholesale the get_from_wiki function in TreeCorr if you find it useful. Just give the TreeCorr url you got it from in an inline comment for attribution.