Closed chrisgorgo closed 5 years ago
Perhaps using something like this? --> https://rtyley.github.io/bfg-repo-cleaner/
While searching the .git directory using the method below (stackoverflow), I also found a couple of other large files that might need to be purged.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
This output is provided by adding | tail -n 20 to the code above:
5c467907f207 3,1MiB ds000246/sub-0001/meg/sub-0001_task-AEF_run-02_meg.ds/sub-0001_task-AEF_run-02_meg.res4
477443f9f473 7,4MiB ieeg_visual/sub-01/ses-01/anat/sub-01_ses-01_T1w_pial.R.surf.gii
5eb83321187d 15MiB ds000117/derivatives/mriqc/reports/sub-11_ses-mri_acq-mprage_T1w.html
0fd32b82227a 15MiB ds000117/derivatives/mriqc/reports/sub-08_ses-mri_acq-mprage_T1w.html
ff7a25de846a 15MiB ds000117/derivatives/mriqc/reports/sub-14_ses-mri_acq-mprage_T1w.html
cc896c93e5b5 16MiB ds000117/derivatives/mriqc/reports/sub-12_ses-mri_acq-mprage_T1w.html
12be3e03924f 16MiB ds000117/derivatives/mriqc/reports/sub-07_ses-mri_acq-mprage_T1w.html
08aeb541f638 16MiB ds000117/derivatives/mriqc/reports/sub-13_ses-mri_acq-mprage_T1w.html
f3dffd12c007 16MiB ds000117/derivatives/mriqc/reports/sub-15_ses-mri_acq-mprage_T1w.html
42cbfd14be3c 16MiB ds000117/derivatives/mriqc/reports/sub-09_ses-mri_acq-mprage_T1w.html
12a27e26f2fe 16MiB ds000117/derivatives/mriqc/reports/sub-03_ses-mri_acq-mprage_T1w.html
448553a5b8ee 16MiB ds000117/derivatives/mriqc/reports/sub-10_ses-mri_acq-mprage_T1w.html
765f03490b18 17MiB ds000117/derivatives/mriqc/reports/sub-05_ses-mri_acq-mprage_T1w.html
aeb918749ffa 17MiB ds000117/derivatives/mriqc/reports/sub-02_ses-mri_acq-mprage_T1w.html
81916e6f48bd 17MiB ds000117/derivatives/mriqc/reports/sub-04_ses-mri_acq-mprage_T1w.html
aed7fb66215c 18MiB ds000117/derivatives/mriqc/reports/sub-06_ses-mri_acq-mprage_T1w.html
4ff87f522bb5 18MiB ds000117/derivatives/mriqc/reports/sub-01_ses-mri_acq-mprage_T1w.html
203ec7f85d65 19MiB ds000117/derivatives/mriqc/reports/sub-16_ses-mri_acq-mprage_T1w.html
015f6c374986 49MiB ieeg_visual/stimuli/sub-01_ses-01_task-visual_run-01_stimuli.mat
00495968804c 83MiB ds000246/sub-emptyroom/meg/sub-emptyroom_task-noise_run-01_meg.ds/sub-emptyroom_task-noise_run-01_meg.meg4
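To preview which blobs a size-based purge would touch, the same pipeline can be filtered by a byte threshold. This is only a sketch: the function name `list_blobs_over` is made up for illustration, and the default of 1048576 bytes assumes that a "1M" cutoff is meant as 1 MiB.

```shell
# List every blob in the history that is larger than a byte threshold.
# Usage (inside a repository): list_blobs_over [BYTES]   (default: 1048576)
# Prints "<size-in-bytes> <path>", smallest first.
# NOTE: list_blobs_over is a hypothetical helper name, not part of git.
list_blobs_over() {
  threshold="${1:-1048576}"
  git rev-list --objects --all \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk -v min="$threshold" '
        $1 == "blob" && $2 + 0 > min + 0 {
          size = $2
          sub(/^blob [0-9]+ /, "")   # keep only the path, which may contain spaces
          print size, $0
        }' \
    | sort -n
}
```

Running `list_blobs_over 10485760` would, for example, restrict the listing to blobs above 10 MiB.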
We can also take this opportunity and delete some stale branches:
I am talking about the lower 4 in the dropdown menu of the image. Any good reason not to clean that up? What exactly are those chrisfilo-patch* branches, @chrisfilo? Also to be cleaned?
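If the stale branches do get cleaned up, something along these lines could help. It is a sketch under assumptions: the remote is called `origin`, the main branch is `master`, and both function names are invented for illustration.

```shell
# List remote branches that are already fully merged into <remote>/master.
# These are candidates for deletion, not a verdict.
# NOTE: both function names are hypothetical helpers.
list_merged_remote_branches() {
  remote="${1:-origin}"
  git fetch --prune "$remote" >/dev/null 2>&1
  git branch -r --merged "$remote/master" \
    | sed 's/^[* ]*//' \
    | grep "^$remote/" \
    | grep -v -e "^$remote/master\$" -e "^$remote/HEAD"
}

# Delete the named branches on the remote.
# Usage: delete_remote_branches origin some-stale-branch another-stale-branch
delete_remote_branches() {
  remote="$1"
  shift
  git push "$remote" --delete "$@"
}
```

Branches that were never merged will not show up in the first listing, so those still need a manual look before deleting.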
I think the purging of the history should be done after the iEEG and EEG branches are merged. This will simplify this whole procedure.
In the past, we accidentally pushed huge files to the repository. They were purged in #124 ... but they are still in the git history.
The repository is ~10 times as big as it should be (570MB) ... so it's slow for cloning
git clone --depth 1, and the cloning will be fast
push upstream master, and not a PR
This is a big question, so we definitely need input from several people on this.
@effigies @yarikoptic @tyarkoni @robertoostenveld @choldgraf @dorahermes
I'm a fan of options 2 and 3 - I think having a repository this large makes it prohibitive for most people to download unless they're really motivated to do so...
Agree with @choldgraf. (2) is pretty extreme though. I guess we could start with (3) and fall back on (2) if things go awry or it turns out to be more difficult than anticipated.
(3) sounds good. There are only two PRs to rebase onto the new history, so helping people (one of them being me...) do that will not be a huge burden.
I wondered (since I have never used it myself) whether the "git graft" mechanism could be used in addition to 3 to mark some commit(s) in the past of the old "heavy" history to match corresponding ones in the "new light" history. That should (theoretically, if my understanding is correct) allow people to proceed with their existing clones/histories as if "nothing has happened", while new objects from the new history would come to replace old(er) ones, and new clones would be lightweight. I am also not sure how tags would behave, etc. There is an open issue in BFG on that: https://github.com/rtyley/bfg-repo-cleaner/issues/82
BUT given the nature/purpose of bids-examples, I think that pure "3" would be just fine ;)
I also prefer 3.
If "git graft" does not work, it would be nice (but not required) to give instructions for others how to resolve/clean their clones. The easy solution that I would probably take is to delete my fork and all local clones and make a new one. Instructing the 54 people that now have a fork on github (see members) would already help a lot.
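For people who would rather keep an existing clone than delete it, a reset to the rewritten history might look like the sketch below. It assumes the remote is named `origin` and the branch is `master`, and the function name is hypothetical. Any local commits on that branch are discarded, so it is only suitable when there is nothing local worth keeping.

```shell
# Point a local branch at the rewritten upstream history.
# WARNING: discards local commits and uncommitted changes on that branch.
# NOTE: reset_clone_to_rewritten_history is a hypothetical helper name.
reset_clone_to_rewritten_history() {
  remote="${1:-origin}"
  branch="${2:-master}"
  git fetch -q --prune "$remote" &&
    git checkout -q "$branch" &&
    git reset -q --hard "$remote/$branch"
}

# Optionally drop the old, now unreachable objects from the local clone:
#   git reflog expire --expire=now --all && git gc --prune=now
```

A fork on GitHub would still carry the old history in its own branches, which is why deleting and re-forking remains the simpler route.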
Alright, I'll then attempt number 3 (BFG repo cleaner) in the next days.
I already put a branch protection for master in place, so nobody will be able to accidentally pollute the history by pushing to it.
In addition, I agree with @robertoostenveld that it might be a good idea to notify all people who have forked the repository.
Let's see how this goes!
git clone https://github.com/bids-standard/bids-examples.git as the backup
git clone --mirror https://github.com/bids-standard/bids-examples.git as the version to be pruned
java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 1M bids-examples.git (bfg as downloaded from https://rtyley.github.io/bfg-repo-cleaner/)
cd bids-examples.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
so far so good, the .git directories are
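For reference, the steps above can be collected into one function, together with a small helper for comparing the size of the backup and the pruned mirror. This is a sketch, not the exact procedure that was run: both function names are made up, the jar path and URL are passed in as arguments, and the final push is deliberately left out.

```shell
# Size of a repository's git data in KiB: <dir>/.git for a normal clone,
# <dir> itself for a --mirror (bare) clone.
# NOTE: both function names below are hypothetical helpers.
git_data_size_kib() {
  if [ -d "$1/.git" ]; then
    du -sk "$1/.git" | cut -f1
  else
    du -sk "$1" | cut -f1
  fi
}

# Clone twice (backup + mirror), run BFG on the mirror, then expire and gc.
# $1: path to the BFG jar, $2: repository URL ending in .git
prune_large_blobs() {
  bfg_jar="$1"
  url="$2"
  name="$(basename "$url" .git)"
  git clone "$url" "$name" || return 1
  git clone --mirror "$url" "$name.git" || return 1
  java -jar "$bfg_jar" --strip-blobs-bigger-than 1M "$name.git" || return 1
  ( cd "$name.git" &&
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive ) || return 1
  echo "backup: $(git_data_size_kib "$name") KiB"
  echo "pruned: $(git_data_size_kib "$name.git") KiB"
  # Pushing is left as a manual step: cd "$name.git" && git push
}
```

It would be called as `prune_large_blobs bfg-1.13.0.jar https://github.com/bids-standard/bids-examples.git` from an empty working directory.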
the push gave some errors, click on details:
... apparently, this is a known problem, see e.g., this SO post, and this issue on BFG: https://github.com/rtyley/bfg-repo-cleaner/issues/36
The problem is that we cannot clean the history that was introduced to our repo via other remotes (PRs)
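GitHub keeps the refs/pull/* refs read-only, and a --mirror clone fetches them along with everything else, so a mirror push tries to update them and gets rejected for those refs. One workaround that is often suggested is to delete the local copies of those refs in the mirror before pushing. The following is a sketch of that idea; it has not been confirmed in this thread, and the function name is made up.

```shell
# In a --mirror clone, delete the local copies of GitHub's pull request refs
# so that a later push no longer attempts to update them on the server.
# NOTE: drop_pull_refs is a hypothetical helper name.
drop_pull_refs() {
  git for-each-ref --format='delete %(refname)' refs/pull \
    | git update-ref --stdin
}
```

This only changes what the local mirror pushes. The refs on GitHub's side stay as they are.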
I am in favor of 1. or 2.
Any opinions, hints, comments? :-)
I'm in favor of:
Rename this repository to something like bids-examples-old and keep the history for archival purposes.
Start fresh with a new repository that lives at bids-examples.
Create a "what happened to this repository's history?" section that links to the archived version.
Thoughts?
+1
On 12 May 2019, at 20:18, Chris Holdgraf notifications@github.com wrote:
I'm in favor of:
Rename this repository to something like bids-examples-old and keep the history for archival purposes. Start fresh with a new repository that lives at bids-examples. Create a "what happened to this repository's history?" section that links to the archived version. Thoughts?
@choldgraf That sounds reasonable to me.
There was also discussion in #159 about removing some of the example datasets. Does it make sense to do that during the move, too?
@effigies my 2 cents: while I'm a fan of removing some example datasets, I'd decouple that action from the action of cleaning up history and reducing the size of the repo...it's hard enough making decisions in a distributed fashion :-)
TL;DR: BFG worked, repo is now small, history rewritten, everybody delete their forks/clones and make fresh ones
Although I initially liked the suggestion of making a new repo, I got worried because I remembered that this might have some consequences for the bids-validator, which makes use of the releases in this repository.
As a result, I desperately looked at my process again to find issues ... and there you go :man_facepalming:
In https://github.com/bids-standard/bids-examples/issues/119#issuecomment-491607888, it did not work, because the branch protection was set to "on" and the settings didn't allow anyone to push.
On the good side: I disabled branch protection and then the cleaning worked very well. Our git history is now ~80MB, and the repository with all files ~130MB --> feel free to check by making a fresh clone: git clone https://github.com/bids-standard/bids-examples
Another good thing: Our tags and releases are not affected.
The only issue that remains is that the GitHub references of all PRs that ever happened to this repository are still intact (albeit now pointing to a re-written history). The only way to solve it is by contacting GitHub support, asking them to delete those refs ... but I don't see a reason to do that, because we don't have sensitive data in these refs. As long as we don't reopen and merge these old PRs, everything will be fine.
On that note: everybody needs to delete their old forks and clones ... and we need to be extra diligent with merging PRs, always making sure that they are based on our new history.
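One rough way to spot a PR branch that still sits on the old history is to count how many commits it would add on top of the current master. A branch forked before the rewrite tends to show far more commits than its author actually made, because the rewritten commits have new hashes. This is a heuristic sketch only, and the function name is invented for illustration.

```shell
# Number of commits reachable from <branch> but not from <base>.
# Usage: commits_not_in_base origin/master some-pr-branch
# NOTE: commits_not_in_base is a hypothetical helper name.
commits_not_in_base() {
  git rev-list --count "$1..$2"
}
```

A count that roughly matches the number of commits in the PR suggests the branch is based on the new history. A count in the hundreds for a small PR is a reason to ask for a rebase first.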
PS: Next up, I'll make a new release of the bids-examples, which will include the EEG and iEEG data
Feel free to comment if you have concerns, see: https://github.com/bids-standard/bids-examples/issues/158
We should zero the large MEG files and purge the history.