Closed gziolo closed 3 years ago
This command lists files that are larger than 1MB (requires brew install coreutils on OS X):
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | awk '$2 >= 2^20' | sort --numeric-sort --key=2 | gcut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
Results are at https://pastebin.com/hQaYcHwE
Those include files in HEAD, though. I wasn't able to filter them out, but it should be possible.
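One possible way to filter out the files that are still in HEAD is to collect the blob IDs of the HEAD tree first and drop them from the listing. This is only a sketch: the function name and the byte threshold argument are made up for illustration, and the output is the raw "id size path" form without the cut/numfmt formatting.

```shell
# Sketch: list blobs of at least MIN_BYTES that are reachable from some ref
# but are not part of the current HEAD tree.
# "large_blobs_not_in_head" is a hypothetical helper name.
large_blobs_not_in_head() (
  min_bytes="${1:-1048576}"   # default threshold: 1 MiB
  head_blobs="$(mktemp)"

  # Blob IDs in the HEAD tree (third field of "git ls-tree -r" output).
  git ls-tree -r HEAD | awk '{ print $3 }' | sort -u > "$head_blobs"

  # All reachable objects, reduced to "<id> <size> <path>" for blobs, then
  # filtered by size and by absence from the HEAD list, smallest first.
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    awk -v min="$min_bytes" -v headfile="$head_blobs" '
      BEGIN { while ((getline id < headfile) > 0) in_head[id] = 1 }
      $2 >= min && !($1 in in_head)
    ' |
    sort -n -k 2

  rm -f "$head_blobs"
)

# Usage (run inside the repository):
#   large_blobs_not_in_head            # blobs >= 1 MiB that are gone from HEAD
#   large_blobs_not_in_head 102400     # same, with a 100 KiB threshold
```

Blobs are matched by ID, so a file whose exact content still exists in HEAD under another path is excluded as well.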
I suspect there's a long tail of files < 1MB that could also be removed for a significant boost.
Removing files from the history would break the hashes, but might be worth it in this case.
@iandunn, that's very helpful. Thank you for doing a more in-depth investigation.
@hypest, I see a lot of mobile-related files with the highest impact that don't look like source code. Would it be possible to remove some of them from the git history so we could make the Gutenberg repository faster to download?
59MiB diffcheck.txt
This one alone looks like a quick win if we erase the full history.
Aha, I'm not familiar with what that file is, but I'm sure there are savings to be made along those lines. @ceyhun, do you think you can take a look when you get a chance, possibly after HACK week of March 2021? Thanks!
git clone https://github.com/WordPress/gutenberg.git --depth 10 might also be interesting. That clones the repo with just the last 10 commits. It only took ~5 seconds to download on my machine.
I'm guessing someone could use that, then send a PR, and it wouldn't have any problems. I haven't tested that, though. The downside is that people would have to intentionally do it, since it's not the default behavior. Scripts and docs could be updated, though.
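As a rough sketch of that workflow, the two helpers below wrap the shallow clone and the later step of fetching the rest of the history if it turns out to be needed. The function names are hypothetical; the git options themselves (--depth, fetch --unshallow) are standard.

```shell
# Sketch: clone only the most recent commits, and deepen later on demand.
# "shallow_clone" and "unshallow_clone" are hypothetical helper names.
shallow_clone() {
  # $1 = repository URL, $2 = destination directory, $3 = number of commits
  git clone --quiet --depth "${3:-10}" "$1" "$2"
}

unshallow_clone() {
  # $1 = directory of an existing shallow clone; downloads the full history
  git -C "$1" fetch --quiet --unshallow
}

# Usage:
#   shallow_clone https://github.com/WordPress/gutenberg.git gutenberg 10
#   unshallow_clone gutenberg   # only if the full history is needed after all
```

Note that --depth is ignored for plain local paths, so trying this against a local copy needs a file:// URL.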
If we do remove stuff from history, I'd recommend getting rid of everything all at once if possible. Changing the history will break lots of stuff, so we'd probably only want to do it once every few years, at most.
I tested the --depth option. Faster, but it doesn't do much for size. 2.2GB is 2much :scissors: Might consider something like an SVN subtree, using git filter-branch -f --prune-empty --subdirectory-filter to split off the docs (gh-pages) and tests. cc: https://github.com/WordPress/gutenberg/issues/26993#issuecomment-728877480
@gziolo I found the commits for this file using the git log --all --full-history -- diffcheck.txt command. This seems to be the latest one, which deletes it: https://github.com/WordPress/gutenberg/commit/397e645d9994a528b13f14d68341c32414a374ad. It looks like the output of a git diff. I think we can safely erase this.
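For reference, the lookup can be wrapped in two small helpers: one that lists every commit touching a path on any ref, and one that narrows the result to the commits that deleted it. The helper names and the output format are assumptions made for this sketch; --diff-filter=D is a standard git log option.

```shell
# Sketch: find the commits that touched (or deleted) a given path.
# "commits_touching" and "commits_deleting" are hypothetical helper names.
commits_touching() {
  # Every commit on any ref that added, modified, or deleted the path.
  git log --all --full-history --format='%H %s' -- "$1"
}

commits_deleting() {
  # Only the commits in which the path was deleted.
  git log --all --full-history --diff-filter=D --format='%H %s' -- "$1"
}

# Usage:
#   commits_touching diffcheck.txt
#   commits_deleting diffcheck.txt
```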
@hypest I also checked the pastebin and saw a lot of mobile bundles and binaries. I think it's also fine to erase these ones as we're not using them and they can also be regenerated if needed:
bundle/android/App.js
bundle/android/App.js.map
bundle/ios/App.js
bundle/ios/App.js.map
test/native/gutenberg-mobile-demo-app.apk
ios/Gutenberg.app.zip
@ceyhun, this is a great finding. How can we perform this cleanup? Is it something you could do yourself?
+1 for removing the Android ones, the APK and the app.zip, no questions asked.
For the iOS ones though, it will probably make trying out/debugging older WPiOS versions harder, right? Recreating them will probably be cumbersome. I actually don't like that we had to commit the JS bundles at all so, if you feel confident about iOS debugging without having the bundles readymade @ceyhun then I'm +1.
@gziolo I'm not really sure what git magic is needed for this to happen. I also do not consider myself a git magician, so any help would be appreciated.
@hypest I think WPiOS was always using the bundles on gutenberg-mobile repo and seems like that one goes back as far as 2018, so maybe that's enough?
@gziolo @ceyhun based on this SO answer, git filter-branch should allow you to completely remove those files from the repo history: https://stackoverflow.com/questions/43762338/how-to-remove-file-from-git-history#43762489
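A minimal sketch of that approach, assuming it is run on a throwaway clone first: the first helper rewrites every branch and tag so the given paths never existed, and the second drops the backups that filter-branch keeps so the space is actually reclaimed locally. Both function names are hypothetical, and the paths are spliced into the filter command as-is, so they must not contain whitespace or shell metacharacters.

```shell
# Sketch: remove paths from the whole history with git filter-branch.
# WARNING: this changes commit hashes for every rewritten commit.
# "purge_paths_from_history" and "drop_rewrite_backups" are hypothetical names.
purge_paths_from_history() (
  [ "$#" -gt 0 ] || { echo "usage: purge_paths_from_history PATH..." >&2; exit 2; }
  # The index filter runs once per commit and unstages the given paths;
  # --prune-empty drops commits that become empty, and
  # --tag-name-filter cat moves existing tags to the rewritten commits.
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter "git rm -r --cached --ignore-unmatch --quiet -- $*" \
    --prune-empty --tag-name-filter cat -- --all
)

drop_rewrite_backups() {
  # filter-branch keeps the old refs under refs/original/; delete them,
  # expire the reflogs, and garbage-collect so the old blobs disappear.
  git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
  git reflog expire --expire=now --all
  git gc --quiet --prune=now --aggressive
}

# Usage:
#   purge_paths_from_history diffcheck.txt bundle/android/App.js
#   drop_rewrite_backups
```

The rewritten refs still have to be force-pushed, and existing clones keep the old objects until they are re-cloned.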
We discussed options on WordPress Slack in the #meta channel (link requires registration at https://make.wordpress.org/chat/): https://wordpress.slack.com/archives/C02QB8GMM/p1616519854024400
@dd32 shared the following:
Playing with git rev-list --disk-usage shows that the gh-pages branch is 10x the next:
26.67MB refs/tags/@wordpress/warning@1.4.0
26.67MB refs/tags/@wordpress/wordcount@2.15.0
112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node
213.09MB refs/tags/rnmobile/monorepo-commit-history
2053.26MB refs/remotes/origin/gh-pages
I started with the first step and rewrote the history of the gh-pages branch:
https://github.com/WordPress/gutenberg/commits/gh-pages
It looks like it mostly generates new bundle files for the Storybook instance available at https://wordpress.github.io/gutenberg/.
Can you check if we can remove the mobile branches listed completely?
Oh, right @ceyhun. I don't think WPiOS was ever using the bundle directly from Gutenberg's repo, only from gutenberg-mobile. I see what you mean now so yeah, no need for the native mobile (RN) bundle inside Gutenberg's repo.
@gziolo I went ahead and deleted the following mobile branches:
112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node
But I'm not sure about deleting this tag: rnmobile/monorepo-commit-history. We kept it so we can view the gutenberg-mobile git history from before the monorepo merge. I suppose it's also the tag/branch where most of the large bundle/android/App.js and bundle/ios/App.js files live. It would be nice if we could rewrite history in that branch to not include the bundle files, but I'm not sure how it can be done and I can imagine that it could be a complex task.
Also on second thought, I think we can use gutenberg-mobile to view git history before monorepo as well. It would be harder to search and find a specific file from gutenberg repo in gutenberg-mobile back again just for its history, but I think it's possible. I also don't remember using the rnmobile/monorepo-commit-history tag before to check the history of a file, and I think after monorepo I modified many files from RN Bridge, RN Aztec code and E2E tests which were in gutenberg-mobile before monorepo. Any thoughts @hypest?
Good point Ceyhun. The commit history is indeed available in gutenberg-mobile's repo, but I think it's quite hard to connect the dots as that repo has also moved on. All in all, I'd prefer if we keep the rnmobile/monorepo-commit-history tag for some more time. Anecdotally, I did use that branch a couple of weeks ago while trying to understand the code history of how selection messages get triggered on the Aztec wrapper on Android (to fix an important regression).
It looks like the changes applied so far had an impressive impact on the repository size:
Do you think we can further decrease the size or is it fine to close this issue for now?
We're thinking of keeping a fork of gutenberg just for the rnmobile/monorepo-commit-history tag and maybe we can delete it here then. It would be worth keeping this open a little while longer while we figure this out.
Thanks @mchowning for coming up with that idea!
We can wait a few more weeks, no worries. The smaller size of the download necessary to clone the repository is worth it.
Thank you for all the help so far.
@gziolo just created a fork wordpress-mobile/gutenberg-rnmobile-monorepo-commit-history to keep the history and deleted the rnmobile/monorepo-commit-history tag. Seems like this lowered the size even more:
This is great. The only remaining task would be to improve the GitHub workflow that uses gh-pages to update Storybook so that it always recreates the branch from scratch and discards its history.
@ockham, how much work would it be to run something like this on the gh-pages branch in a GitHub workflow:
git checkout --orphan latest_branch
git add -A
git commit -am "Initial commit message" # Commit the changes
git branch -D master # Delete the master branch
git branch -m master # Rename the current branch to master
git push -f origin master # Force-push to the master branch
git gc --aggressive --prune=all # Remove the old files
I don't remember exactly what I used before, but it was similar and it removed all git history for gh-pages. Ideally, we would run it every time we update Storybook. The alternative would be to use another repository.
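The quoted commands target master; adapted for an arbitrary branch such as gh-pages, the same idea might look like the sketch below. The function name and the temporary branch name are made up for illustration, and the push is left as a separate manual step.

```shell
# Sketch: replace BRANCH with a single parentless commit that holds the
# current working tree. Assumes BRANCH is currently checked out.
# "squash_branch_to_single_commit" and "squash_tmp" are hypothetical names.
squash_branch_to_single_commit() (
  branch="$1"
  message="${2:-Initial commit}"
  git checkout --quiet --orphan squash_tmp &&   # new branch with no history
    git add -A &&                               # stage the whole working tree
    git commit --quiet -m "$message" &&         # the only commit on the branch
    git branch -D "$branch" > /dev/null &&      # drop the old branch and its history
    git branch -m "$branch"                     # take over the old branch name
)

# Usage:
#   git checkout gh-pages
#   squash_branch_to_single_commit gh-pages "Deploy Storybook"
#   git push -f origin gh-pages
```

The old commits stay on the remote until the force-push, and locally until the reflog expires and git gc runs.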
Looks like it shouldn't be too much work; basically, any workflow that uses @actions/checkout automatically gets a GH token that enables it to perform git operations. Would we want to add that to the .github/workflows/storybook-pages.yml workflow?
For me, the bigger question seems to be if we really want to routinely rewrite the history of our gh-pages branch. Which brings us to your alternative suggestion...
The alternative would be to use another repository.
Wouldn't that maybe make more sense? If we've identified that the gh-pages branch is too big... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?
Would we want to add that to the .github/workflows/storybook-pages.yml workflow?
Yes.
... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?
It was discussed as well. Whatever works best here.
I'm leaning towards the latter, TBH. Seems fairly straightforward. The main questions are probably if creating a new wordpress.github.io repo (at org level) will collide with the existing wordpress.github.io/gutenberg/ pages (at GB repo level, created from the gh-pages branch); and if we'll be able to retain the /gutenberg path somehow.
Looks like we might even be able to continue using the same GH action we're using now: It supports both deploying to a different repo, and to a subdir (not entirely sure if those can be combined). For the different repo, we need a personal access token (rather than GITHUB_TOKEN), but we can simply use one for the @gutenbergplugin user account.
Oh, I just noticed that if we wanna go ahead with pruning the history of the gh-pages branch instead, the GH action might support that as well OOTB.
I see https://github.com/peaceiris/actions-gh-pages#%EF%B8%8F-force-orphan-force_orphan. This is exactly what we want, and it makes this approach so much easier. I will merge directly to trunk and see if it works. Great discovery @ockham!
It worked with https://github.com/WordPress/gutenberg/commit/d4bef28c06edd38d97dc0f6f649dbf8cf248c321:
We are now at 200-ish MB, which is 10% of the initial size:
Many thanks to everyone involved.
Similar to #26993.
What problem does this address?
It takes ages to finish:
At the moment the size of the repository is over 2GB!!!!
If you add to the mix that you need to run on every brand new repository:
It adds another 1GB of data that needs to be downloaded as reported in #26993.
What is your proposed solution?
It makes me think that maybe the gh-pages branch is one of the reasons why the size of the repository has grown so much. We replace the content of gh-pages with the new build of Storybook on every commit to the main branch.
I don't know how this sort of issue is usually solved in git-based repositories, but the comment from WordPress Slack (link requires registration at https://make.wordpress.org/chat/) authored by @ocean90 should be a good start:
https://wordpress.slack.com/archives/C5UNMSU4R/p1609864617204200?thread_ts=1609770083.149700&cid=C5UNMSU4R
The link included: https://medium.com/@sangeethkumar.tvm.kpm/cleaning-up-a-git-repo-for-reducing-the-repository-size-d11fa496ba48
This article contains some techniques that could help with gh-pages, where we don't care about history at all. There are also several interesting references to other similar articles that try to address similar issues.