WordPress / gutenberg

The Block Editor project for WordPress and beyond. Plugin is available from the official repository.
https://wordpress.org/gutenberg/

Contributing: Decrease the size of the source code needed to be downloaded #29008

Closed gziolo closed 3 years ago

gziolo commented 3 years ago

Similar to #26993.

What problem does this address?

It takes ages to finish:

git clone git@github.com:WordPress/gutenberg.git

At the moment the size of the repository is over 2GB!!!!

Screen Shot 2021-02-11 at 15 17 11

If you add to the mix that you need to run on every brand new repository:

npm install

It adds another 1GB of data that needs to be downloaded as reported in #26993.

What is your proposed solution?

It makes me think that maybe gh-pages branch is one of the reasons why the size of the repository has grown so much. We replace the content of gh-pages with the new build of Storybook on every commit to the main branch.

I don't know how these sorts of issues are usually solved in git-based repositories, but the comment from WordPress Slack (link requires registration at https://make.wordpress.org/chat/) authored by @ocean90 should be a good start:

https://wordpress.slack.com/archives/C5UNMSU4R/p1609864617204200?thread_ts=1609770083.149700&cid=C5UNMSU4R

Yes by creating a new orphan branch from gh-pages. You have to add the files there and the gh-pages branch needs to be deleted. Then rename the new branch to gh-pages which finally gets force-pushed. This post documents the steps

The link included: https://medium.com/@sangeethkumar.tvm.kpm/cleaning-up-a-git-repo-for-reducing-the-repository-size-d11fa496ba48

This article contains some techniques that could help with gh-pages where we don't care about history at all. There are also several interesting references to other similar articles that try to address similar issues.

iandunn commented 3 years ago

This command lists files that are larger than 1MB (requires brew install coreutils on OS X):

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | awk '$2 >= 2^20' | sort --numeric-sort --key=2 | gcut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Results are at https://pastebin.com/hQaYcHwE

Those include files in HEAD, though. I wasn't able to filter them out, but it should be possible.

I suspect there's a long tail of files < 1MB that could also be removed for a significant boost.

Removing files from the history would break the hashes, but might be worth it in this case.
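One possible way to filter out the files that are still in HEAD, as a rough sketch: collect the blob IDs from the current HEAD tree and drop them from the list. The function name `large_blobs_outside_head` and the temporary file are made up for illustration, and sizes are printed as raw bytes instead of being formatted with numfmt.

```shell
# Hypothetical helper: list blobs of at least MIN_BYTES (default 1 MiB) that
# appear somewhere in history but are not part of the current HEAD tree.
# Output: "<object id> <size in bytes> <path>", smallest first.
large_blobs_outside_head() {
  local min_bytes="${1:-1048576}"
  local head_blobs
  head_blobs="$(mktemp)"

  # Object IDs of every blob in the HEAD tree (third field of ls-tree output).
  git ls-tree -r HEAD | awk '$2 == "blob" { print $3 }' > "$head_blobs"

  # Keep large blobs, strip the "blob " prefix, then skip IDs recorded above.
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '$1 == "blob" && $3 >= min' |
    cut -d ' ' -f 2- |
    awk -v headfile="$head_blobs" '
      BEGIN { while ((getline id < headfile) > 0) in_head[id] = 1 }
      !($1 in in_head)
    ' |
    sort -k 2,2n

  rm -f "$head_blobs"
}
```

Running `large_blobs_outside_head` with no argument uses the same 1MB threshold as above; passing a smaller byte count (for example `large_blobs_outside_head 102400`) should surface the long tail as well.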

gziolo commented 3 years ago

@iandunn, that's very helpful. Thank you for doing a more in-depth investigation.

Results are at https://pastebin.com/hQaYcHwE

@hypest, I see a lot of mobile-related files with the highest impact that don't look like source code. Would it be possible to remove some of them from the git history so we could make the Gutenberg repository faster to download?

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

hypest commented 3 years ago

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

Aha, I'm not familiar with what that file is but I'm sure there are savings to be made along those lines. @ceyhun , you think you can take a look when you get some chance, possibly after HACK week of March 2021? Thanks!

iandunn commented 3 years ago

git clone https://github.com/WordPress/gutenberg.git --depth 10 might also be interesting. That clones the repo with just the last 10 commits. It only took ~5 seconds to download on my machine.

I'm guessing someone could use that, then send a PR, and it wouldn't have any problems. I haven't tested that, though. The downside is that people would have to intentionally do it, since it's not the default behavior. Scripts and docs could be updated, though.

If we do remove stuff from history, I'd recommend getting rid of everything all at once if possible. Changing the history will break lots of stuff, so we'd probably only want to do it once every few years, at most.
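For anyone who wants to try the shallow clone approach, here is a minimal sketch. The wrapper name `shallow_clone` is hypothetical; the only real pieces are `git clone --depth` and `git fetch --unshallow`. Note that `--depth` implies `--single-branch`, so other branches are not fetched by default.

```shell
# Hypothetical wrapper: clone only the most recent commits of a repository.
# Usage: shallow_clone <url> <directory> [depth]   (depth defaults to 10)
shallow_clone() {
  git clone --depth "${3:-10}" "$1" "$2"
}

# If the full history turns out to be needed later, it can be fetched with:
#   git -C <directory> fetch --unshallow
```

For example, `shallow_clone https://github.com/WordPress/gutenberg.git gutenberg` would be the equivalent of the command above.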

windhamdavid commented 3 years ago

I tested the --depth. Faster but doesn't do much for size. 2.2GB is 2much :scissors: Might consider something like SVN subtree using a filter-branch -f --prune-empty --subdirectory-filter and split off the docs(gh-pages) and tests. cc: https://github.com/WordPress/gutenberg/issues/26993#issuecomment-728877480

ceyhun commented 3 years ago

59MiB diffcheck.txt

This one alone looks like a quick win if we erase the full history.

Aha, I'm not familiar with what that file is but I'm sure there are savings to be made along those lines. @ceyhun , you think you can take a look when you get some chance, possibly after HACK week of March 2021? Thanks!

@gziolo I found the commits for this file using the git log --all --full-history -- diffcheck.txt command. This seems to be the latest one, which deletes it: https://github.com/WordPress/gutenberg/commit/397e645d9994a528b13f14d68341c32414a374ad. The file seems to be the output of a git diff. I think we can safely erase this.

@hypest I also checked the pastebin and saw a lot of mobile bundles and binaries. I think it's also fine to erase these ones as we're not using them and they can also be regenerated if needed:

bundle/android/App.js
bundle/android/App.js.map
bundle/ios/App.js
bundle/ios/App.js.map
test/native/gutenberg-mobile-demo-app.apk
ios/Gutenberg.app.zip

gziolo commented 3 years ago

@ceyhun, this is a great finding. How can we perform this cleanup? Is it something you could do yourself?

hypest commented 3 years ago

@hypest I also checked the pastebin and saw a lot of mobile bundles and binaries. I think it's also fine to erase these ones as we're not using them and they can also be regenerated if needed

+1 for removing the Android ones, the APK and the app.zip, no questions asked.

For the iOS ones though, it will probably make trying out/debugging older WPiOS versions harder, right? Recreating them will probably be cumbersome. I actually don't like that we had to commit the JS bundles at all so, if you feel confident about iOS debugging without having the bundles readymade @ceyhun then I'm +1.

ceyhun commented 3 years ago

@ceyhun, this is a great finding. How can we perform this cleanup? Is it something you could do yourself?

@gziolo I'm not really sure what git magic is needed for this to happen 🪄 I also do not consider myself a git magician 😃 So any help would be appreciated.

For the iOS ones though, it will probably make trying out/debugging older WPiOS versions harder, right? Recreating them will probably be cumbersome. I actually don't like that we had to commit the JS bundles at all so, if you feel confident about iOS debugging without having the bundles readymade @ceyhun then I'm +1.

@hypest I think WPiOS was always using the bundles on gutenberg-mobile repo and seems like that one goes back as far as 2018, so maybe that's enough?

jonathanbossenger commented 3 years ago

@gziolo @ceyhun based on this SO answer, git filter-branch should allow you to completely remove those files from the repo history https://stackoverflow.com/questions/43762338/how-to-remove-file-from-git-history#43762489
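A rough sketch of what that could look like with `git filter-branch`, wrapped in a hypothetical helper named `purge_paths_from_history`. This is an illustration under assumptions, not a tested procedure for this repository: it rewrites every ref (so all commit hashes change), it should only be run on a fresh clone, and it does not handle paths containing spaces. The Git documentation itself recommends the separate git-filter-repo tool over filter-branch for this kind of job.

```shell
# Hypothetical helper: remove the given paths from every commit on every ref,
# then delete the backup refs and unreachable objects so the space is freed.
# Usage: purge_paths_from_history <path>...
purge_paths_from_history() {
  # Rewrite all refs, dropping the paths from each commit's index.
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
    --index-filter "git rm -r --cached --ignore-unmatch --quiet -- $*" \
    --prune-empty --tag-name-filter cat -- --all || return 1

  # filter-branch keeps the old commits under refs/original/; remove them.
  git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done

  # Expire reflogs and repack so the removed blobs are actually deleted.
  git reflog expire --expire=now --all
  git gc --quiet --prune=now --aggressive
}
```

Something like `purge_paths_from_history diffcheck.txt` would then be followed by a force push of every rewritten branch and tag, which is the part that breaks existing clones and open PRs.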

gziolo commented 3 years ago

We discussed options on WordPress Slack in the #meta channel (link requires registration at https://make.wordpress.org/chat/): https://wordpress.slack.com/archives/C02QB8GMM/p1616519854024400

@dd32 shared the following:

Playing with git rev-list --disk-usage shows that the gh-pages branch is 10x the next:

26.67MB refs/tags/@wordpress/warning@1.4.0
26.67MB refs/tags/@wordpress/wordcount@2.15.0
112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node
213.09MB refs/tags/rnmobile/monorepo-commit-history
2053.26MB refs/remotes/origin/gh-pages

I started with the first step and rewrote the history of gh-pages branch: https://github.com/WordPress/gutenberg/commits/gh-pages

It looks like it mostly generates new bundle files for the Storybook instance available at https://wordpress.github.io/gutenberg/.

Can you check if we can remove the mobile branches listed completely?

hypest commented 3 years ago

@hypest I think WPiOS was always using the bundles on gutenberg-mobile repo and seems like that one goes back as far as 2018, so maybe that's enough?

Oh, right @ceyhun. I don't think WPiOS was ever using the bundle directly from Gutenberg's repo, only from gutenberg-mobile. I see what you mean now so yeah, no need for the native mobile (RN) bundle inside Gutenberg's repo 👍.

ceyhun commented 3 years ago

Can you check if we can remove the mobile branches listed completely?

@gziolo I went ahead and deleted the following mobile branches:

112.10MB refs/remotes/origin/import-gutenberg-mobile
117.89MB refs/remotes/origin/rnmobile/import-mobile-lint
119.86MB refs/remotes/origin/rnmobile/import-mobile-fix-ci
158.28MB refs/remotes/origin/rnmobile/try-fix-android-build
169.98MB refs/remotes/origin/feat/import-gutenberg-mobile-no-squash-E2E-TESTS-fix-ios-ci
182.22MB refs/remotes/origin/rnmobile/experiment-monorepo-new-setup-update-node

But I'm not sure about deleting this tag: rnmobile/monorepo-commit-history. We kept it so we can view gutenberg-mobile git history from before monorepo merge. I suppose it's also the tag/branch where most of the large bundle/android/App.js and bundle/ios/App.js files live in. It would be nice if we can rewrite history in that branch to not include the bundle files, but I'm not sure how it can be done and I can imagine that it could be a complex task.

Also on second thought, I think we can use gutenberg-mobile to view git history before monorepo as well. It would be harder to search and find a specific file from gutenberg repo in gutenberg-mobile back again just for its history, but I think it's possible. I also don't remember using rnmobile/monorepo-commit-history tag before to check the history of a file, and I think after monorepo I modified many files from RN Bridge, RN Aztec code and E2E tests which were in gutenberg-mobile before monorepo. Any thoughts @hypest?

hypest commented 3 years ago

Also on second thought, I think we can use gutenberg-mobile to view git history before monorepo as well. It would be harder to search and find a specific file from gutenberg repo in gutenberg-mobile back again just for its history, but I think it's possible. I also don't remember using rnmobile/monorepo-commit-history tag before to check the history of a file, and I think after monorepo I modified many files from RN Bridge, RN Aztec code and E2E tests which were in gutenberg-mobile before monorepo. Any thoughts @hypest?

Good point Ceyhun. The commit history is indeed available in gutenberg-mobile's repo, but I think it's quite hard to connect the dots as that repo has also moved on. All in all, I'd prefer if we keep the rnmobile/monorepo-commit-history for some more time. Anecdotally, I did use that branch a couple of weeks ago while trying to understand the code history of how selection messages get triggered on the Aztec wrapper on Android (to fix an important regression).

gziolo commented 3 years ago

It looks like the changes applied so far had an impressive impact on the repository size:

Screen Shot 2021-03-26 at 12 34 25

Do you think we can further decrease the size or is it fine to close this issue for now?

ceyhun commented 3 years ago

Do you think we can further decrease the size or is it fine to close this issue for now?

We're thinking of keeping a fork of gutenberg just for the rnmobile/monorepo-commit-history tag and maybe we can delete it here then. It would be worth keeping this open a little while longer while we figure this out.

Thanks @mchowning for coming up with that idea!

gziolo commented 3 years ago

We can wait a few more weeks, no worries. The smaller size of the download necessary to clone the repository is worth it 😄

Thank you for all the help so far 🙇🏻

ceyhun commented 3 years ago

@gziolo just created a fork wordpress-mobile/gutenberg-rnmobile-monorepo-commit-history to keep the history and deleted the rnmobile/monorepo-commit-history tag. Seems like this lowered the size even more:

gutenberg-clone

gziolo commented 3 years ago

This is great. The only remaining task would be to improve the GitHub workflow that uses gh-pages to update Storybook to always recreate the branch from scratch to ignore its history.

gziolo commented 3 years ago

@ockham, how much work would it be to run something like the following on the gh-pages branch in a GitHub workflow:

git checkout --orphan latest_branch
git add -A
git commit -am "Initial commit message" # commit the changes
git branch -D master # delete the master branch
git branch -m master # rename the current branch to master
git push -f origin master # force-push to the master branch
git gc --aggressive --prune=all # remove the old files

I don't remember what I used exactly before, but it was similar and it removed all git history for gh-pages, and ideally we would run it every time we update Storybook. The alternative would be to use another repository.
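Adapted to an arbitrary branch such as gh-pages, the same steps might look like the sketch below. The function name `squash_branch_to_orphan` and the temporary branch suffix are hypothetical, and this is not the workflow that ended up being used. Note that `git add -A` also picks up untracked files in the working tree, so it assumes a clean checkout.

```shell
# Hypothetical helper: replace a branch with a single parentless commit that
# contains the branch's current files, discarding all earlier history.
# Usage: squash_branch_to_orphan <branch> [commit message]
squash_branch_to_orphan() {
  local branch="$1"
  local message="${2:-Initial commit}"

  git checkout --quiet "$branch" &&
    # New branch with no parent commits; the index and files are kept as-is.
    git checkout --quiet --orphan "${branch}-orphan-tmp" &&
    git add -A &&
    git commit --quiet -m "$message" &&
    # Drop the old branch and give the new one its name.
    git branch --quiet -D "$branch" &&
    git branch -m "$branch"
}

# Publishing the result needs a force push, for example:
#   git push --force origin gh-pages
```

Calling `squash_branch_to_orphan gh-pages "Storybook build"` leaves gh-pages with exactly one commit locally; the remote only shrinks after the force push and after the hosting side garbage-collects the old objects.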

ockham commented 3 years ago

@ockham, how much work would it be to run something like the following on the gh-pages branch in a GitHub workflow:

git checkout --orphan latest_branch
git add -A
git commit -am "Initial commit message" # commit the changes
git branch -D master # delete the master branch
git branch -m master # rename the current branch to master
git push -f origin master # force-push to the master branch
git gc --aggressive --prune=all # remove the old files

I don't remember what I used exactly before, but it was similar and it removed all git history for gh-pages, and ideally we would run it every time we update Storybook.

Looks like it shouldn't be too much work; basically, any workflow that uses @actions/checkout automatically gets a GH token that enables it to perform git operations. Would we want to add that to the .github/workflows/storybook-pages.yml workflow?

For me, the bigger question seems to be if we really want to routinely rewrite the history of our gh-pages branch 🤔 Which brings us to your alternative suggestion...

The alternative would be to use another repository.

Wouldn't that maybe make more sense? If we've identified that:

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

gziolo commented 3 years ago

Would we want to add that to the .github/workflows/storybook-pages.yml workflow?

Yes.

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

ockham commented 3 years ago

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

I'm leaning towards the latter, TBH. Seems fairly straight-forward. The main questions are probably if creating a new wordpress.github.io repo (at org level) will collide with the existing wordpress.github.io/gutenberg/ pages (at GB repo level, created from the gh-pages branch); and if we'll be able to retain the /gutenberg path somehow 🙂

ockham commented 3 years ago

... why not keep things nicely separated, create a dedicated wordpress.github.io repository, and have the workflow deploy to that?

It was discussed as well. Whatever works best here 😄

I'm leaning towards the latter, TBH. Seems fairly straight-forward. The main questions are probably if creating a new wordpress.github.io repo (at org level) will collide with the existing wordpress.github.io/gutenberg/ pages (at GB repo level, created from the gh-pages branch); and if we'll be able to retain the /gutenberg path somehow 🙂

Looks like we might even be able to continue using the same GH action we're using now: It supports both deploying to a different repo, and to a subdir (not entirely sure if those can be combined). For the different repo, we need a personal access token -- rather than GITHUB_TOKEN -- but we can simply use one for the @gutenbergplugin user account.


Oh, I just noticed that if we wanna go ahead with pruning the history of the gh-pages branch instead, the GH action might support that as well OOTB.

gziolo commented 3 years ago

I see https://github.com/peaceiris/actions-gh-pages#%EF%B8%8F-force-orphan-force_orphan. This is exactly what we want and it makes it so much easier to approach this way. I will merge directly to trunk and see if it works. Great discovery @ockham!

gziolo commented 3 years ago

It worked with https://github.com/WordPress/gutenberg/commit/d4bef28c06edd38d97dc0f6f649dbf8cf248c321:

Screen Shot 2021-04-23 at 08 33 28

We are now at 200-ish MB, which is 10% of the initial size:

Screen Shot 2021-04-23 at 08 59 42
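For anyone who wants to check the number on a local clone rather than from the screenshot, a small sketch using `git count-objects`. The helper name `repo_pack_size_kib` is made up, and the figure is the size of the local packfiles, which only approximates what gets downloaded on a fresh clone.

```shell
# Hypothetical helper: print the size of the packed objects in the current
# repository in KiB, taken from the "size-pack" line of git count-objects.
repo_pack_size_kib() {
  git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}
```

`git count-objects -vH` prints the same information with human-readable units.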

Many thanks to everyone involved.