boltsparts / BOLTS_archive

BOLTS is an open library of technical specifications
GNU General Public License v3.0

Strip downloads from the git repo #163

Open KOLANICH opened 8 years ago

KOLANICH commented 8 years ago

Don't store large binary files in the git repo. Every time you make a commit, every file in your repo is copied. That's why it is 184 MiB now. Use GitHub Releases to store downloads, and use git-lfs to store large files and blobs if they are strictly needed.

berndhahnebach commented 4 years ago

Since we have started to use GitHub Releases, this can be closed.

KOLANICH commented 4 years ago

Could you strip all the binaries from the history?

berndhahnebach commented 4 years ago

On the one hand this would make sense, but on the other it would change all commit ids. That is not good behavior for an open source repository, as it would mean doing a force push. But I must admit it is a problem, thus I reopen the issue.

Maybe a separate branch?

Or we retire this repo and use a new one with rewritten history.

KOLANICH commented 4 years ago

Just put up a message instructing users to rebase their patches manually using git format-patch and git am.
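
A minimal sketch of that workflow, assuming a contributor has a single local commit on top of the old history. The two throwaway repositories, the file names and the demo identity are hypothetical stand-ins, not the real BOLTS layout.

```shell
set -eu
# Throwaway demo; every path, file name and identity here is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "old" stands in for a clone of the bloated history plus one local commit.
git init -q "$work/old"
cd "$work/old"
echo "code" > part.txt
echo "binary" > big.bin
git add .
git commit -q -m "initial import"
echo "fix" >> part.txt
git commit -q -a -m "local fix"

# Export the local commit as a patch file.
mkdir "$work/patches"
git format-patch -1 -o "$work/patches" HEAD >/dev/null

# "new" stands in for the rewritten upstream: same code, no big.bin, new ids.
git init -q "$work/new"
cd "$work/new"
echo "code" > part.txt
git add .
git commit -q -m "initial import (rewritten)"

# Replay the patch on top of the rewritten history.
git am -q "$work/patches"/*.patch
subject=$(git log -1 --format=%s)
echo "$subject"
```

The patch carries the author, message and diff, so it applies even though no commit id is shared between the two histories.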

berndhahnebach commented 4 years ago

Do you mean rewrite history and give the instructions you mentioned?

BTW: there are still lots of unneeded megabytes in the repo ... https://github.com/boltsparts/BOLTS/tree/b47ae5fb53b4975320867909cfd0de2641f6bf15/output This is even the website. The new one (which looks exactly like the old) is generated as a BOLTS backend too. It can be found here: https://github.com/boltsparts/boltsparts.github.io

KOLANICH commented 4 years ago

> Do you mean rewrite history and give the instructions you mentioned?

Yes, really. Rebasing some patches manually is a minor inconvenience; a bloated repo is a major one.

berndhahnebach commented 4 years ago

I am involved in the FreeCAD project. In such a project I would never ever think for a second about rewriting the history of the main repo's master branch. But BOLTS is in a situation with no PRs ATM and little traffic. We do not have any development or release branches in the repo. You may be right. We will never ever get a better chance to do it.

I will keep you informed.

bernd

berndhahnebach commented 4 years ago

BTW: The cloned repo is 166 MB, whereas the real code is still 94 MB and the .git is 71 MB. That means we will not save very much.
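
For reference, numbers like these can be gathered roughly as follows. This is only a sketch on a hypothetical one-file demo repo; on a real clone, only the du and git count-objects lines are needed.

```shell
set -eu
# Throwaway demo repo; the path, file name and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$work/repo"
cd "$work/repo"
echo "code" > part.txt
git add .
git commit -q -m "initial import"

# Disk usage of the object database and of the whole checkout (including .git).
du -sh .git
du -sh .

# Git's own accounting: "size" covers loose objects, "size-pack" covers packs (KiB).
stats=$(git count-objects -v)
echo "$stats"
```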

berndhahnebach commented 4 years ago

Ahh OK, there are 61 MByte of binary data in downloads. I have done BOLTS dev for years and never realized this. I must admit I only noticed it a few seconds ago and it disturbs me ...

@johannes: I would probably have done exactly the same 7 years ago with the knowledge I had at that time :-)

berndhahnebach commented 4 years ago

OK the code is 33 MB whereas the drawings are 9.5 MB and the website backend is 21.5 MB

johannes commented 4 years ago

Git actually is quite good at avoiding copies and merging similar objects. But yeah, keeping larger files out reduces clone and push times, which is great. Unfortunately, getting files out requires rewriting history, which means all clones are invalid ...

Anyways, I have no idea about this project and was probably highlighted by mistake :-) (unsubscribed now, so please don't @ me again)

berndhahnebach commented 4 years ago

gave it a try ...

# information
https://myopswork.com/how-remove-files-completely-from-git-repository-history-47ed3e0c4c35
https://stackoverflow.com/questions/6403601/purging-file-from-git-repo-failed-unable-to-create-new-backup

# command and test
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path_to_file" HEAD
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch downloads/freecad/BOLTS_FreeCAD_0.2_gpl3.tar.gz" HEAD

# **************************************************************************************************
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch downloads/*" HEAD
rm -rf .git/refs/original
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch output/*" HEAD
rm -rf .git/refs/original

But .git still has 75 MB in the file explorer. du even shows 100 MB ...

berndhahnebach commented 4 years ago

Deleting unreferenced blobs with

git gc --aggressive --prune=all

from https://stackoverflow.com/questions/1904860/how-to-remove-unreferenced-blobs-from-my-git-repo/14728706

brings .git down to 65 MB, both in the file manager and according to du. That means the whole repo is still 99 MB = 33 MB code and 65 MB .git
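
git gc only removes objects that nothing references any more. After a filter-branch run, the backup refs under refs/original and the reflog entries still point at the pre-rewrite commits. The following is a sketch on a hypothetical demo repo. Whether the reflog was what kept objects alive in the real repo is an assumption, and the downloads directory here only mimics the real layout.

```shell
set -eu
# Throwaway demo repo; paths, file names and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip filter-branch's warning pause

git init -q "$work/repo"
cd "$work/repo"
mkdir downloads
head -c 100000 /dev/urandom > downloads/big.tar.gz
echo "code" > part.txt
git add .
git commit -q -m "initial import"
blob=$(git rev-parse HEAD:downloads/big.tar.gz)

# Same kind of rewrite as above, applied to the demo directory.
git filter-branch -f --index-filter \
  "git rm -rf -q --cached --ignore-unmatch downloads" HEAD >/dev/null 2>&1

# Drop everything that still points at the pre-rewrite commits ...
git for-each-ref --format='%(refname)' refs/original |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
# ... and only then collect garbage.
git gc -q --prune=now

# Check whether the large blob is still in the object database.
if git cat-file -e "$blob" 2>/dev/null; then
  status="blob still present"
else
  status="blob removed"
fi
echo "$status"
```

Using git update-ref -d instead of rm -rf .git/refs/original also covers backup refs that have been packed into .git/packed-refs.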

berndhahnebach commented 4 years ago

Pushed it to a new repo on my GitHub ... https://github.com/berndhahnebach/stripedbolts

When I clone this one I have still 33.8 MB code but only 18.1 MB .git = 51.9 MB
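
One plausible reason the fresh clone is so much smaller: a clone over the regular transport only receives objects reachable from the branches and tags it fetches, so leftovers from before the rewrite stay behind in the source repo. A sketch on a hypothetical demo repo (all names are made up; --no-local forces the regular transport even for a local path):

```shell
set -eu
# Throwaway demo repo; paths, file names and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1  # skip filter-branch's warning pause

git init -q "$work/repo"
cd "$work/repo"
mkdir downloads
head -c 100000 /dev/urandom > downloads/big.tar.gz
echo "code" > part.txt
git add .
git commit -q -m "initial import"
blob=$(git rev-parse HEAD:downloads/big.tar.gz)

git filter-branch -f --index-filter \
  "git rm -rf -q --cached --ignore-unmatch downloads" HEAD >/dev/null 2>&1

# No cleanup on purpose: check whether the old blob is still in the source repo.
if git cat-file -e "$blob" 2>/dev/null; then in_source=yes; else in_source=no; fi

# Clone through the normal object transfer instead of copying .git files.
git clone -q --no-local "$work/repo" "$work/clone"
if git -C "$work/clone" cat-file -e "$blob" 2>/dev/null; then
  in_clone=yes
else
  in_clone=no
fi
echo "in source: $in_source, in clone: $in_clone"
```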

LGTM, maybe one of you guys can make it even smaller? We probably will never ever get the chance again.

berndhahnebach commented 4 years ago

> Anyways, I have no idea about this project and was probably highlighted by mistake :-) (unsubscribed now, so please don't @ me again)

Sorry johannes. Yes, you were highlighted by mistake. The real one would have been @jreinhardt. Sorry for the inconvenience.

BTW: We are aware of what you have said and we are discussing whether it is worth it.

cheers bernd

jreinhardt commented 4 years ago

Hi,

yes, let's do this.

There might be even more to win: when I check the largest blobs in the repo (https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history), many of those are js files with literal 3d models for 3d.js. This is about 40 MB (but it probably compresses quite well in the pack files).
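
A sketch of one common way to list the largest blobs in history, shown on a hypothetical demo repo (the file names are made up). The reported size is the uncompressed object size, not the space taken inside a pack file.

```shell
set -eu
# Throwaway demo repo; paths, file names and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$work/repo"
cd "$work/repo"
head -c 50000 /dev/urandom > big.bin
echo "code" > part.txt
git add .
git commit -q -m "initial import"
git rm -q big.bin
git commit -q -m "remove big.bin from the tip"

# Five largest blobs anywhere in history: type, id, size in bytes, path.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3,3nr |
  head -n 5)
echo "$largest"
```

Note that big.bin still shows up although it was deleted in the latest commit, which is exactly why removing files from the tip alone does not shrink the repo.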

Also, when using filter-branch, tags are unaffected and might still reference big blobs and keep them from being garbage collected. So I removed all tags and branches except the main branch.
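
Removing the extra refs could look roughly like this. It is a sketch on a hypothetical demo repo, and the branch name main, the tag v0.1 and the branch old-experiment are assumptions made for the demo.

```shell
set -eu
# Throwaway demo repo; paths, ref names and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$work/repo"
cd "$work/repo"
echo "code" > part.txt
git add .
git commit -q -m "initial import"
git branch -M main
git tag v0.1
git branch old-experiment

# Delete every tag.
git tag -l | while read -r tag; do git tag -d "$tag" >/dev/null; done

# Delete every local branch except main.
git for-each-ref --format='%(refname:short)' refs/heads |
  while read -r branch; do
    if [ "$branch" != "main" ]; then git branch -q -D "$branch"; fi
  done

# List whatever refs are left.
remaining=$(git for-each-ref --format='%(refname)')
echo "$remaining"
```

An alternative that keeps the tags is to let filter-branch rewrite them as well, by passing --tag-name-filter cat and rewriting -- --all instead of only HEAD.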

Anyway, my attempt with

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch downloads/ output/ html/3dviews/*" HEAD

gave me

19M ./.git
54M .

So I guess this is more or less the same as what Bernd achieved...

KOLANICH commented 4 years ago

https://github.com/KOLANICH/strippedbolts

53M . (it was 41, but when I downloaded from the repo it became 53; likely the LFS files had just not been checked out)
18M ./.git

I have put all the images, fonts, compiled translations and FreeCAD zips into LFS. Fonts and zips don't give any noticeable gain, but they are binary, so their place is there.

We may want to remove some PNGs, since they show the same drawings as the SVGs. Removing or LFS-ing other file types gives no noticeable gain, probably because those files were modified too few times.

LFS has an extremely large drawback, though: GH considers it a driver to sell paid services, so there are quotas on it, and anything uploaded permanently eats into the quota of the parent account until the repo is deleted, losing all the issues, PRs and forks.

So IMHO it isn't worth it, at least until M$ changes its policy on LFS.

berndhahnebach commented 4 years ago

That means we could go for the one on my GitHub.

pwab commented 3 years ago

I'm also for keeping the size of the repository as small as possible. As those download files do not seem to correspond to the published releases, I'm not sure why they are kept in the first place. Sorry if I get something wrong here.

Also, the gh-pages branch seems to be no longer needed, as the website's repository is boltsparts.github.io.


Just for reference, this is the current repository master: [image]

luzpaz commented 2 years ago

bump

berndhahnebach commented 2 years ago

Found a problem ... I have some branches ... https://github.com/berndhahnebach/BOLTS/branches/all They are not part of the git tree anymore. But most of them have just a few commits, which means cherry-picking would work. So at least it is not a real problem.
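
A sketch of how such a branch could be replayed onto the rewritten history with cherry-pick. The two demo repositories, the remote name old and the branch name feature are hypothetical.

```shell
set -eu
# Throwaway demo repos; paths, ref names and identity are hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "old": pre-rewrite history with a feature branch of two commits.
git init -q "$work/old"
cd "$work/old"
echo "code" > part.txt
echo "binary" > big.bin
git add .
git commit -q -m "initial import"
git branch -M main
git checkout -q -b feature
echo "one" > feature.txt
git add feature.txt
git commit -q -m "feature step 1"
echo "two" >> feature.txt
git commit -q -a -m "feature step 2"

# "new": stands in for the rewritten history without big.bin.
git init -q "$work/new"
cd "$work/new"
echo "code" > part.txt
git add .
git commit -q -m "initial import (rewritten)"
git branch -M main

# Fetch the old branches and replay only the branch-specific commits.
git remote add old "$work/old"
git fetch -q old
git cherry-pick old/main..old/feature >/dev/null
# Remove the remote again so its refs no longer point at the old history.
git remote remove old

picked=$(git log --format=%s | tr '\n' '|')
echo "$picked"
```

The fetch temporarily pulls the old objects into the new clone. They become unreferenced once the remote is removed and can be garbage collected later.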

berndhahnebach commented 2 years ago

Stripped this directory too; I have to delete it after website generation anyway to get the webpage up: backends/website/static/source/bootstrap-3.2.0/ This saves another 6.3 MB ... 46 MB

berndhahnebach commented 2 years ago

If I move the repository to an archive repo, all issues and PRs will be moved too. But we could recreate them if needed and set a link to the Archive repository.

berndhahnebach commented 2 years ago

I am curious if more regressions will come up.

berndhahnebach commented 2 years ago

To clearly state that it is another repo, the new repo could be named bolts instead of BOLTS. That makes it even easier to type on a keyboard. Thus a link to an issue would never link to the wrong issue, because the new repo will have new issues.

berndhahnebach commented 2 years ago

Since I move the repo, all forks will still work. After the move I will make a last commit. In the repo's README.md I will explain and link to this issue.

berndhahnebach commented 2 years ago

The master/main branch of the new repository will be main. This follows the new guidelines, and it also signals that something has changed.

berndhahnebach commented 2 years ago

Just tried it: repo names are not case sensitive. Thus, to avoid a mix-up, we would need to use another repo name. I will use boltsparts for the new main stripped BOLTS repo.

berndhahnebach commented 2 years ago

Links are not broken; somehow GitHub seems to know the repo name has changed.

berndhahnebach commented 2 years ago

a new BOLTS was born ... https://github.com/boltsparts/boltsparts

I will not close this ATM; let's see what happens ...

Moult commented 2 years ago

awesome! Is it possible to transfer issues?

berndhahnebach commented 2 years ago

good question,

berndhahnebach commented 2 years ago

https://docs.github.com/en/issues/tracking-your-work-with-issues/transferring-an-issue-to-another-repository