Closed · rettigl closed this issue 4 months ago
We (@mkuehbach and @sanbrock) inspected the situation and raise the following discussion points:
One idea for having "non-history" branches:
I'm not sure orphan is what we need. The GitHub Pages branch is actually an orphan branch, because it does not contain the history of the original repo. What we want is a branch which does not keep any history, just the latest state. Basically, we want a plain file system without any history 😅
Since this is an orphan branch, we might also just be able to delete this branch from the repo entirely and get rid of the history. That way we wouldn't even need to rewrite the git history.
Maybe we should just switch to https://about.readthedocs.com/?ref=readthedocs.org ? But afaik they don't have branch-based deployment (they version based on git tags instead).
I am wondering if we could just delete most of the content that Sphinx generates before pushing to the fairmat-docs branch. For building the website, we basically only need the HTML files and the static assets, right? So e.g. .doctrees (which actually contains the largest files) could just be removed. This doesn't get rid of all content in the git history going forward, but at least we don't add as much data with each commit.
Yes, we should definitely do this anyway. I actually thought this is what I did by just copying the built html folder (https://github.com/FAIRmat-NFDI/nexus_definitions/blob/26d4faa5c6950161e48f0672f3fdfd8c9bc907e2/.github/workflows/fairmat-build-pages.yaml#L37), but it seems to also contain the build artifacts. I think the action has some cleanup option we can just use.
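As a minimal sketch (all paths here are assumptions, not taken from the actual workflow), the pruning before the deploy step could look like:

```shell
# Hypothetical paths: prune Sphinx outputs the rendered site does not need
# before committing the build to the deployment branch.
cd build/html      # assumed Sphinx HTML output directory
rm -rf .doctrees   # pickled build cache (the largest files)
rm -rf _sources    # *.rst.txt copies used only for "show source" links
rm -f .buildinfo   # Sphinx build fingerprint
```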
Another option I just came across: https://github.com/peaceiris/actions-gh-pages has a force-orphan option. When we deploy, we could first delete the docs/old-branch folder for all the branches that are not active anymore (using git rm -rf docs/old-branch), add the new docs for the branch that we are working on, and then deploy to the fairmat-docs branch with the force-orphan option. Not sure if that would work though.
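As a sketch of such a deploy step (branch name and publish directory are assumptions; the action's input is spelled force_orphan in the YAML):

```yaml
# Hypothetical workflow step: publish the built docs to the deployment branch
# as a single commit with no history.
- name: Deploy docs
  uses: peaceiris/actions-gh-pages@v4
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_branch: fairmat-docs   # assumed target branch
    publish_dir: ./build/html      # assumed Sphinx output directory
    force_orphan: true             # keep only the latest state
```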
This force-orphan option is exactly what we want. Good catch! We don't even need to track the folders, I think, because we also have a CI job which deletes the old folders when the branch is deleted (though currently they stay in the git history, of course). I think if we replace the current CI with this action and activate force-orphan, we should be good for the future. Then we just need to solve how to remove the old branch with its entire history.
If all the large files really are only in one branch, wouldn't it be sufficient to reset this branch to its first commit, add the latest version, and force-push to delete all the commits on this branch?
Yes, this is kind of my idea: that we can just remove this branch and remove the large history with it. git, however, sometimes still keeps the commits under certain conditions (there are some rollback options which keep them). But I think there is definitely a solution to this.
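As a rough sketch (branch name assumed): deleting the branch only makes its commits unreachable. In a local clone they linger until the reflog entries referencing them expire and git gc prunes them; on GitHub's side, dangling objects are only dropped by GitHub's own garbage collection (or on request via support).

```shell
# Hypothetical branch name; run in a local clone.
git branch -D fairmat-docs              # delete the local branch
git push origin --delete fairmat-docs   # delete it on the remote
git reflog expire --expire=now --all    # drop reflog entries still pointing at the commits
git gc --prune=now                      # garbage-collect the now-unreachable objects
```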
@lukaspie and @domna, great work, as it also implicitly addresses which of these many documentations we need to have on display. Two points: 1.) Apart from the .doctrees directory, the _sources directory can also be deleted; it is Sphinx build cache as well, mainly the *.rst.txt documents from which the HTML is generated. 2.) There are still a couple of legacy PDF documents that we could get rid of in our fork and instead point people to, as they are part of the original NIAC repo.
Inspected the situation with the PDF files. Turns out these are remnants of intermediate work that we inherited from the original NIAC branch. They also had a period in their history where they released the documentation into the same repo, including PDFs: sometimes not only NeXusManual.pdf (which we only test for whether it compiles, but don't even store), but also the so-called ImpatientGuide. Indeed, an inspection of commits related to *.pdf blobs identifies all of these blobs as referenced in commits between 2011 and 2022. https://github.com/nexusformat/definitions/tree/2dbe08fe is one such exemplar, where pdf/ sits right in the top-level directory, and this is why we are still carrying well over another approx. 100 MB of unnecessary copies with us.
@sanbrock @lukaspie we should propose this to NIAC and then remove this. I am almost sure this payload is still 100% from NIAC times and worth erasing from the fairmat branch. git-filter-repo, applied in a sandbox and instructed to remove all PDF blobs, reduces the repo down to 50 MB. Given that there are also still some old publications, I think it is worth doing this final step to have definitions finally in a blazing fast and clean condition for everybody.
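A sketch of that sandbox run (repo path and clone name are assumptions; do this only in a disposable clone, since git-filter-repo rewrites all commit SHAs):

```shell
# Hypothetical sandbox: clone a throwaway copy, strip every *.pdf blob from
# the entire history, then measure the result.
git clone --no-local nexus_definitions nxdefs-sandbox
cd nxdefs-sandbox
git filter-repo --invert-paths --path-glob '*.pdf'
git count-objects -vH   # compare repository size before/after
```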
I think a good idea would be to fork this repo and run the clean-up (with git-filter-repo) in the fork and then we can see what is the difference between this repo and the fork. Can you try that @mkuehbach? If it looks fine, we can then just do this for our own repo and afterwards suggest to NIAC that they do this themselves on their repo as well.
You mean that you would like to get that additional check and the perspective of the GUI via a regular PR from that cleaned fork on my own GitHub towards our nexus_definitions (which is my fork's upstream repo), right?
Yeah, this may work. I haven't used git-filter-repo much, so I am not sure if an actual PR would work.
So I followed this suggestion: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#purging-a-file-from-your-repositorys-history for www.github.com/mkuehbach/nxdefs-cleaned
Specifically:
1.) Cloned the repo.
2.) Made a backup of .git/config.
3.) Rewrote the history using the following command:
git-filter-repo --invert-paths --path "pdf/" --path "legacy_docs/" --path "2010-05-10-workshop/" --path "workshop/" --path "misc/" --path "manual_archive/" --path "impatient/" --path "_static/" --path "_images/" --path "_sphinx/" --path "_sources/" --path "_downloads/"
4.) git count-objects -vH — went down to 53.24 MB.
5.) Restored .git/config from my backup (filter-repo, as expected, had removed the remote as a safety measure to avoid people accidentally pushing back).
6.) git push origin --force --all
7.) git push origin --force --tags
But now one has to contact GitHub to request that they remove the dangling references ...
So in principle one could do that. HOWEVER: "Changed commit SHAs may affect open pull requests in your repository. We recommend merging or closing all open pull requests before removing files from your repository."
So I did the experiment and we now have some understanding of it, but there are open PRs from us and others. To prevent possible harm, I will therefore delete my fork now, as it can easily be recreated once we have all PRs merged and closed, if we still want to pursue this force push then.
@lukaspie @sanbrock above is my report about this exercise
Thanks for checking. To me, it doesn't really seem worth the effort, given the risk of having to recover things that might break. The repo is now relatively small and, due to the new deploy workflow, it will not grow much bigger in the future. Therefore, I vote we skip this git-filter-repo step.
I'll close the issue now, we can make a new one if we ever want to consider this again.
I think merging such a fork/PR will not remove any data, but rather add another 4k commits. This is certainly not what we want, so really only the force-pushing of the cleaned repo would do. But as Lukas commented, I would also not vote for such a breaking change, as you have nicely solved the main issue without this.
Thanks; to be clear, we all understood that this PR was never meant to actually be merged, but to serve as an exercise for a person with an owner-equivalent role and rights who is interested in cleaning the repository.
This repository has grown extremely large over time (>1 GB of download for cloning). It appears to be due to copies of files like docs/mpes-refactor/.doctrees/environment.pickle (each more than 50 MB), tons of copies of pdf/NeXusManual.pdf, and docs/mpes-liquid/_downloads/0d9b3db52a075e9d9b6a1a0457a842ba/nxdl_vocabulary.json, i.e. all old docs artefacts. Why are they part of the repository? Check with:
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest