FAIRmat-NFDI / nexus_definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

Repository extremely large #237

Closed rettigl closed 4 months ago

rettigl commented 5 months ago

This repository has grown extremely large over time (>1 GB to download for cloning). It appears to be due to copies of docs/mpes-refactor/.doctrees/environment.pickle (each more than 50 MB), tons of copies of pdf/NeXusManual.pdf, and docs/mpes-liquid/_downloads/0d9b3db52a075e9d9b6a1a0457a842ba/nxdl_vocabulary.json, so all old documentation build artefacts. Why are they part of the repository?

Check with:

git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort --numeric-sort --key=2 \
  | cut -c 1-12,41- \
  | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

mkuehbach commented 4 months ago

We (@mkuehbach and @sanbrock) inspected the situation and open the following discussion points:

lukaspie commented 4 months ago

One idea for having "non-history" branches:

domna commented 4 months ago

> One idea for having "non-history" branches:

I'm not sure if orphan is what we need. The github pages branch is actually an orphan branch, because it does not contain the history of the original repo. What we want is a branch which does not keep any history, just the latest state. Basically, we want a basic file system without any history 😅

Since this is an orphan branch, we might also just be able to delete this branch from the repo entirely and get rid of its history. That way we wouldn't even need to rewrite the git history.

domna commented 4 months ago

Maybe we should just switch to https://about.readthedocs.com/?ref=readthedocs.org? But AFAIK they don't offer branch-based deployment (they version based on git tags instead).

lukaspie commented 4 months ago

I am wondering if we could just delete most of the content that Sphinx generates before pushing to the fairmat-docs branch. For building the website, we basically only need the HTML files and the static assets, right? So e.g. .doctrees (which actually contains the largest files) could just be removed. This doesn't get rid of all contents in the git history going forward, but at least we don't add as much data with each commit.

domna commented 4 months ago

> I am wondering if we could just delete most of the content that Sphinx generates before pushing to the fairmat-docs branch. For building the website, we basically only need the HTML files and the static assets, right? So e.g. .doctrees (which actually contains the largest files) could just be removed. This doesn't get rid of all contents in the git history going forward, but at least we don't add as much data with each commit.

Yes, we should definitely do this anyway. I actually thought this was what I did by just copying the built html folder (https://github.com/FAIRmat-NFDI/nexus_definitions/blob/26d4faa5c6950161e48f0672f3fdfd8c9bc907e2/.github/workflows/fairmat-build-pages.yaml#L37). But it seems to also contain the build artifacts. I think the action has some cleanup option we can just use.
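The cleanup step discussed above could be sketched roughly like this (a minimal sketch; the `build/html` path is an assumption standing in for the actual sphinx-build output directory in the workflow):

```shell
#!/bin/sh
# Sketch: prune Sphinx build caches before deploying (paths are assumptions).
set -e
BUILD_DIR="${1:-build/html}"     # hypothetical sphinx-build output directory
rm -rf "$BUILD_DIR/.doctrees"    # pickled doctrees/environment, the largest files
rm -rf "$BUILD_DIR/_sources"     # raw *.rst.txt copies used for "show source"
# Everything that remains (HTML, _static, _images, ...) is what the site serves.
```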

lukaspie commented 4 months ago

Another option I just came across: https://github.com/peaceiris/actions-gh-pages has a force-orphan option. When we deploy, we could first delete the docs/old-branch folder for all the branches that are not active anymore (using git rm -rf docs/old-branch), add the new docs for the branch that we are working on, and then deploy to the fairmat-docs branch with the force-orphan option. Not sure if that would work, though.

domna commented 4 months ago

> Another option I just came across: https://github.com/peaceiris/actions-gh-pages has a force-orphan option. When we deploy, we could first delete the docs/old-branch folder for all the branches that are not active anymore (using git rm -rf docs/old-branch), add the new docs for the branch that we are working on, and then deploy to the fairmat-docs branch with the force-orphan option. Not sure if that would work, though.

This force-orphan is exactly what we want. Good catch! I don't think we even need to track the folders, because we also have a CI job that deletes the old folders when a branch is deleted (though currently they stay in the git history, of course). I think if we replace the current CI with this action and activate force-orphan, we should be good for the future. Then we just need to solve how to remove the old branch with its entire history.

rettigl commented 4 months ago

> Since this is an orphan branch, we might also just be able to delete this branch from the repo entirely and get rid of its history. That way we wouldn't even need to rewrite the git history.

If all the large files really are only in one branch, wouldn't it be sufficient to reset this branch to its first commit, add the latest version, and force-push to delete all the commits on this branch?

domna commented 4 months ago

> If all the large files really are only in one branch, wouldn't it be sufficient to reset this branch to its first commit, add the latest version, and force-push to delete all the commits on this branch?

Yes, this is kind of my idea: that we can just remove this branch and remove the large history with it. git, however, sometimes still keeps the commits under certain conditions (there are some rollback options which keep them). But I think there is definitely a solution to this.

lukaspie commented 4 months ago

#268 helped to bring the repository size down to below 200 MB. We can now think about whether there is anything else we can remove to make it even smaller.

mkuehbach commented 4 months ago

@lukaspie and @domna, great work, as it also implicitly addresses which of these many documentations we need to have on display. Two points:

1. Apart from the .doctrees directory, the _sources directory can also be deleted; it is likewise Sphinx build cache, mainly *.rst.txt documents from which the HTML is generated.
2. There are still a couple of legacy PDF documents that we could get rid of in our fork; instead, we could point people to the original NIAC repo, of which they are a part.

mkuehbach commented 4 months ago

I inspected the situation with the PDF files. It turns out these are remnants of intermediate work that we inherited from the original NIAC branch. They also had a period in their history where they released the documentation into the same repo, including PDFs: sometimes not only the NeXusManual.pdf (which we only test for compilation but don't even store), but also the so-called ImpatientGuide. Indeed, an inspection of commits related to *.pdf blobs identifies all of these blobs as referenced by commits between 2011 and 2022. https://github.com/nexusformat/definitions/tree/2dbe08fe is one such exemplar, where pdf/ sits right in the top-level directory; this is why we are still carrying well over another approx. 100 MB of unnecessary copies with us.

@sanbrock @lukaspie, we should propose this to NIAC and then remove it. I am almost sure this payload is 100% from NIAC times and worth erasing from the fairmat branch. git-filter-repo, applied in a sandbox and instructed to remove all PDF blobs, reduces the repo down to 50 MB. Given that there are also still some old publications in there, I think it is worth doing this final step so that everybody finally has the definitions in a blazing fast and clean condition.

lukaspie commented 4 months ago

I think a good idea would be to fork this repo, run the clean-up (with git-filter-repo) in the fork, and then see what the difference between this repo and the fork is. Can you try that, @mkuehbach? If it looks fine, we can then just do this for our own repo and afterwards suggest to NIAC that they do the same on their repo as well.

mkuehbach commented 4 months ago

You mean that you would like the additional check and perspective of the GUI, via a regular PR from that cleaned fork on my own GitHub account towards our nexus_definitions (which is my fork's upstream repo), right?

lukaspie commented 4 months ago

Yeah, this may work. I haven't used git-filter-repo much, so I am not sure if an actual PR would work.

mkuehbach commented 4 months ago

So I followed this suggestion: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#purging-a-file-from-your-repositorys-history for www.github.com/mkuehbach/nxdefs-cleaned

Specifically:

1. Cloned the repository.
2. Made a backup of .git/config.
3. Rewrote the history using the following command:

git-filter-repo --invert-paths --path "pdf/" --path "legacy_docs/" --path "2010-05-10-workshop/" --path "workshop/" --path "misc/" --path "manual_archive/" --path "impatient/" --path "_static/" --path "_images/" --path "_sphinx/" --path "_sources/" --path "_downloads/"

4. Ran git count-objects -vH; the size went down to 53.24 MB.
5. Restored .git/config from my backup (as expected, filter-repo had removed the remote, a safety measure to avoid people accidentally pushing back).
6. Ran git push origin --force --all.
7. Ran git push origin --force --tags.

But now one has to contact GitHub to request them to remove dangling references ...

mkuehbach commented 4 months ago


So in principle one could do that. HOWEVER: "Changed commit SHAs may affect open pull requests in your repository. We recommend merging or closing all open pull requests before removing files from your repository."

So I did the experiment, and we now have some understanding of the procedure, but there are open PRs from us and others. To prevent possible harm, I will therefore delete my fork now, as it can easily be recreated once we have all PRs merged and closed, if we want to pursue this force push then.

mkuehbach commented 4 months ago

@lukaspie @sanbrock above is my report about this exercise

lukaspie commented 4 months ago

Thanks for checking. To me, it doesn't really seem worth the effort, given the risk of having to recover things that break. The repo is now relatively small and, due to the new deploy workflow, it will not grow much bigger in the future. Therefore, I vote we skip this git-filter-repo step.

I'll close the issue now, we can make a new one if we ever want to consider this again.

rettigl commented 4 months ago

> So in principle one could do that. HOWEVER: "Changed commit SHAs may affect open pull requests in your repository. We recommend merging or closing all open pull requests before removing files from your repository."
>
> So I did the experiment, and we now have some understanding of the procedure, but there are open PRs from us and others. To prevent possible harm, I will therefore delete my fork now, as it can easily be recreated once we have all PRs merged and closed, if we want to pursue this force push then.

I think merging such a fork/PR would not remove any data, but rather add another 4k commits. This is certainly not what we want, so really only force-pushing the cleaned repo would do. But as Lukas commented, I would also not vote for such a breaking change, as you have nicely solved the main issue without it.

mkuehbach commented 4 months ago

Thanks; just so we have all understood: this PR was never meant to actually be merged, but to serve as an exercise for a person with owner-equivalent role and rights who is interested in cleaning the repository.