denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

Git speedups for large repos #207

Open gjost opened 2 years ago

gjost commented 2 years ago

Spend no more than 2 days on this.

The ddr-densho-1000 repo is really huge, and this causes usability problems even when the repo is checked out locally. In particular, git status takes forever to run. The repo has tons of files and also a long history (~4000 commits).

IDEA cp -r ddr-densho-1000 ddr-densho-1000new, remove .git/, git init. Where does the slow come from?
TODO research git performance (num objects, size, repo age)
TODO can we set git caching interval?
TODO profile git operations
Slowness does not correlate to the number of objects or the physical size of the repo. It seems to be the length of the commit history.
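One way to approach the "profile git operations" TODO is sketched below. The helper names repo_stats and trace_status are made up for illustration; the git subcommands and the GIT_TRACE_PERFORMANCE environment variable are standard Git. This is a rough starting point, not a full profiling setup.

```shell
# Print rough size indicators for a repo: commit count, tracked file count,
# and loose/packed object statistics.
# Usage: repo_stats [REPO_PATH]   (defaults to the current directory)
repo_stats() {
    echo "commits: $(git -C "${1:-.}" rev-list --count HEAD)"
    echo "tracked_files: $(git -C "${1:-.}" ls-files | wc -l | tr -d ' ')"
    git -C "${1:-.}" count-objects -v
}

# Run git status with Git's built-in performance tracing enabled.
# Timing lines for each internal phase are written to stderr.
# Usage: trace_status [REPO_PATH]
trace_status() {
    GIT_TRACE_PERFORMANCE=1 git -C "${1:-.}" status --short
}
```

Running repo_stats on both ddr-densho-1000 and the history-free ddr-densho-1000new copy, then comparing trace_status output for each, should help show whether the time goes to history-related work or to scanning the working tree.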

Ways to improve git status performance (2012)
https://stackoverflow.com/questions/4994772/ways-to-improve-git-status-performance
10 GB repo on NFS on Linux. First git status took ~36 min, subsequent runs 8 min.

Slow Git Performance (2021)
https://support.purestorage.com/Knowledge_Base/FlashBlade_KB/Slow_Git_Performance

OPTIONS

Shallow clone
git clone --depth=50 --no-single-branch COLLECTION

Sparse checkout
https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
git clone COLLECTION
git sparse-checkout init --cone
git sparse-checkout set ...
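A minimal sketch of the sparse checkout sequence as a reusable function. The name sparse_clone and its argument order are hypothetical; the git commands are the ones listed in the notes. It assumes Git 2.25 or newer, where git sparse-checkout is available.

```shell
# Clone a collection, then restrict the working tree to the given directories.
# In cone mode, files at the repo root stay checked out as well.
# Usage: sparse_clone SOURCE DEST DIR [DIR...]
sparse_clone() {
    src="$1"
    dest="$2"
    shift 2
    git clone --quiet "$src" "$dest" &&
    git -C "$dest" sparse-checkout init --cone &&
    git -C "$dest" sparse-checkout set "$@"
}
```

Note that this follows the sequence in the notes, so the initial clone still writes every file once before the working tree is narrowed. Only later commands such as git status benefit from the smaller tree.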

Partial clones
https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/
Blobless clones: git clone --filter=blob:none COLLECTION
Treeless clones: git clone --filter=tree:0 COLLECTION
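To compare these options side by side, the clone commands can be wrapped in small functions. The function names here are hypothetical and the sketch only wraps the commands already listed. The optional third argument to shallow_clone is an addition for experimenting with depths other than 50.

```shell
# Shallow clone: only the most recent DEPTH commits of each branch.
# Usage: shallow_clone SOURCE DEST [DEPTH]
shallow_clone() {
    git clone --quiet --depth="${3:-50}" --no-single-branch "$1" "$2"
}

# Blobless clone: full commit and tree history, file contents fetched on demand.
# Usage: blobless_clone SOURCE DEST
blobless_clone() {
    git clone --quiet --filter=blob:none "$1" "$2"
}

# Treeless clone: full commit history, trees and file contents fetched on demand.
# Usage: treeless_clone SOURCE DEST
treeless_clone() {
    git clone --quiet --filter=tree:0 "$1" "$2"
}

# Number of commits reachable from HEAD in the local clone.
# Usage: local_commit_count REPO_PATH
local_commit_count() {
    git -C "$1" rev-list --count HEAD
}
```

Two caveats when testing: Git ignores --depth for plain local paths, so a local source needs a file:// URL, and the --filter options only take effect if the serving side has uploadpack.allowFilter enabled. Otherwise Git warns and falls back to a full clone.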

TODO test shallow, sparse clones
TODO test on Dana's machine

gjost commented 2 years ago

We Put Half a Million files in One git Repository, Here’s What We Learned
https://canvatechblog.com/we-put-half-a-million-files-in-one-git-repository-heres-what-we-learned-ec734a764181

To reduce the amount of work git needs to do to find changes, we used the fsmonitor hook with Watchman so we capture changes as they happen instead of having to scan all files in the repository every time a command is run. We also enabled feature.manyFiles, which under the hood enables the untracked cache to skip directories and files that haven’t been modified.

Git also has a built-in command (maintenance) to optimize a repository’s data, speeding up commands and reducing disk space. This isn’t enabled by default, so we register it with a schedule for daily and hourly routines.

Sparse checkout
If an engineer can tell us what they usually work on, we can craft a checkout pattern that includes all the required dependencies to run and test their code locally while keeping the checkout as small as possible.

Sparse checkout drawbacks:

https://news.ycombinator.com/item?id=31762245 Interesting
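The settings mentioned in the Canva write-up could be tried on a local collection repo roughly as follows. The function names are made up for illustration and this has not been tested against ddr-densho-1000. feature.manyFiles, core.fsmonitor and git maintenance start are standard Git features; the Watchman step assumes Watchman is installed and that the repo still has the fsmonitor-watchman.sample hook that Git's default template ships.

```shell
# Enable Git's bundle of settings for repos with many files
# (includes the untracked cache).
# Usage: enable_many_files REPO_PATH
enable_many_files() {
    git -C "$1" config feature.manyFiles true
}

# Assumption: Watchman is installed and the sample hook exists in this repo.
# Activates the sample Watchman hook and points core.fsmonitor at it.
# Usage: enable_watchman_fsmonitor REPO_PATH
enable_watchman_fsmonitor() {
    cp "$1/.git/hooks/fsmonitor-watchman.sample" "$1/.git/hooks/fsmonitor-watchman" &&
    chmod +x "$1/.git/hooks/fsmonitor-watchman" &&
    git -C "$1" config core.fsmonitor .git/hooks/fsmonitor-watchman
}

# Register the repo for scheduled background maintenance.
# Note: this changes the user's global Git config and system scheduler.
# Usage: schedule_maintenance REPO_PATH
schedule_maintenance() {
    git -C "$1" maintenance start
}
```

Each function is independent, so the settings can be enabled one at a time and git status timed after each to see which one actually helps.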

The Case Against Monorepos (Infoworld)

Trunk-Based Development: Monorepos (https://trunkbaseddevelopment.com/monorepos)

monorepo.tools - Everything you need to know about monorepos, and the tools to build them (https://monorepo.tools)