LeeBergstrand / BackBLAST_Reciprocal_BLAST

This repository contains a reciprocal BLAST program for filtering down BLAST results to best bidirectional hits. It also contains a toolkit for finding and visualizing BLAST hits for gene clusters within multiple bacterial genomes.

Large repo size #47

Open jmtsuji opened 5 years ago

jmtsuji commented 5 years ago

The BackBLAST repo is currently a couple hundred MB in size, which is quite large. I suspect this is mostly due to old ExampleData files, which have now been removed in BackBLAST2.

We'll eventually need to clean up the repo somehow to get its size down. We have a couple of options:

@LeeBergstrand Thoughts? Best practices?
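For reference, here's a quick way to check how much of that size is sitting in the git history itself versus the working tree (a minimal sketch, assuming a local clone and GNU coreutils):

```bash
# Size of all objects git is tracking (i.e. the history), human-readable
git count-objects -vH

# Size of the checked-out files, excluding the .git directory, for comparison
du -sh --exclude=.git .
```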

LeeBergstrand commented 5 years ago

@jmtsuji I hadn't read that article closely when I sent it to you. The approach that author uses is lazy and incorrect. I honestly thought he was doing what's described below: you can prune files out of your git history locally and then force-push the rewritten history over the one on GitHub. That way you don't have to make any new repos or lose any files from the past.

See here for how to do it: https://help.github.com/en/articles/removing-sensitive-data-from-a-repository

The above is a tutorial for removing sensitive files (think those containing AWS keys) from a repo. The concept is the same for large files: you prune them from the history.
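For example, here's roughly what that could look like with the BFG Repo-Cleaner (or git filter-repo) on a fresh mirror clone. This is just a sketch: the ExampleData folder name comes from the comment above, the clone URL is filled in from the repo name, and the exact paths to prune would need to be confirmed against the large-file list first.

```bash
# Work on a fresh mirror clone so a botched rewrite is easy to throw away
git clone --mirror https://github.com/LeeBergstrand/BackBLAST_Reciprocal_BLAST.git

# Option A: BFG Repo-Cleaner -- delete the old ExampleData folder from all of history
java -jar bfg.jar --delete-folders ExampleData BackBLAST_Reciprocal_BLAST.git

# Option B: git filter-repo (run inside the clone) -- drop everything under ExampleData/
# git filter-repo --path ExampleData --invert-paths

# Expire old reflogs and repack so the pruned objects are actually freed
cd BackBLAST_Reciprocal_BLAST.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive

# Push the rewritten history back to GitHub (a mirror clone force-updates all refs);
# coordinate first so nobody has unpushed work based on the old history
git push
```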


LeeBergstrand commented 5 years ago

Here's how to find the large files:

https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history
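Something along the lines of the snippets in that thread, pasted here for convenience (sizes are in bytes, largest last):

```bash
# List every object in history, keep only blobs, sort by size, show the 20 largest
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  tail -n 20
```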

LeeBergstrand commented 5 years ago

@jmtsuji I would expect the repo to still be several tens of megabytes even after the large files are removed. Squashing commits may make the history smaller as well (see the sketch below).

https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
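If we squash, it would presumably happen as part of the same history rewrite, e.g. an interactive rebase (a sketch only; the number of commits to revisit is arbitrary here, and this changes commit hashes just like the file pruning does):

```bash
# Interactively rewrite the last 20 commits; in the editor, mark commits as
# "squash" (or "fixup") to fold them into the commit above them
git rebase -i HEAD~20

# Or revisit the entire history starting from the first commit:
# git rebase -i --root
```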

jmtsuji commented 5 years ago

@LeeBergstrand Thanks for the tips. I'll try downsizing the repo when I have some time. I'll warn you beforehand so that neither of us is developing the code during the pruning process. It might take me a couple of months to get to this, so I'll leave this issue open for now.