Large datasets - Githubissues

eirenjacobson commented 7 years ago

Some of the data associated with my project is too big to upload to GitHub (>50 MB). Is there a way to keep those folders of data associated with the R project and track them with Git (although: they shouldn't change, if that matters) without actually uploading them to GitHub?

ha0ye commented 7 years ago

To store an offline backup, maybe use Git LFS. Otherwise, you can simply not commit it. You can also create a .gitignore file so that Git does not track it.
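In case it helps, here is a minimal sketch of the .gitignore route in a throwaway repository. The folder name data/ and the file big.csv are made up for illustration; swap in your own paths.

```shell
# Sketch in a throwaway repo; "data/" and "big.csv" are hypothetical names.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir data
echo "pretend this is large" > data/big.csv

# Ignore the whole data folder before anything in it is ever committed.
echo "data/" > .gitignore

# check-ignore prints the path and exits 0 when the file is ignored.
git check-ignore data/big.csv

# Only .gitignore is listed as untracked; data/ does not show up.
git status --porcelain
```

The important part is the ordering: the ignore rule is in place before the first git add, so the data never enters the history.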

eirenjacobson commented 7 years ago

Update: I created a .gitignore file and told git to retroactively forget that those data folders and files ever existed (git rm -r --cached). This worked for files, and also worked on my local machine (i.e., the folders and files no longer appeared as tracked by git) but when I pushed to the remote repo, git really really wanted to upload the contents of those folders even though it shouldn't have known they existed. After a few hours, I gave up, re-cloned, and re-constructed my directory with the .gitignore file present at the initial commit. This worked and I was able to push to the remote repo without uploading large data files.

ha0ye commented 7 years ago

Hmm, it sounds like "git rm -r --cached" should have worked, but if you have intermediate commits with the files added in, they will be synced when you push to GitHub.
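One way to check whether that is what happened is to look for the file in the full history rather than in the current index. The following is only a sketch in a scratch repository, with the hypothetical names data/ and big.csv and a demo identity set locally so the commits succeed.

```shell
# Sketch: a file removed with "git rm --cached" still lives in older commits.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Demo identity, local to this scratch repo only.
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir data
echo "pretend this is large" > data/big.csv
git add data/big.csv
git commit -q -m "Add data by mistake"

# Stop tracking the folder in a *new* commit.
echo "data/" > .gitignore
git rm -r -q --cached data
git add .gitignore
git commit -q -m "Stop tracking data"

# The file is gone from the index...
git ls-files

# ...but the first commit still references it, so a push would send it.
git rev-list --objects --all | grep "data/big.csv"
```

If the last command prints a line for your file in your real repo, the file is still part of the history that gets pushed.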

GrantRVD commented 7 years ago

@eirenjacobson: Removing files from a git-tracked folder's or repo's history is a non-trivial thing to do, since removing the file also means having to change the history of tracking that file, thus modifying your entire commit history for as long as the large file has been there. Essentially your only chance to fix the issue is to 'undo' the commit that added the file in the first place, so you either have to catch the problem before making any further commits or you have to roll back your project, which itself introduces the problem of then having to add back in all the changes you want to keep.

For future reference, if you want to do it the easy way (i.e. removing the file after the commit that accidentally added it and before any others), @ha0ye's answer gives you part of the solution. You'll want to run these two commands:

git rm --cached <file_name>
git commit --amend -CHEAD

Note the double-dashes for certain arguments and the single-dash for CHEAD (-CHEAD is the same as -C HEAD, which reuses the existing commit's message). You can add multiple files to the first line instead of <file_name>, separated by spaces. The first line stages the file for removal and the second amends the previous commit (the one that added the file to the git history) so that the specified file(s) no longer appear in it. It is not sufficient to just make a new commit with git commit -m <message>, because doing so wouldn't overwrite the commit that added the file, hence the --amend argument.
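To make the two commands above concrete, here is a small end-to-end sketch in a scratch repository. The file names script.R and big.csv and the demo identity are assumptions for illustration only.

```shell
# Sketch: drop a large file from the most recent commit with --amend.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Demo identity, local to this scratch repo only.
git config user.email "demo@example.com"
git config user.name "Demo"

echo "analysis" > script.R
git add script.R
git commit -q -m "Initial commit"

# The accidental commit: a script change plus a large file.
echo "more analysis" >> script.R
echo "pretend this is large" > big.csv
git add script.R big.csv
git commit -q -m "Update analysis"

# Unstage the large file, then rewrite the last commit, reusing its message.
git rm -q --cached big.csv
git commit -q --amend -CHEAD

# Count how often big.csv appears in commits reachable from the branch.
git rev-list --objects HEAD | grep -c "big.csv" || true

# The file itself is still on disk.
ls big.csv
```

This only works cleanly while the bad commit is the latest one and has not been pushed; after that, the amended commit no longer matches what the remote has.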

As for your original question about tracking a file without pushing it, there are two ways:

1. Create an entirely separate folder for the data/files outside of the one connected to your remote repo.
2. Add a new folder to .gitignore, put all the (new) files you don't want pushed to GitHub in that folder, cd to the new folder, and run git init there. That will tell git to track these files separately, and they won't be included in any push to the remote repo connected to the parent directory.
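The second option might look roughly like this. It is a sketch in a scratch directory, and the folder name local_data/, the file big.csv, and the demo identity are all hypothetical.

```shell
# Sketch: an ignored folder that is tracked by its own nested repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Demo identity, local to this scratch repo only.
git config user.email "demo@example.com"
git config user.name "Demo"

# Parent repo: ignore the data folder from the start.
echo "local_data/" > .gitignore
git add .gitignore
git commit -q -m "Ignore local data folder"

mkdir local_data
echo "pretend this is large" > local_data/big.csv

# Nested repo: its own history, with no remote configured.
cd local_data
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git add big.csv
git commit -q -m "Track data locally"
cd ..

# The parent sees nothing new to commit...
git status --porcelain

# ...while the nested repo has its own log.
git -C local_data log --oneline
```

Since the nested repository has no remote, its commits stay on your machine unless you add one yourself.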

Open-Data-Science-at-SIO / RRROBOTS

Large datasets #15