BIOL548O / Discussion

A repository for course discussion in BIOL548O

removing big data files out of your repo history #15


JoeyBernhardt commented 8 years ago

Hi @BIOL548O/2016_students and @aammd,

I've run into some problems with version controlling some large data files (i.e., >100 MB) that I created as part of our homework! When I tried to push my changes, I got some error messages that look like this:

remote: warning: File data/length-all.csv is 58.19 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 0bcaa1f8c80a10cdb30da6ca9f0ca43f
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File data/all.csv is 179.71 MB; this exceeds GitHub's file size limit of 100.00 MB
To https://github.com/BIOL548O/bernhardt_joey.git
 ! [remote rejected] master -> master (pre-receive hook declined)

So basically, I need to remove these files from my repo.

I tried following these instructions...with not much luck.

Does anybody have any experience using this BFG Repo-Cleaner? It was recommended on the GitHub help pages. Before I head down that path, I'm wondering if anyone has run into similar issues and has any advice for me! Thanks all!
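For anyone who does go the BFG route, the usual workflow (a rough sketch; it assumes the BFG jar has been downloaded locally, and the 50M threshold is just an example) is to run it against a fresh mirror clone:

# run BFG against a fresh mirror clone so the working copy is untouched
git clone --mirror https://github.com/BIOL548O/bernhardt_joey.git
java -jar bfg.jar --strip-blobs-bigger-than 50M bernhardt_joey.git
# then expire the old history inside the mirror and push the cleaned repo back
cd bernhardt_joey.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push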

aammd commented 8 years ago

oh no! this is my fault for suggesting that you bind all those large files together!

If I can guess at what happened: you combined all the files together (as I suggested :blush: :sob: ) and then you wrote that to a gigantic CSV, committed it, and tried to push it. GitHub rejected it, because it is JUST TOO BIG.

This is OK. Here are some options:

git reset --hard

Reader beware: git reset --hard is strong magic. It will destroy everything in your local directory (equivalent to deleting the whole project and cloning your git repo anew). Use with caution; you have been warned.
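In this situation, one way to use it (a sketch, assuming the remote is named origin and that nothing since the last successful push needs to be kept) is to reset the branch back to whatever is currently on GitHub:

# throw away the un-pushed commit(s) and all local changes,
# putting master back to the state currently on GitHub
git fetch origin
git reset --hard origin/master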

http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
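Both links come down to rewriting the un-pushed history so that the commit which added the giant file disappears. A minimal sketch of the interactive-rebase route (the number of commits to step back over is a placeholder):

# open the last few commits in an editor; change "pick" to "drop"
# next to the commit that added the oversized csv, then save and quit
git rebase -i HEAD~3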

aammd commented 8 years ago

For our class purposes:

@JoeyBernhardt, may I suggest that you process some subset of the data for the benefit of your peer reviewer? For example, you could keep only a small random sample of rows as a mini-dataset (see the sketch below).

This way, the peer reviewer can rerun your script and confirm the output without having to actually download the massive file. Then do one of the following (or something similar):
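One such option, sketched minimally with dplyr (the input path is the combined file from the error message above; the output name and the sample fraction are placeholders):

library(dplyr)

# read the full combined file, keep a small random subset of rows,
# and write that mini-dataset out for the peer reviewer to work with
read.csv("data/length-all.csv") %>%
    sample_frac(0.01) %>%   # or sample_n(1000) for a fixed number of rows
    write.csv("data/length-mini.csv", row.names = FALSE)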

JoeyBernhardt commented 8 years ago

@aammd you are the best!!! thank you so much for your quick replies and all the help, I really appreciate it!

I tried git reset --hard and that didn't work...I don't know why.

So then I tried git checkout using a commit from this morning and then git add/pull/push, and somehow that worked!

JoeyBernhardt commented 8 years ago

Hi @aammd,

Thanks for the tips re: making this manageable for a peer-reviewer. I'll try the mini-dataset approach that you describe using dplyr::sample_n or dplyr::sample_frac.

JoeyBernhardt commented 8 years ago

also, @aammd, it totally wasn't your fault...I had made the giant csv before your (super helpful) comments, using something like this, which I modified from something I found on the internet:

library(dplyr)

# get the list of all csv files under the project directory
# (searching once from "." avoids listing the same file from every subdirectory)
csv_files <- dir(".", pattern = "csv$", recursive = TRUE, full.names = TRUE)

csv_files %>%
    # read each file into a data frame, keep the particle ID and length
    # columns (columns 1 and 17), and record which file each row came from
    lapply(FUN = function(p) read.csv(p) %>% select(1, 17) %>% mutate(file_name = p)) %>%
    # bind all the data frames into a single data frame
    bind_rows() %>%
    # write everything into a single (very large) csv file
    write.csv("/Users/Joey/Documents/courses/biol548/bernhardt-joey/data/length-all.csv")