BIOL548O / Discussion

A repository for course discussion in BIOL548O

removing big data files out of your repo history #15


JoeyBernhardt commented 8 years ago

Hi @BIOL548O/2016_students and @aammd,

I've run into some problems with version controlling some large data files (i.e., >100 MB) that I created as part of our homework! When I tried to push my changes, I got some error messages that look like this:

remote: warning: File data/length-all.csv is 58.19 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 0bcaa1f8c80a10cdb30da6ca9f0ca43f
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File data/all.csv is 179.71 MB; this exceeds GitHub's file size limit of 100.00 MB
To https://github.com/BIOL548O/bernhardt_joey.git
 ! [remote rejected] master -> master (pre-receive hook declined)

So basically, I need to remove these files from my repo.

I tried following these instructions...with not much luck.

Does anybody have any experience using this BFG Repo-Cleaner? It was recommended on the GitHub help pages. Before I head down that path, I'm wondering if anyone has run into similar issues and has any advice for me! Thanks all!
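For anyone who does go the BFG route, the usual workflow (a rough sketch; it assumes the BFG jar has been downloaded locally, and the 50M threshold is just an example) is to run it against a fresh mirror clone:

# run BFG against a fresh mirror clone so the working copy is untouched
git clone --mirror https://github.com/BIOL548O/bernhardt_joey.git
java -jar bfg.jar --strip-blobs-bigger-than 50M bernhardt_joey.git
# then expire the old history inside the mirror and push the cleaned repo back
cd bernhardt_joey.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push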

aammd commented 8 years ago

oh no! this is my fault for suggesting that you bind all those large files together!

If I can guess at what happened: you combined all the files together (as I suggested :blush: :sob: ) and then you wrote that to a gigantic CSV, committed it, and tried to push it. GitHub rejected it, because it is JUST TOO BIG.

This is OK. Here are some options:

git reset --hard

Reader beware: git reset --hard is strong magic. It will destroy everything in your local directory (equivalent to deleting the whole project and cloning your git repo anew). Use with caution; you have been warned.
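In this situation, one way to use it (a sketch, assuming the remote is named origin and that nothing since the last successful push needs to be kept) is to reset the branch back to whatever is currently on GitHub:

# throw away the un-pushed commit(s) and all local changes,
# putting master back to the state currently on GitHub
git fetch origin
git reset --hard origin/master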

http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
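Both links come down to rewriting the un-pushed history so that the commit which added the giant file disappears. A minimal sketch of the interactive-rebase route (the number of commits to step back over is a placeholder):

# open the last few commits in an editor; change "pick" to "drop"
# next to the commit that added the oversized csv, then save and quit
git rebase -i HEAD~3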

aammd commented 8 years ago

For our class purposes:

@JoeyBernhardt, may I suggest that you process some subset of the data for the benefit of your peer reviewer? For example, you could keep only a small random sample of rows as a mini-dataset (see the sketch below).

This way, the peer reviewer can rerun your script and confirm the output without having to actually download the massive file. Then do one of the following (or something similar):
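One such option, sketched minimally with dplyr (the input path is the combined file from the error message above; the output name and the sample fraction are placeholders):

library(dplyr)

# read the full combined file, keep a small random subset of rows,
# and write that mini-dataset out for the peer reviewer to work with
read.csv("data/length-all.csv") %>%
    sample_frac(0.01) %>%   # or sample_n(1000) for a fixed number of rows
    write.csv("data/length-mini.csv", row.names = FALSE)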

JoeyBernhardt commented 8 years ago

@aammd you are the best!!! thank you so much for your quick replies and all the help, I really appreciate it!

I tried git reset --hard and that didn't work...I don't know why.

So then I tried git checkout using a commit from this morning and then git add/pull/push, and somehow that worked!

JoeyBernhardt commented 8 years ago

Hi @aammd,

Thanks for the tips re: making this manageable for a peer-reviewer. I'll try the mini-dataset approach that you describe using dplyr::sample_n or dplyr::sample_frac.

JoeyBernhardt commented 8 years ago

also, @aammd, it totally wasn't your fault...I had made the giant csv before your (super helpful) comments, using something like this, which I modified from something I found on the internet:

library(dplyr)

# get the list of all csv files under the project directory
# (searching once from "." avoids listing the same file from every subdirectory)
csv_files <- dir(".", pattern = "csv$", recursive = TRUE, full.names = TRUE)

csv_files %>%
    # read each file into a data frame, keep the particle ID and length
    # columns (columns 1 and 17), and record which file each row came from
    lapply(FUN = function(p) read.csv(p) %>% select(1, 17) %>% mutate(file_name = p)) %>%
    # bind all the data frames into a single data frame
    bind_rows() %>%
    # write everything into a single (very large) csv file
    write.csv("/Users/Joey/Documents/courses/biol548/bernhardt-joey/data/length-all.csv")