IPS-LMU / The-EMU-SDMS-Manual

The EMU-SDMS Manual
https://ips-lmu.github.io/The-EMU-SDMS-Manual/
Creative Commons Zero v1.0 Universal

Git for emuDB vs R project #5

Closed trblslp closed 5 years ago

trblslp commented 5 years ago

I have followed the recipe for making a git repo out of my emuDB, which is nice, but I'm now a bit unsure what to do if I also want my R project to have version control. Would you suggest having two separate (i.e. sister) folders, one with the emuDB and the other with the RStudio project and R scripts, each set up as an independent git repo? It'd be great if the manual mentioned this. Thanks!

MJochim commented 5 years ago

@raphywink and I have not always been on the same page when it comes to handling emu databases in git repositories, but I think by now we have converged (to some degree). He'll surely correct me if I'm wrong.

My suggestion is to keep analysis code and data in separate git repos.

The reason is that almost everybody will advise against putting binary data (in the linguistics case: audio/EMA/ultrasound recordings and derived signals such as formant tracks) in a git repo in the first place. This is generally considered bad practice, because it's hard for Git to be efficient when these files are changed (as opposed to source code, where Git is very efficient; this is also what Git was designed to do: track changes in source code).

However, I don't really see a problem in the linguistics case, mainly because our binary data hardly ever change. It's therefore not a big deal if changes to them are somewhat inefficient.

But one problem will arise if you store these binary data in the same repo as your analysis code: it gets hard to download only the code. If you want to download the code (which should be very fast since it's usually no more than a few kilobytes), you are basically forced to download all the raw data as well, which is usually slow because it's on the order of gigabytes.
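The separate-repos suggestion above can be sketched as two sibling folders, each with its own `.git` directory. This is only an illustration under assumptions: the folder names `myproject_emuDB` and `myproject_analysis` are hypothetical, and a temporary directory stands in for wherever the project actually lives.

```shell
set -eu

# Hypothetical parent folder; a temp dir is used so the sketch is safe to run.
work="$(mktemp -d)"

# Two sister folders: one for the emuDB (data), one for the R project (code).
mkdir -p "$work/myproject_emuDB" "$work/myproject_analysis"

# Each folder becomes its own independent repository.
git init --quiet "$work/myproject_emuDB"
git init --quiet "$work/myproject_analysis"

# Each repo reports its own top-level directory, so commits never mix.
data_top="$(git -C "$work/myproject_emuDB" rev-parse --show-toplevel)"
code_top="$(git -C "$work/myproject_analysis" rev-parse --show-toplevel)"
echo "data repo: $data_top"
echo "code repo: $code_top"
```

With this layout, an R script in the analysis folder would presumably refer to the database by a relative path such as `../myproject_emuDB`, and cloning the code repo never pulls in the recordings.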

@raphywink Maybe we can add some of this to the manual.

raphywink commented 5 years ago

@MJochim totally agree! If you want both in the same folder locally, I'd add the emuDB as a Git submodule: https://git-scm.com/book/en/v2/Git-Tools-Submodules
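A minimal sketch of the submodule idea, under assumptions: the emuDB repo would normally sit on a server, but here a local throwaway repo named `myproject_emuDB` (a hypothetical name) stands in for it so the example can run anywhere. The `protocol.file.allow=always` setting is only needed because of that local stand-in; recent Git versions refuse to clone submodules from local paths by default.

```shell
set -eu

work="$(mktemp -d)"

# Stand-in for the emuDB repo that would normally live on a remote server.
git init --quiet "$work/myproject_emuDB"
git -C "$work/myproject_emuDB" \
    -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet --allow-empty -m "Initial emuDB commit"

# The analysis (code) repo that should contain the emuDB as a submodule.
git init --quiet "$work/analysis"

# Register the emuDB repo as a submodule in the subfolder myproject_emuDB.
# protocol.file.allow=always is required only for the local-path stand-in.
git -C "$work/analysis" -c protocol.file.allow=always \
    submodule --quiet add "$work/myproject_emuDB" myproject_emuDB

# The parent repo now records where the submodule comes from.
cat "$work/analysis/.gitmodules"
```

The parent repo stores only a pointer to one specific commit of the emuDB, so the data history stays in the emuDB repo itself.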

raphywink commented 5 years ago

Oh just btw: the tutorial explains how to set up git-lfs, which reduces (if not solves) the problem of changing binary files.

trblslp commented 5 years ago

Thanks @MJochim and @raphywink. The submodule idea looks interesting, but I'm not sure what the consequences would be for the average emuDB user, so I think it'd be worth having some pointers on this in the manual. For example, if I have the emuDB as a submodule, how will I then have to navigate my changes and commits in, say, RStudio? If I plan to make some risky changes to the database structure and make a new branch for this, how much of a headache will it give me? ... At the moment I have separate git/git-lfs repos for data and code, and I think it's probably the most painless way to do it. But a bit of extra hand-holding in the manual wouldn't hurt.

raphywink commented 5 years ago

hmmm yes and no. In my opinion, Git and EMU are two different beasts and what the manual is trying to show the user is how to get Git versioning up and running and not how to actually use Git. Using submodules alone is a pretty advanced topic and you can do a thousand things with them. However, I def. don't want to open up the can of worms of trying to explain them to the user (plenty of that elsewhere on the web). That being said, I'll def. go through the tutorial and see if I can't add something about "us recommending keeping data + analysis scripts sep."

MJochim commented 5 years ago

> But a bit of extra hand-holding in the manual wouldn't hurt.

I applaud your wording ;-).

Personally I have never used submodules for this kind of work, but I think it can be a good idea. What I usually do is, I add a raw-data directory in my code repo and add a .gitignore file inside it with this content:

# Ignore everything in this directory
*
# Except this file
!.gitignore

This way I can have the raw data in a subdir of my code and not have to worry about accidentally committing it. If I want to check out an earlier commit of my raw data, I use the command line instead of RStudio.
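The `.gitignore` approach described above can be tried out in a scratch repo. This is a sketch under assumptions: the emuDB folder name `myproject_emuDB` and the file `placeholder.wav` are hypothetical stand-ins for real data.

```shell
set -eu

# Scratch code repo in a temp dir so nothing real is touched.
repo="$(mktemp -d)"
git init --quiet "$repo"

# raw-data directory with the .gitignore content quoted above.
mkdir -p "$repo/raw-data"
printf '%s\n' \
    '# Ignore everything in this directory' \
    '*' \
    '# Except this file' \
    '!.gitignore' > "$repo/raw-data/.gitignore"

# Hypothetical stand-in for an emuDB with one recording.
mkdir -p "$repo/raw-data/myproject_emuDB"
: > "$repo/raw-data/myproject_emuDB/placeholder.wav"

# Stage everything; only the .gitignore file itself should end up staged.
git -C "$repo" add .
staged="$(git -C "$repo" diff --cached --name-only)"
echo "$staged"
```

Because the code repo ignores everything below `raw-data`, the emuDB placed there can be its own Git repo, and earlier versions of the data can be checked out from the command line inside that folder without affecting the code repo.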

MJochim commented 5 years ago

> hmmm yes and no. In my opinion, Git and EMU are two different beasts

I definitely agree with you as far as explaining advanced git topics in the emu manual is concerned. However, I have somewhat diverged from the opinion (but this is getting off-topic) that proper separation of concerns requires that emuR and git be thoroughly disjunct code-wise. After all, the EMU-SDMS is supposed to be a database management system and as such should strongly integrate a concept as important as data versioning.

trblslp commented 5 years ago

> hmmm yes and no. In my opinion, Git and EMU are two different beasts and what the manual is trying to show the user is how to get Git versioning up and running and not how to actually use Git. Using submodules alone is a pretty advanced topic and you can do a thousand things with them. However, I def. don't want to open up the can of worms of trying to explain them to the user (plenty of that elsewhere on the web). That being said, I'll def. go through the tutorial and see if I can't add something about "us recommending keeping data + analysis scripts sep."

Sorry, I should clarify. I definitely wouldn't want you to explain git, more just give the reader some points for consideration, alert them to potential undesirable consequences, suggest best practice and refer to some further reading, much as you say here. Using git in this context is, I think, still quite a bit unlike using it for more text-based content. The reason I brought the issue up was that I was following the EMU manual's guide on setting up a git repo, and then thought it would really be a good idea for me to establish a bit of hygiene in my RStudio project and set it up as a git repo too (following this guide). But since my RStudio project folder was the parent of my emuDB, I thought some kind of horrible recursive versioning nightmare might ensue. I still don't know if that would necessarily occur! 😱
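What happens in the nested case can be checked in a scratch setup. The sketch below assumes a hypothetical project folder containing an emuDB folder `myproject_emuDB` that is already its own repo with at least one commit. When the inner folder is added to the outer repo, Git does not track its files recursively; it warns about an embedded repository and records a single pointer entry (mode `160000`) to the inner repo's current commit.

```shell
set -eu

# Outer repo: stand-in for the RStudio project folder.
parent="$(mktemp -d)"
git init --quiet "$parent"

# Inner repo: stand-in for an emuDB that is already under version control.
git init --quiet "$parent/myproject_emuDB"
git -C "$parent/myproject_emuDB" \
    -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet --allow-empty -m "Initial emuDB commit"

# Adding the inner folder prints a warning about an embedded repository.
git -C "$parent" add myproject_emuDB

# The index holds one entry for the folder; mode 160000 marks a commit pointer.
entry="$(git -C "$parent" ls-files --stage)"
echo "$entry"
```

So the outer repo would not duplicate the data history, but it also would not contain the data files themselves, which is one reason to either ignore the inner folder or register it properly as a submodule.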

trblslp commented 5 years ago

and I should say that I think that a paragraph or so on these points would probably manage to be comprehensive enough, not a whole page or anything...

raphywink commented 5 years ago

> concerns requires that emuR and Git be thoroughly disjunct code-wise

like you hinted at: sort of off topic regarding this issue. Somehow including automatic Git versioning in emuR is a whole different discussion (of which we have had many ;-))

raphywink commented 5 years ago

@trblslp I'll def. look into whether we can give some helpful pointers. I just want to avoid explaining / answering Git-related questions, as Git is super flexible and powerful.

raphywink commented 5 years ago

you guys think that'll do?

trblslp commented 5 years ago

Yep looks good!

raphywink commented 5 years ago

@MJochim if you disagree simply reopen

MJochim commented 5 years ago

nope, that's good, thanks!