bu-cnso / git-introduction

A brief introduction to git and GitHub

Introduction Introduction, and Why Use Git/GitHub? #10

Open asoplata opened 8 years ago

asoplata commented 8 years ago

(I'm assuming github_model.md is going to constitute the main presentation? Because I can't see any other non-README files)

Terminology / More basic basics

The terminology of Git is notoriously opaque (as are the man pages), and people may want clarification on terms, on what is actually going on, on why you are doing X, and on how X factors into the greater mental model of a Git tree, all from the perspective of someone who's never seen Git before.

This Intro to Git for Scientists web presentation has lots of pretty pictures illustrating some of the basics (though it goes into great detail that's potentially skippable). Until much later in the presentation, this is less "for scientists" and more "intro to git".

John McDonnell's Git for Scientists: A Tutorial has lots of pretty pictures, most of which may have been taken from www.git-scm.com , illustrating the underlying workings of Git, but also the essential use process.

Depending on time and audience-expertise level, more terminology explanation may take up too much time.

Why Use Git/GitHub?

I was made uncomfortable by my inability to convince Yohan that it's useful for scientists. Very uncomfortable (although part of that is that it sounds like he'd ALREADY CODED his own snapshot/commit-saving per paper, to keep a reproducible and accurate script available in backup in order to reproduce analysis/figures. I doubt most scientists have that, and the easiest way to get there is Git IMHO).

Here is a good stackoverflow Q&A about using it in scientific programming and another (though the file-efficiency answer doesn't apply to Git).

Some of the reasons I can come up with:

  1. Backup: It acts as a backup system for the ENTIRE history of the repo/project (that you've committed, of course). Dropbox, for example, is only good for 30 days of backups by default (or up to a year if you pay extra for an extension). So if you accidentally set an analysis = true that hosed all the analysis you've performed for the past 2 months of the experiment, Git would save you (assuming you're committing at reasonable intervals as you develop code).
    • sub-point: You can even programmatically investigate past code changes with the awesomely named git blame command!
    • sub-point: Reverting to a previous state is trivial. Instead of going to a web interface, scrolling through your files, scrolling through your times, and tearing out your hair in frustration because you don't know which files are from before and after your changes, you can write as much as you want in descriptive commit messages, use them to find the commit hash (e.g. er5342...), and just git checkout er5342 and bam - all the code is now just as it was. Furthermore, if you're not sure where in the history you want to go but you know which file, this stackoverflow thread goes over many different ways to do this, all of which come out-of-the-box from git.
  2. Command power: All the git commands are multifaceted and very powerful - the stackoverflow thread I mentioned earlier is a good example of this power, and is also representative of how there is TONS of documentation written online for newbies, or people who aren't experts at Git. Another good example is git diff -- how different are these commits from each other? How different is this version of this file in this commit vs. another commit?
  3. Universal identification of code: Each commit, no matter the branch, can be uniquely identified in your repo across the history of the project by its commit hash. This is important for reproducible science, because the only thing you need to communicate the EXACT code that you ran is your Git repo (which is probably online at GitHub, so you don't need to email a .zip) and the commit hash like "af52344...". If everything you used to generate the figures for a paper is in that commit, then, generally speaking, people can rerun your code and reproduce your results! This matters because you may want to iterate on / further develop the code you used in a paper in order to do more science. That's great, and because the paper version is committed, you don't have to worry about further changes writing over the actual version of the code you used for the paper. Without that commit, unless you'd made other backups by hand, you would have lost the original code run for the paper, forever! With Git, saving that useful commit version is just as trivial as any other commit, but you can ALSO continue on developing/changing that same code without worrying about losing that version.
  4. Merging, Collaboration, and File Conflicts: Say you and your collaborator are writing code for the same project (an equivalent example would be collaborating on a paper inside a Git repo!). You both come in, develop some code, and push your commits to the lab server -- but both of you have made different changes to the same file in each of your commits! This conundrum is (IIRC) the main reason Version Control Systems like Git exist: they come with several different tools for handling "merge conflicts" etc. precisely for this situation, so you can use Git by itself to accept/reject/merge competing changes in the version history of the files however you want. Notice how I said accept/reject changes? You know like in a Microsoft Word .docx when it's trying to...track the version history of your manuscript ;) ? Well, since it's Git, anyone with access to, say, your lab server's (or GitHub's) copy of the repo can then make changes to said manuscript, and because all commits and all lines of code/text have their authorship noted, you can trivially see who wrote what, or who wrote what suggestions!
  5. Branching: Say you're not done writing some analysis code, but you just thought up a great idea for a new simulation capability that you want to add. Problem is, this new simulation thing is going to cause you to make a lot of changes in the core code. One possible answer is to make a new "branch" for this new feature, so that you can keep developing the simulation feature, which involves changing a lot of code, but wait until it's fully completed before merging it back into the "master" version of your code. This way, you can alternate between working on the analysis code branch (all Git code is on some branch) when you want, and the separate simulation code branch, but because they're on separate branches, the two sets of changes won't see each other / interfere until you WANT them to, i.e. by merging one of the branches into some other branch like your "master" branch. Another use of this is, if you're working on code but need someone else's input before proceeding or are stuck, you can commit to save your place in a branch, and then immediately switch to another branch to keep working on a different section of the code.
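
For anyone who wants to try the reasons above hands-on, here is a rough sketch of how they might look at the command line. Everything specific in it is made up for illustration: the file names (analysis.cfg, simulation.cfg), the threshold values, the "simulation" branch name, and the demo user identity. It runs in a throwaway temp directory, and it is only one way of doing this, not the recommended workflow.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real project is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"  # hypothetical identity

# Commit 1: pretend this is the exact code used for the paper.
echo "threshold = 0.5" > analysis.cfg
git add analysis.cfg
git commit -q -m "Analysis settings used for paper figures"
paper_commit=$(git rev-parse HEAD)          # the hash you would cite
main_branch=$(git symbolic-ref --short HEAD) # "master" or "main", depending on config

# Commit 2: a later change that turns out to be a mistake.
echo "threshold = 5.0" > analysis.cfg
git commit -q -am "Change threshold"

# Reasons 1-2: inspect history, line authorship, and differences between commits.
git --no-pager log --oneline
git --no-pager blame analysis.cfg
git --no-pager diff "$paper_commit" HEAD -- analysis.cfg

# Reasons 1 and 3: bring back one file exactly as it was in the paper commit.
git checkout "$paper_commit" -- analysis.cfg
git commit -q -m "Restore threshold from paper commit"
restored=$(cat analysis.cfg)

# Reason 5: develop a new feature on its own branch.
git checkout -q -b simulation
echo "steps = 1000" > simulation.cfg
git add simulation.cfg
git commit -q -m "Start simulation feature"

# Back on the main branch, the simulation work is not visible yet.
git checkout -q "$main_branch"
if [ -e simulation.cfg ]; then seen_before_merge=yes; else seen_before_merge=no; fi

# Merge only when you WANT the two sets of changes to meet.
git merge -q --no-edit simulation
```

Note that the revert step uses git checkout <hash> -- <file>, which restores a single file, rather than checking out the whole commit; that choice just keeps the demo on its branch so the later branching steps work. The merge at the end is conflict-free because only one branch touched each file, so it does not show the merge-conflict tools from reason 4.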

I can't think of any more "main" reasons yet. If people want me to proselytize on this issue at the CNSO thing then I can do that, although I'm not sure what I would do about visual aids for these.

effigies commented 8 years ago

(I'm assuming github_model.md is going to constitute the main presentation? Because I can't see any other non-README files)

Other PRs for other bits. Still in progress.

Haven't read the rest. Thanks for the comments.

effigies commented 8 years ago

@asoplata Any comments on #8? It's a kind of informal discussion chunk, so working your points into it may be appropriate. Otherwise, feel free to add a PR with a new lesson topic. We may not (almost certainly won't) get to it tonight, but it would be nice if this repository becomes a resource for more than just the length of the presentation.