Bisaloo commented 2 years ago

The goal of this discussion is to settle on a set of common practices around branches & merges in git/GitHub.

New changes in branches

All changes should be done in branches and submitted by pull requests. Even in simple cases, where a review doesn't seem strictly necessary, using pull requests helps the rest of the team to notice the change and to stay up-to-date on the codebase.

- Should we enforce this by protecting the main branch?

At the same time, each pull request should focus on a single feature. Pull requests modifying multiple aspects of the code or fixing multiple unrelated bugs are usually more difficult and thus longer to review, which doesn't align well with our agile development method.

Pull requests should target the main branch. We do not believe it is necessary to have an intermediate develop branch. Because we follow an agile strategy, the stable version on the package corresponds to the latest CRAN release and the development version is the GitHub version.

However, we discourage using branches targeting other branches as this can quickly escalate in difficult to manage conflicts. If you need to add or propose changes to a non-main branch, you should either use the commit suggestions feature from GitHub. In more complex cases, you should agree with the person who opened the pull request to push directly to their branches.

Merge mechanism

GitHub currently has 3 pull request merge mechanisms:

"Create a merge commit"
"Squash and merge"
"Rebase and merge"

The merge commit option is the only option that doesn't rewrite history. As such, it is the preferred option when multiple people commit directly to the same branch without prior concertation.

However, merge commits present other drawbacks, such as creating a heavily non-linear history, which is extremely difficult to read without a client displaying the various branches over time. Notably, it is very difficult to browse & understand a git history with merge commits in GitHub web interface. For related reasons, merge commits can create flat out unintelligible diffs in pull requests in some cases.

The preference of squash vs simple rebase depends on the quality of the commits in the branches to merge. If the commits are grouped logically, with clear commit messages, they can be rebased & merged directly. If not, then a squash & merge should be preferred.

- Should we remove the options we choose to not use from the list of options offered by GitHub?

- Should we encourage contributors to clean & reorganise their git commit history before merging their branches? This blog post is a detailed explanation of why this can be beneficial but it is important to note that this is not also straightforward as reorganising commits can easily create conflicts if not done with care.

It is also important to note that our choice here also impacts how we should resolve diverging histories locally because the various methods are partially incompatible. For example, it is not possible to rebase & merge a branch that already contains a merge commit. Therefore, if we wish to use "Rebase & merge", we must ensure that collaborators do not create merge commits in their branches.

Avoiding merge commit when pulling

One common source of unintended merge commits is running git pull to fetch remote commits when you already have unpushed local commits. A simple way to solve this is to run git pull --rebase instead of just git pull. This setting can also be adjusted globally with the following command:

git config --global pull.rebase true

Future option: merge queues

GitHub has released a 4th merge mechanism (currently in private beta): merge queues.

This is meant to address one major issue that agile projects encounter. If your project has multiple concurrent pull requests, as soon as you merge one of them, checks in the others become out of date because they ran against an outdated version of the main branch. Merge queues solve this issue by creating a queue with all the pull requests you want to merge and running checks against a temporary branch containing all the changes in the queue.

This is likely the mechanism we will want to use in the future but it is currently in beta and detailed documentation is lacking.

TimTaylor commented 2 years ago

FWIW - some thoughts from me. A mini workshop on this is probably worthwhile for anyone working regularly on the code. I still find git hard and have only just got in to a work flow I feel happy with on individual projects, let alone collaborative ones. On specific questions:

Should we enforce this by protecting the main branch?

I think yes - mainly as I think it will make it easier to encourage the behaviours you desire from the beginning.

Should we remove the options we choose to not use from the list of options offered by GitHub?

As above, but accept that a lot of hand-holding may be needed to begin with.

Should we encourage contributors to clean & reorganise their git commit history before merging their branches? This blog post is a detailed explanation of why this can be beneficial but it is important to note that this is not also straightforward as reorganising commits can easily create conflicts if not done with care.

I think you should encourage this as it definitely makes things a lot more easier for the project as a whole if not the individual. Again, lots of on-boarding and hand-holding to begin with but think it's worthwhile in the long run.

pratikunterwegs commented 2 years ago

Som thoughts:

Should we enforce this by protecting the main branch?

Yes - I think we should assume that people will want to try our packages from the get go, and cloning/installing from GH will get the master branch, which therefore needs to be production grade.

At the same time, each pull request should focus on a single feature.

This is a little more challenging. I - and I think other people - tend to work on a single feature, say an Rcpp implementation of finalsize, but when I hit a snag that needs external input, I will shift to something else, which might well be another feature (in my case, a translation of the Cpp code to R for benchmarking). Other examples will have more divergence between the bottleneck feature and the track-2 feature. My instinct is to optimise my time, rather than to have the commit history be clean - but I'm happy to shift priorities and accept that bottlenecks do happen.

Merge commits/squash/rebase

I would say the squash or rebase options make more sense. I typically rebase when I'm supporting a lead developer, taking their work as canonical and mine as having to conform to their (upstream) choices. I'd say squash and merge is probably better if we do atomic commits (as I do), and you don't want the commit history clogged with quite small changes, if that makes sense.

Should we remove the options we choose to not use from the list of options offered by GitHub?

I don't mind this.

GitHub has released a 4th merge mechanism (currently in private beta): merge queues.

This would be ideal.

pratikunterwegs commented 2 years ago

As a follow up:

I’m a little wary of the advice here https://github.blog/2022-06-30-write-better-commits-build-better-projects/ because I’m not sure we want to present our development process in a way that it didn’t actually happen, sort of prettifying it as it were. It might actually also be a major overhead if we do atomic commits rather than entire files or features all at once

Bisaloo commented 2 years ago

Thanks both for your comments! :+1:

@pratikunterwegs, I understand your concerns and I appreciate using one PR per feature & reorganising commits takes time, which means we might produce slightly less code. However:

I've found it is incredibly valuable (or even, indispensable) for the long-term sustainability of the projects to try to follow these guidelines. As time passes, and new developers join, knowledge is lost and having a clear commit history is an invaluable resource to understand what has been tried, and why the code is how is it now. In older projects, I spend a large amount of time looking at git blame / the commit history to understand things. Some of this information should be stored as comments but it doesn't replace entirely clear commit messages. As you stated, this could slightly reduce the speed at which we produce code, but in the long run, I'm absolutely convinced it's worth it and it will actually save us a lot of time and headaches. Unclear history (admittedly coupled with badly commented code) is the source of the old joke "never touch legacy code", which is definitely something we don't want. Even in the short-term, single-issue PRs and simple logical commits are much easier to review.
As @TimTaylor suggested, and as for every new habit, it may be somewhat annoying at the beginning but I don't find it much of an inconvenience once you get used to it.

Bisaloo commented 2 years ago

I do think you bring up a great point by saying we should not "present our development process in a way it didn't actually happen" though.

I don't not think it's what we would be doing by reordering / cleaning up commits. It would be an entire part of our development process, useful to our stated goals, and not a sneaky attempt at making things look better than they actually do.

I do agree however that we should make clear to potential contributors that this is the result of a conscious effort and that we do not immediately come up with perfect commits, nor should they be expected to.

pratikunterwegs commented 2 years ago

I don't mind either of these policies - I'm yet to see a working example in our context as we don't have a lot of people working on different repos just yet, so it might become clear over time.

thibautjombart commented 2 years ago

Thanks all for this conversation - very instructive and interesting! :)

Probably stating the obvious, but all the above strategies come with a price: overhead for the dev, risks of conflicts, etc. We do want to maximize long-term project sustainability, but we also want to adopt a workflow which everyone feels comfortable with, and minimize overheads as much as possible. So there may not be any perfect solution, and one's take may depend on where one sits on the trade-off of complexity vs benefits for sustainability. I would tend towards implementing minimal viable strategies, and evolving organically from there.

On the points raised so far, my 2-cents

Should we enforce this by protecting the main branch?

Yes. The only drawback I can see there is automation using GH actions to push to main requires privileges but there are workarounds.

Should we remove the options we choose to not use from the list of options offered by GitHub?

I think yes, otherwise it feels like setting up traps for our dev team.

GitHub has released a 4th merge mechanism (currently in private beta): merge queues.

Definitely sounds like a feature we'll want to use. Any idea when this becomes stable?

Merge commits/squash/rebase

It sounds like squashing when merging into main would be a simple way to clean up history. I like seeing the details of the commits when reviewing code, but once it's made it to main, I find more streamlined history easier to read.

Should we encourage contributors to clean & reorganise their git commit history before merging their branches?

I would tend to echo @pratikunterwegs and think rewriting the chronology of code creation may be misleading, though I can see the appeal. For now I would think this might be more overhead for the devs than is needed, but again, this is a personal take. I just have not seen situations where not doing this led to headaches (but admittedly I have not contributed much to large, heavily collaborative projects). If we go down the route of squash-merge, it seems this would only useful before merging, i.e. for code reviews, right?

git versus github

As part of our guidelines, we probably want to be explicit about what events are expected to take place on the github website/client, or locally with git. For instance, one could merge a branch on main locally and push to the remote, which would bypass other processes and close associated PRs. Similarly, one could squash commits before pushing to remote for a PR, which is probably not what we want?

pratikunterwegs commented 2 years ago

git versus github

As part of our guidelines, we probably want to be explicit about what events are expected to take place on the github website/client, or locally with git. For instance, one could merge a branch on main locally and push to the remote, which would bypass other processes and close associated PRs. Similarly, one could squash commits before pushing to remote for a PR, which is probably not what we want?

Thanks for bringing this up @thibautjombart. I would say anything that moves from main to features or similar (rebasing on upstream, say) could be fine when done locally; the reverse - going from feature branches to main locally is a no-no, because it skips all the safeguards as you mention.

Bisaloo commented 2 years ago

Yes. The only drawback I can see there is automation using GH actions to push to main requires privileges but there are workarounds.

What specific workflows do you have in mind? Things as the render-readme workflow? If we agree to do all changes in branches and set the workflow to run in PR targeting main, then the workflow doesn't have to push directly on main and we should be fine.

Definitely sounds like a feature we'll want to use. Any idea when this becomes stable?

No idea. I'll try to investigate.

Merge commits/squash/rebase

It sounds like squashing when merging into main would be a simple way to clean up history. I like seeing the details of the commits when reviewing code, but once it's made it to main, I find more streamlined history easier to read.

Should we encourage contributors to clean & reorganise their git commit history before merging their branches?

I would tend to echo @pratikunterwegs and think rewriting the chronology of code creation may be misleading, though I can see the appeal. For now I would think this might be more overhead for the devs than is needed, but again, this is a personal take. I just have not seen situations where not doing this led to headaches (but admittedly I have not contributed much to large, heavily collaborative projects). If we go down the route of squash-merge, it seems this would only useful before merging, i.e. for code reviews, right?

Yes, if we stick to PRs focused on a single task/feature, then squash & merge is roughly equivalent to reorganising commits (just all commit are reorganised into one). I would be okay with this. The only possible caveat is that I'm not sure what will happen if/once we move to merge groups. I don't know if it will be possible to set it up so that commits are squashed before merging. As I mentioned, the documentation is not very clear yet.

thibautjombart commented 2 years ago

What specific workflows do you have in mind? Things are the render-readme workflow? If we agree to do all changes in branches and set the workflow to run in PR targeting main, then the workflow doesn't have to push directly on main and we should be fine.

Yup. My current workaround is used here and uses a PAT to push to main. Indeed triggering the job to run on PRs would remove the need to push to main. There are two use-cases where this might remain useful:

need for a manual trigger of the job if it failed randomly (seems to happen every now and then)
(more controversial) trivial but urgent fixes pushed to main directly; it may be something we want to discourage entirely, but I have resorted to this in the past in urgent situations: devs on leave, package broken on day 1 of a course where it is needed

Bisaloo commented 2 years ago

need for a manual trigger of the job if it failed randomly (seems to happen every now and then)

If the workflow fails in the PR, we will probably want to restart it from there anyways so it should not be an issue in practice.

(more controversial) trivial but urgent fixes pushed to main directly; it may be something we want to discourage entirely, but I have resorted to this in the past in urgent situations: devs on leave, package broken on day 1 of a course where it is needed

I think that's a fair concern and I have definitely done this in the past but since this project involves many collaborators, it might be good to try and move away from this. As I mentioned in the initial comment, PR are not only useful for reviews, but also to inform the collaborators that a change has been made (even if you merge without waiting for a review because you are in a hurry).

thibautjombart commented 2 years ago

All fair points. I think protecting main has mostly advantages, and the need for GH actions to push to main is probably going to be very rare, if it happens at all. Let's go with it then.

Bisaloo commented 2 years ago

I'm seeing 'Q4 2022 – Oct-Dec' for the merge queues / merge groups on the GitHub public roadmap but it's not clear to me if this is the date for the public beta release (it's currently in limited public beta) or the official release.

thibautjombart commented 2 years ago

Maybe the public beta release would be worth a try in any case? We could use it on a couple of active projects and see how it goes.

pratikunterwegs commented 2 years ago

Unrelated to above: what should we do with merged feature branches? Delete or keep? I would say delete.

Bisaloo commented 2 years ago

Yes, my vote is for "delete" as well. I always set it up to auto-delete in the projects I maintain and would recommend we do the same here.

thibautjombart commented 2 years ago

+1 to delete

thibautjombart commented 2 years ago

Mergin / rebasing locally

We probably also want to discuss our policy on updating branches locally. E.g. if 'dev' is out of date with main, do we want to merge or rebase onto main? I usually do the former, but feel rebasing may be the better option to get clean linear history.

Reviewing PRs

Drawing lessons from this PR, it felt like merging could have happened faster (but curious to hear your thoughts). It underwent multiple rounds of reviews, and while the suggested changes were in general useful (e.g. improving tests), they were not directly linked to the initial purpose of the PR (creating the base structure of the package). I have the impression the following might help in the future:

making minimal PRs, which address a specific need; distinguish essential features from bells and whistles which can be deferred to a later contribution
when reviewing: (learning from my own mistakes here) rather than make comments we leave as 'optional revisions', adopt the following rules:
- if the change is needed to achieve the goal of the PR, request changes
- if the current code breaks the package, request changes
- anything else: raise issues for later implementation, approve the PR, making a note of the new issues raised
What do you think?

Bisaloo commented 2 years ago

We probably also want to discuss our policy on updating branches locally. E.g. if 'dev' is out of date with main, do we want to merge or rebase onto main? I usually do the former, but feel rebasing may be the better option to get clean linear history.

A very important point:

If we use a certain method to merge PRs, we have to use the same method to update our local branches. And vice versa. Using git merge in local branches and merging PR by rebase/squash will create entangled and confusing git histories.

So, the choice we make here to merge PRs will also impact the daily workflow of our devs with their local branches (it should not be an issue for punctual contributors but only for more regular contributors/devs).

TimTaylor commented 2 years ago

I think going the rebase route would be best.

epiverse-trace / blueprints

Policy around git branching & merging process #6

New changes in branches

Merge mechanism

Avoiding merge commit when pulling

Future option: merge queues

On the points raised so far, my 2-cents

git versus github

Mergin / rebasing locally

Reviewing PRs