UMPsychMethodsCore / MethodsCore

All of the projects that the methods core develops, combined into one repository!

How many git repositories do we need? #31

Closed dankessler closed 12 years ago

dankessler commented 12 years ago

This is an issue I was discussing with @rcwelsh in email, but I wanted others to be able to see it and chime in. Also, one of my email responses got long and ugly, and it's easier to compose it in GitHub Flavored Markdown to make it prettier.

Here's an email from @rcwelsh:

Hi Daniel,

This seems like it (ed: storing all projects in a single repository) will make it messy. Should projects only be dependent upon stable releases of other projects? Else they are really just the same project? I'm not sure I like the idea of a monolithic repository.

Using the SOM as an example:

A new project should be created for building a frontend to SOM, thus it is not part of SOM. I'll label that as frontendSOM. Let's say it's developed using the stable version of SOM called V2.2.0 (which was the latest I sent). The policy would be that any changes I make to V2.2.0 would only be to the internals and not to the API. If I decide to make a new API then I would create V3.0.0, and thus development of frontendSOM could continue knowing that the API is guaranteed.

Also, does this then make rolling back a little complicated, if we want to be able to roll just one project back?

Example timeline with three subtree projects: A, B, and C.

1) modifications to project A and commit
2) modification to Project B and commit
3) modification to Project C and commit
4) modification to Project B and commit.

Rolling back to commit in 2) would then roll back C?

dankessler commented 12 years ago

I think the issues that you have raised are good to consider, but there are ways to deal with them in a single repository model. Moreover, there are some added advantages of the single repository model.

Addressing issues raised in @rcwelsh email:

Rollback

In your scenario above, once history is written, it's generally not wise to rewrite it unless there's a really good reason. Instead, the better practice is to introduce new history (commits) that undoes something.

So, taking your example

1) modifications to project A and commit
2) modification to Project B and commit
3) modification to Project C and commit
4) modification to Project B and commit.

If our goal was to roll back to the state of B in (2), we'd add another commit

5) Rollback of Project B to state in (2).

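This is pretty easy in git. Here's some pseudocode, where <commit-2> stands for the hash of commit (2):

# Remove the current contents of SubtreeB from the working tree
rm -r SubtreeB
# Restore SubtreeB exactly as it was in commit (2)
git checkout <commit-2> -- SubtreeB
# Stage everything under SubtreeB, including files deleted by the rollback
git add -A SubtreeB
git commit -m "Roll back project B to state in commit (2)"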
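
The rollback can be tried end to end in a throwaway repository. This is a minimal sketch, not our actual layout: the directory names (SubtreeA, SubtreeB, SubtreeC), file names, and commit messages are made up for illustration, and it uses `git rm -r` so that files added after commit (2) are removed and staged in one step.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity for the demo
git config user.email "dev@example.com"
mkdir SubtreeA SubtreeB SubtreeC

# Commits (1)-(4) from the example timeline
echo a1 > SubtreeA/a.txt; git add -A; git commit -q -m "(1) modify A"
echo b1 > SubtreeB/b.txt; git add -A; git commit -q -m "(2) modify B"
commit2=$(git rev-parse HEAD)             # remember commit (2)
echo c1 > SubtreeC/c.txt; git add -A; git commit -q -m "(3) modify C"
echo b2 > SubtreeB/b.txt; echo extra > SubtreeB/new.txt
git add -A; git commit -q -m "(4) modify B"

# Commit (5): restore only SubtreeB to its state in (2)
git rm -r -q SubtreeB                     # drop current B from index and working tree
git checkout "$commit2" -- SubtreeB       # bring back B as of commit (2)
git commit -q -m "(5) Roll back project B to state in commit (2)"

cat SubtreeB/b.txt                        # content of B from commit (2)
cat SubtreeC/c.txt                        # C is untouched by the rollback
git log --format=%s                       # history now has five commits
```

Nothing is rewritten: commits (3) and (4) stay in the history, and project C keeps the change from (3).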

Viewing History for Individual Projects

If, for development's sake, you want to see just the history of your particular project, most of the log utilities (including gitk) can be passed arguments to only show commits that touch a given path, e.g. gitk --all -- som/ will give you a gitk visualization that only includes those commits that touched files in the som directory.
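
The same path limiter works on the command line with `git log`. A minimal sketch in a throwaway repository, assuming two made-up project directories (`som` and `other`) and invented commit messages:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity for the demo
git config user.email "dev@example.com"
mkdir som other

echo 1 > som/s.txt;   git add -A; git commit -q -m "som: first change"
echo 1 > other/o.txt; git add -A; git commit -q -m "other: unrelated change"
echo 2 > som/s.txt;   git add -A; git commit -q -m "som: second change"

# Everything after "--" is treated as a path; only commits touching som/ are listed
som_log=$(git log --format=%s -- som/)
echo "$som_log"
```

The commit that only touched `other/` does not appear in the output.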

If you prefer git log to gitk, it accepts the same path argument (e.g. git log -- som/). Unfortunately, because our projects were initially developed in their own repositories, their early histories have all their files in the repository root, and this path-limited approach will only let us visualize the history where they were properly living in a subtree. I merged the old histories in this way because it's the "standard" way (see subtree merges). However, I recently came up with an alternative way that can merge in an independent project, but make it look as if it was properly developed in a subtree the whole time on top of an existing repository. I wrote up my method for doing this in #26. If you want to see better continuity in SOM or other projects, we can retroactively follow a similar method, but it will require us to all do some coordinated conspiratorial history rewriting.
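
For reference, here is a sketch of the "standard" subtree merge recipe and the history gap it leaves. It is not the alternative method from #26. The repository names (`som_standalone`, `core`), the remote name (`som_remote`), and the file contents are hypothetical, and `--allow-unrelated-histories` assumes git 2.9 or later.

```shell
set -e
work=$(mktemp -d)

# A standalone project whose files live in its repository root
git init -q "$work/som_standalone"
cd "$work/som_standalone"
git config user.name "Example Dev"
git config user.email "dev@example.com"
echo "som code" > som.m
git add -A
git commit -q -m "SOM: early standalone history"
som_branch=$(git symbolic-ref --short HEAD)   # default branch name varies by setup

# The combined repository
git init -q "$work/core"
cd "$work/core"
git config user.name "Example Dev"
git config user.email "dev@example.com"
echo "combined repo" > README
git add -A
git commit -q -m "core: initial commit"

# Subtree merge: record the merge, then read the project's tree in under som/
git remote add som_remote "$work/som_standalone"
git fetch -q som_remote
git merge -q -s ours --no-commit --allow-unrelated-histories "som_remote/$som_branch"
git read-tree --prefix=som/ -u "som_remote/$som_branch"
git commit -q -m "Subtree-merge SOM into som/"

git log --format=%s            # full history, including the standalone commit
git log --format=%s -- som/    # path-limited view: only the merge commit
```

The standalone commit is part of the combined history, but the path-limited view skips it because at that point the files lived in the repository root rather than under `som/`.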

Cost of setting up repos

Moreover, although the cost of creating a quick git repository is very low, in a managed model with a centrally hosted github repository, user forks, and local user clones of forks, each additional repository adds substantial overhead in terms of issue monitoring, fetching, etc.

Development Models

Obviously, having multiple people developing a set of interconnected tools is complicated, so it makes sense for us to have some policies in place to limit collisions. As the code base grows I actually think this will be less of an issue as we'll have more niches to carve out and work on, but in the beginning when we're setting basic standards that impact all of our projects, there are more inter-dependencies.

API Framework: Independent Projects that Can Talk to Each Other

For those new to this stuff, API = application programming interface. The idea is that if you have an existing tool A that works just fine internally, and you want to build a new tool B that will use some features of A, the developer of A builds an API, or a standard set of ways to ask A to do things. The SPM job manager is a kind of an API in a sense, in that you can build up an SPM struct object however you want, then pass it to the job manager and it'll handle things from there.

For example, @rcwelsh would write spm8Batch. It already includes an "API" which is restricted to bash calls. Other programs could be written in any language, so long as they could call bash. They would figure out exactly what they want to do beforehand, then specify a series of bash commands to be executed, which would hand over processing to spm8Batch.

This is a useful framework in general, and especially when we are interfacing with external tools for which we don't control the development history (e.g. FSL), we'll need to adopt logic similar to what you propose, and just write our frontends as separate projects.

This approach seems easiest when we're working with tools that are already well established and will mostly be growing internally.

As @rcwelsh proposed in his email, different versions of our projects could be "tagged" using literal git tags or some other sort of policy implementation, so that we have solid expectations of how the API will look (what options we can supply and what they will do), but the developer is free to make internal improvements to how their project works without breaking any wrappers. From time to time, internal features will require updates to the API or changes to the expected behavior. When this happens, any program that calls the API will need to be examined and possibly have portions rewritten to comply with the new API. We can figure out a way to do this sort of tagging and policy implementation in either a single repository or multi-repository model, so I think the issues are orthogonal.
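
As one possible way to do this inside a single repository, literal git tags could carry a per-project prefix. This is only a sketch: the `som-` prefix, the file names, and the rule "the API lives in som/api.txt" are assumptions made up for the example, not an agreed policy.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"       # placeholder identity for the demo
git config user.email "dev@example.com"
mkdir som

echo "api v2" > som/api.txt
git add -A
git commit -q -m "som: stable API"
# Annotated tag marking the frozen API; the prefix keeps projects apart
git tag -a som-V2.2.0 -m "SOM V2.2.0: API frozen"

echo "internal fix" > som/internal.txt
git add -A
git commit -q -m "som: internal change only"

# A wrapper author can check that the API file is unchanged since the tag
if git diff --quiet som-V2.2.0 HEAD -- som/api.txt; then
    echo "API unchanged since som-V2.2.0"
fi

# Describe the current state relative to the latest som-* tag
git describe --tags --match 'som-*'
```

A frontend developed against `som-V2.2.0` can keep working across internal commits, and an API change would get a new tag such as `som-V3.0.0`.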

However, since we do have control over the development trajectory of these projects, sometimes wrappers might make apparent the need for changes in the project itself. I envision that over time many of our tools will have some convergent trends as they use a consistent library of error reporting utilities, etc, and roll-out is simplified.

"Ecosystem" Framework: Tightly Connected Interdependent Projects

Some of the things we develop, especially those that are simpler or more fledgling, grow a bit more organically and are not easily thought of as belonging to a self-contained project. Indeed, attempting to classify things this way becomes tricky. Over time, projects may split, merge, or spin-off miniprojects. We don't want to have to bother with defining a full project environment with an API each time we do something like this.

Moreover, many of these projects ought to make use of common libraries (like error handling and logging) so it seems silly to split them up as separate projects. Having them in one repository makes it easier to manage their shared code. They also may not be big enough to really be worth writing an API, and since in many cases they are principally front ends that construct and call APIs, it's unlikely that we'll write another API around an API, unless there's a good motivation based on end-user need.

Miniprojects may be especially dependent on the rest of the environment, and a single repository that tracks an all-encompassing snapshot of all the MethodsCore projects lets us tightly specify not just which version of our Miniproject is released to the public, but the exact status of every other project, which makes testing much easier.

One vs Many Repositories

I don't see any reason we can't stick with one repository. We certainly have some decisions to make in terms of development philosophy, and this may vary from project to project, but certain approaches seem to work in either a single or multi repo environment, whereas others are best served by a single repo environment. Based on that, I think it makes sense to stick to a single repository, but I'm open to hearing thoughts from others.

dankessler commented 12 years ago

@rcwelsh We can talk more about this at our next meeting, but does my reply at least address the concerns you raised in your email?

rcwelsh commented 12 years ago

I'm still consuming this....actually I've yet as I have been coding and pushing data through -- 797 coregistrations yesterday, and 285 vbm8 warps of anatomy wrapping up... :-D

but yes, we can talk at meeting, which i'll send out an email about.

dankessler commented 12 years ago

Cheers to the productivity and looking forward to more conversation :)

On Wed, Mar 14, 2012 at 10:51 AM, Robert <reply@reply.github.com> wrote:

I'm still consuming this....actually I've yet as I have been coding and pushing data through -- 797 coregistrations yesterday, and 285 vbm8 warps of anatomy wrapping up... :-D

but yes, we can talk at meeting, which i'll send out an email about.



dankessler commented 12 years ago

I'm going to close this issue since there's been no activity on it for almost a month.