buildinspace / peru

a generic package manager, for including other people's code in your projects
MIT License

Add the ability to git clone a module. #131

Closed rawrgulmuffins closed 8 years ago

rawrgulmuffins commented 9 years ago

I love that I can build a YAML file like this:

imports:
    test_module: test_module/

git module test_module:
    url: git@github.com:user/test_module.git

and run peru sync to grab this module (and the many other submodules) I'd like to use for this project. More and more, I find that I want to break projects into separate repositories with their own revision history.

I would love to have the option for peru to git clone the test_module into the target directory. This means I can add the target directory to my .gitignore, work on the super project while using the sub project (test_module in this case) and I have the option to work on the sub project if the super project needs a feature added / changed.

Basically, this boils down to wanting to decouple my sub projects from my super projects. Just because my super project depends on all of its sub projects does not mean those sub projects need to be pulled into the super project's source tree (git subtree) or tied to a single commit of each sub project (git submodules).

In my mind I love what sync does but I'd like to change how it initializes things.
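
A minimal sketch of the workflow described above, assuming plain git outside of peru is acceptable. The helper name clone_ignored is hypothetical and is not something peru provides; it clones a repository into a target directory if it is missing and lists that directory in the super project's .gitignore.

```shell
#!/bin/sh
# clone_ignored URL DIR (hypothetical helper, not part of peru):
# clone URL into DIR if DIR is not already a clone, and make sure
# "DIR/" appears exactly once in the .gitignore of the current directory.
clone_ignored() {
  url="$1"
  dir="$2"
  if [ ! -d "$dir/.git" ]; then
    git clone -q "$url" "$dir" || return 1
  fi
  # Append the ignore entry only if an identical line is not present yet.
  if ! grep -qxF "$dir/" .gitignore 2>/dev/null; then
    printf '%s\n' "$dir/" >> .gitignore
  fi
}
```

Run from the super project root, something like `clone_ignored git@github.com:user/test_module.git test_module` would leave a full clone in test_module/ that the super project ignores, so the sub project keeps its own history and branches.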

oconnor663 commented 9 years ago

This is a common feature request for peru, but it's unlikely to fit in with the way peru is designed. See the discussion at https://github.com/buildinspace/peru/issues/127.

Believe it or not, I tend to be a pretty strong advocate for grouping related projects into a single repo. I'd like to understand better why that approach doesn't work for you, if you have time.

If you just want to clone the master branch of a bunch of git repos inside your project, you might be able to get 80% of what you need with a short bash script. Something like:

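#!/bin/bash

# update URL DIR: clone URL into DIR if DIR is missing, then fast-forward it.
update() {
  if [ ! -d "$2" ] ; then
    git clone --recursive "$1" "$2"
  fi
  # --ff-only refuses to create a merge commit if local history has diverged.
  git -C "$2" pull --ff-only
}

update https://github.com/foo/bar mybar
update https://github.com/foo/baz mybaz
update https://github.com/foo/bam mybam

rawrgulmuffins commented 9 years ago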

Ah, sorry to beat a dead horse. Bad cursory glance is bad.

I can move this into the other thread if that's more productive.

So here's what I'm looking for. Potentially Peru isn't the right tool for this but I think it might be. The other tool I'm considering is Ansible but the concept of being able to keep development dependencies in sync with a single command is very appealing.

I have several teams (currently 7) that are all independently working on several different projects. These teams all have a lot of crossover in what they're doing, but not enough to consider consolidating them. Often one team will use a component in a different part of their technology stack, or they'll use it to push / pull data for very different reasons.

Two example use cases: we have teams that read cluster logs looking for failures on customer clusters, and a team that looks for failures on development builds of clusters. These use cases are very similar, but we found that a single repository became incredibly bloated.

Most teams just siloed themselves rather than dealing with the hassle of having to have all common cause in a single repository. This led to a lot of redundant work (calculated in developer-years). After the above teams split ways and developed everything to suit their own needs, the split led to tens of thousands of redundant lines of code. It was a hot mess.

Now we have two parser projects, two translation layer projects, several different provisioning projects, shell clients, analytics layers, database inserters, alert systems, etc.

Some projects just overlap on one component. Some projects use almost all of them. Almost all of these projects are under active development. That's really the key behind my ask: the fact that everything is actively worked on.

Git submodules and subtrees don't work very well with our use case.

Submodules turn super project logs into:

d61f574 Update submodule to increment git describe
6b2ea1f updating submodule
eb04ed4 Updating submodules after checking out the right branch in each folder
59b93da Updating submodule pointers
d648aaf Updating pointers for submodules.
16db200 Updating submodule references in master project.
37a1ea0 Add missing submodules
3d4f244 Update packaging submodule
83c2d9c Update packaging submodules.
612da3b Remove old submodules
da9ed6a Add Debian/Ubuntu packaging submodules.
bc07065 Use absolute URL for submodules
42c0c17 Add submodule for RPM packages

Submodules are fundamentally broken when you have to change the commit you're targeting weekly, if not daily. This leads to a nightmare when you're trying to track down what changes happened and when.

Subtrees have a different problem. Merge conflicts kill us, and often we want to work in a development project, in a subproject, to see how that change affects the super project. More often than not, our development will be happening in branches. It's also a common case that we have to pull another team's super project if our change affects them in a breaking way.

Potentially git just doesn't support this workflow and I need to look into different revision control systems.

oconnor663 commented 9 years ago

I just saw your original tweet by the way, apologies for being terrible at keeping track of Twitter.

I picked up most of my VCS habits from Facebook, and in that school of thought yours would be a clear use case for a One Repo to Rule Them All.

Most teams just siloed themselves rather than dealing with the hassle of having to have all common cause in a single repository.

I'm curious to hear a little more about the hassles here. Are you talking more about the bloat/hassle of trying to make one project satisfy all of your company's different needs? I think you can make good use of the Monster Repo without going that far, by letting mostly-unrelated projects live in their own subdirectories and just sharing common dependencies as appropriate. Separate Makefiles and Visual Studio solutions and all that stuff, but one shared commit history.

In the Monster, your git logs contain actual commits. Also, changes across projects can be atomic, and they don't automatically become unresolvable merge conflicts. That's huge.

The potential downside (depending on who you ask) is that changes in a dependency project immediately affect all the callers. This can make it harder to work on your libraries without breaking everyone else. However, if you're updating dependencies every day, you already have this problem. Unifying the repos will actually make your life easier, because developers will have a chance at telling what callers they broke immediately, instead of waiting for the calling projects to sync and finding out the next day. You no longer have to do anything special to test changes in the caller and the library at the same time.

feliksik commented 9 years ago

@oconnor663 I have the same considerations as the topic starter, and I appreciate the time you put into explaining this. My concern with a big repo is indeed, as you mention, "that changes in a dependency project immediately affect all the callers". Therefore it feels neat to be able to specify the dependencies cleanly in a separate file, specifying the sha1 in test/prod, but allowing a branch to be specified while developing.

In the referenced StackExchange thread you mention the option is thus "One big repo or many small repos with tooling" -- if peru is the tooling and peru.yaml is nicely checked into git, then I think the issue of a unique definition of the entire tree is technically addressed, right? (Assuming you use sha1 refs).
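
To make the sha1 idea concrete, here is a rough sketch that writes a peru.yaml pinning the module from the original post to one commit through the git module's rev field. The commit hash is a placeholder and the temporary directory is only for illustration; neither comes from this thread.

```shell
#!/bin/sh
# PINNED_REV is a placeholder hash for illustration, not a real commit.
PINNED_REV="0123456789abcdef0123456789abcdef01234567"

# Write the file into a scratch directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cat > "$workdir/peru.yaml" <<EOF
imports:
    test_module: test_module/

git module test_module:
    url: git@github.com:user/test_module.git
    # Pin to an exact commit so every sync yields the same tree.
    rev: $PINNED_REV
EOF

cat "$workdir/peru.yaml"
```

With the file checked into the super project, the pinned hash travels with each super project commit. Pointing rev at a branch name during development would trade that reproducibility for convenience.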

So do I correctly understand that

Are these the actual reasons, are some of these less relevant, or do you have other reasons? Thanks for your thoughts!

oconnor663 commented 9 years ago

@feliksik it seems like there are a couple questions here, but I might be missing something, so let me know.

Why wasn't peru designed to create real clones of your dependencies? The main problem peru sets out to solve is fetching relatively stable 3rd-party code. For that use case, there's certainly nothing wrong with creating real clones, but it turns out that there are more important features that are incompatible with doing that. See https://github.com/buildinspace/peru/issues/127 for the gory details.

Why prefer a large repo with many projects, over smaller repos with tooling? The tools that try to join small repos together run into problems when the repos change frequently, and these problems are difficult to solve:

In contrast, here are the problems you run into in a big unified repo, which I think have good solutions:

feliksik commented 9 years ago

Thanks for your clarification @oconnor663 !

rawrgulmuffins commented 9 years ago

Hello @oconnor663, thanks for taking the time to explain things for me. There are a couple of other problems with large projects that I don't think currently have good solutions.

So now I'm looking at different revision control systems or different tooling solutions for git.

oconnor663 commented 9 years ago

This is really interesting for me to think about. Thanks for writing it up. Here are some random thoughts:

gibsjose commented 8 years ago

@rawrgulmuffins I am very curious as to what you ended up doing. We have a remarkably similar problem in terms of wanting separate repos, and we would also like to be able to make changes both in a repo cloned by peru and in a full copy of the repo located somewhere else.

@oconnor663 Would you consider adding this functionality on a separate branch? I work at NASA on a pretty large mission and we are having the same struggle as Alex regarding how we represent a complex project with both nearly-static vendor dependencies as well as very dynamic in-house dependencies.

I'd actually like to chat with you about our (current/desired) build system from perhaps even a higher level, since it seems like you have a lot of experience with this. Feel free to shoot me an email at joseph.gibson@nasa.gov.

oconnor663 commented 8 years ago

Unfortunately, hacking peru to lay out full repos would amount to a complete rewrite. The main reason is that peru merges all of the modules you import into a single git tree, and most of peru's filesystem operations boil down to git status and git checkout on that tree. Although you could include an entire hg repository in a git tree in theory, git trees are forbidden from representing the .git folder of another git repo, for security reasons. See https://github.com/buildinspace/peru/issues/127 for more.
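
The restriction on .git paths can be seen with plain git, independent of peru. This sketch only illustrates git's own path check and makes no claim about how peru calls git internally; the scratch repository and the file names are made up for the example.

```shell
#!/bin/sh
# Scratch repository in a temporary directory (illustration only).
demo="$(mktemp -d)"
git init -q "$demo"

# Store a small blob so there is an object to point index entries at.
blob="$(printf 'example\n' | git -C "$demo" hash-object -w --stdin)"

# An ordinary nested path is accepted into the index.
ordinary_status=0
git -C "$demo" update-index --add --cacheinfo "100644,$blob,vendor/file.txt" \
  || ordinary_status=$?

# A path containing a .git component is refused, so a tree built from this
# index could never carry another repository's .git directory.
dotgit_status=0
git -C "$demo" update-index --add --cacheinfo "100644,$blob,vendor/.git/config" \
  2>/dev/null || dotgit_status=$?

echo "ordinary path exit status: $ordinary_status"
echo ".git path exit status: $dotgit_status"
```

The first status should be zero and the second non-zero, which is the behavior that rules out storing full clones inside a merged tree.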