Add the ability to git clone a module.

I love that I can build a yaml file like this

imports:
    test_module: test_module/

git module test_module:
    url: git@github.com:user/test_module.git

and run peru sync to grab this module (and the many other submodules) I'd like to use for this project. More and more I find that I want to break projects into separate repositories with their own revision history.

I would love to have the option for peru to git clone the test_module into the target directory. This means I can add the target directory to my .gitignore, work on the super project while using the sub project (test_module in this case) and I have the option to work on the sub project if the super project needs a feature added / changed.

Basically this boils down to wanting to decouple my sub projects from my super projects. Just because my super project needs to depend on all of its sub projects does not mean those sub projects need to be pulled into the super project's source tree (git subtree) or tied to a single commit of each sub project (git submodules).

In my mind I love what sync does but I'd like to change how it initializes things.

This is a common feature request for peru, but it's unlikely to fit in with the way peru is designed. See the discussion at https://github.com/buildinspace/peru/issues/127.

Believe it or not, I tend to be a pretty strong advocate for grouping related projects into a single repo. I'd like to understand better why that approach doesn't work for you, if you have time.

If you just want to clone the master branch of a bunch of git repos inside your project, you might be able to get 80% of what you need with a short bash script. Something like:

#! /bin/bash

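# Clone the repo ($1) into the directory ($2) on first use, then fast-forward it.
update() {
  if [ ! -d "$2" ] ; then
    git clone --recursive "$1" "$2"
  fi
  # --ff-only makes the pull fail instead of merging if local history has diverged.
  git -C "$2" pull --ff-only
}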

update https://github.com/foo/bar mybar
update https://github.com/foo/baz mybaz
update https://github.com/foo/bam mybam

Ah, sorry to beat a dead horse. Bad cursory glance is bad.

I can move this into the other thread if that's more productive.

So here's what I'm looking for. Potentially Peru isn't the right tool for this but I think it might be. The other tool I'm considering is Ansible but the concept of being able to keep development dependencies in sync with a single command is very appealing.

I have several teams (currently 7) that are all independently working on several different projects. These teams all have a lot of cross over in what they're doing but not enough to consider consolidating them. Often one team will use a component in a different part of their technology stack or they'll use it to push / pull data for very different reasons.

Two example use cases: we have teams that read cluster logs looking for failures on customer clusters, and a team that's looking for failures on development builds of clusters. These use cases are very similar, but we found a single repository for both to be incredibly bloated.

Most teams just siloed themselves rather than dealing with the hassle of having to have all common code in a single repository. This led to a lot of redundant work (calculated in developer-years). After the above teams split ways and developed everything to suit their needs, this split led to tens of thousands of redundant lines of code. It was a hot mess.

Now we have two parser projects, two translation layer projects, several different provisioning projects, shell clients, analytics layers, database inserters, alert systems, etc.

Some projects just overlap on one component. Some projects use almost all of them. Almost all of these projects are under active development. That's really the key behind my ask, the fact that everything is worked on.

Git submodules and subtrees don't work very well for our use case.

Submodules turn super project logs into:

d61f574 Update submodule to increment git describe
6b2ea1f updating submodule
eb04ed4 Updating submodules after checking out the right branch in each folder
59b93da Updating submodule pointers
d648aaf Updating pointers for submodules.
16db200 Updating submodule references in master project.
37a1ea0 Add missing submodules
3d4f244 Update packaging submodule
83c2d9c Update packaging submodules.
612da3b Remove old submodules
da9ed6a Add Debian/Ubuntu packaging submodules.
bc07065 Use absolute URL for submodules
42c0c17 Add submodule for RPM packages

Submodules are fundamentally broken when you have to change the commit you're targeting weekly, if not daily. This leads to a nightmare when you're trying to track down what changes happened and when.

Subtrees have a different problem. Merge conflicts kill us, and often we want to work in a development project, in a subproject, to see how that change affects the super project. More often than not our development will be happening in branches. It's also a common case that we have to pull another team's super project if our change affects them in a breaking way.

Potentially git just doesn't support this workflow and I need to look into different revision control systems.

I just saw your original tweet by the way, apologies for being terrible at keeping track of Twitter.

I picked up most of my VCS habits from Facebook, and in that school of thought yours would be a clear use case for a One Repo to Rule Them All.

"Most teams just siloed themselves rather than dealing with the hassle of having to have all common code in a single repository."

I'm curious to hear a little more about the hassles here. Are you talking more about the bloat/hassle of trying to make one project satisfy all of your company's different needs? I think you can make good use of the Monster Repo without going that far, by letting mostly-unrelated projects live in their own subdirectories and just sharing common dependencies as appropriate. Separate Makefiles and Visual Studio solutions and all that stuff, but one shared commit history.

In the Monster, your git logs contain actual commits. Also changes across projects can be atomic, and they don't automatically become unresolvable merge conflicts. That's huge.

The potential downside (depending on who you ask) is that changes in a dependency project immediately affect all the callers. This can make it harder to work on your libraries without breaking everyone else. However, if you're updating dependencies every day, you already have this problem. Unifying the repos will actually make your life easier, because developers will have a chance at telling what callers they broke immediately, instead of waiting for the calling projects to sync and finding out the next day. You no longer have to do anything special to test changes in the caller and the library at the same time.

@oconnor663 I have the same considerations as the topic starter, and I appreciate the time you put into explaining this. My concern with a big repo is indeed, as you mention, "that changes in a dependency project immediately affect all the callers". Therefore it feels neat to be able to specify the dependencies cleanly in a separate file, specifying the sha1 in test/prod, but allow specifying a branch while developing.
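
A rough sketch of that "sha1 in test/prod, branch while developing" idea, in the same style as the bash script earlier in the thread. This is not a peru feature; the pick_ref helper, the mode names, and the example SHA and branch name are all hypothetical.

```shell
#!/bin/bash

# Hypothetical helper, not part of peru: choose the git ref for a dependency.
# $1 = mode ("dev" or anything else), $2 = pinned commit SHA, $3 = dev branch
pick_ref() {
  if [ "$1" = "dev" ]; then
    # While developing, follow a moving branch.
    printf '%s\n' "$3"
  else
    # In test/prod, use the exact commit so the tree is reproducible.
    printf '%s\n' "$2"
  fi
}

# Example with a made-up SHA and branch name:
pick_ref prod 0123abc feature-x   # prints 0123abc
pick_ref dev 0123abc feature-x    # prints feature-x
```

The chosen ref could then be handed to git checkout inside the dependency's clone; how the mode gets set (environment variable, separate config file) is left open here.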

In the referenced StackExchange thread you mention the option is thus "One big repo or many small repos with tooling" -- if peru is the tooling and peru.yaml is nicely checked into git, then I think the issue of a unique definition of the entire tree is technically addressed, right? (Assuming you use sha1 refs).

So do I correctly understand that

with volatile repos, the reason for the Monster Repo is that (in practice) the discipline of updating the parent's dependencies and pushing both parent and dependency updates will be too flaky and lead to mistakes
peru is designed to work with relatively static, external dependencies
by not allowing the dependency checkout to be a git repo in itself, you enforce the updates in the dependency to be completed in the upstream repo (create a separate checkout of the dependency while working on it, commit+push changes, modify the peru.yaml of the parent appropriately, sync, and then use the new features).

Are these the actual reasons, are some of these less relevant, or do you have other reasons? Thanks for your thoughts!

@feliksik it seems like there are a couple questions here, but I might be missing something, so let me know.

Why wasn't peru designed to create real clones of your dependencies? The main problem peru sets out to solve is fetching relatively stable 3rd-party code. For that use case, there's certainly nothing wrong with creating real clones, but it turns out that there are more important features that are incompatible with doing that. See https://github.com/buildinspace/peru/issues/127 for the gory details.

Why prefer a large repo with many projects, over smaller repos with tooling? The tools that try to join small repos together run into problems when the repos change frequently, and these problems are difficult to solve:

Your commit history becomes difficult to read, because it's hard to tell what changed in a given dependency update.
git bisect in the parent repo can't tell you which dependency commit caused a bug.
Merge conflicts between two dependency updates can't be automatically resolved.

In contrast, here are the problems you run into in a big unified repo, which I think have good solutions:

The repo gets too big, and git operations get slow.
- On modern hardware, git supports repos with hundreds of thousands of files. If your repo is as large as Google's or Facebook's, you can use some of the Mercurial infrastructure that Facebook has open sourced to scale up even more.
It's hard to change dependencies without breaking callers.
- This problem happens no matter what, and the difference in a unified repo is that you find out about it more quickly. That's a good thing. You also gain the ability to make atomic changes in multiple callers and dependencies at the same time. If you find that breaks are happening too frequently, the best way to solve that is with automated testing that runs on every commit.
- Though that's not entirely fair. One thing you can do with split repos is let every caller stay on a different dependency version and update at its own pace. If all of the callers are really completely unrelated to each other, that can work. But if you ever need to update everything at once (say you're changing your logging backend, and every project needs the updated logging library), this will bite you badly, because at update time you'll have to deal with all the bugs you put off.

Thanks for your clarification @oconnor663 !

Hello @oconnor663, thanks for taking the time to explain things for me. There's a couple of other problems with large projects that I don't think there are good solutions to currently.

It's very hard to build structural trust into a large repository

Let's take an extreme example. There are many operating systems that use Python for user space programs. If Python had grown out of the Arch Linux community and was a directory in the Arch source code, then 1) it would probably be more work to pull the Python code base out of Arch for use in other projects and 2) the Arch core contributors and the Python core contributors would have to have full write access to both projects.

Now this might not be a bad thing if both projects have full access. Potentially all individuals involved are awesome individuals and no harm will come out of this arrangement. But the only trust system in this arrangement is your (or other owners') judgement of individuals. Your human trust in them.

Human trust severely limits who you can add to a project. Most projects keep write privileges close to their chest because of the damage write privileges can cause.

An example of this is youtube.com. There's no way to add individual users or teams to a channel. There's just one user and one password per channel. So if multiple people work on a channel everyone gets everything all the time.

The lack of logical partitioning makes it much harder to onboard new developers

We're not at the size of Facebook or Google but we're also not small. As of the time of this writing I'm an owner of a project that contains ~95,000 lines of Python and C code that's had ~500,000 additions and ~400,000 deletions in its lifetime. We act as the lowest unit for several projects built on top of us. I don't know how many LOC it would produce to combine all projects but it would be very large.

Combining all of our projects would be a major barrier to entry. It's already a lot of code to wrap your head around and the control flow isn't always the best as is (all projects acquire cruft).

Large repositories disincentivize creating modular code

My team has been moving more and more towards a microservice approach to software. This can create some performance headaches down the road since it heavily incentivizes message passing, but the benefits (as far as we see them) of creating components with well defined interfaces that are easy to plug and play are large enough that they outweigh the cons. Creating a large repository means that, rather than just git cloning a different component, you actually have to make configuration or code changes to use a different component.

It's also just plain easier to deal with well defined interfaces than it is to modify configurations.

One large repository becomes ungainly when you have different deployment scenarios

This one is the real killer for us. We sell storage clusters to customers and they manage the entire system without us being able to gain access to their network. That's one of our big selling points, they own everything.

The super project that uses the base project I own runs on our internal servers for logs that customers have voluntarily sent to us for diagnostic purposes. But we have another project that uses log parsers, bundles them into an executable, and has our customer run them with a rules engine as a diagnostic health check in their network.

The method by which both of these projects are deployed is hugely different and causes a massive logical split between the two. But they both use the same data source, cluster logs.

What this did was cause both teams to create two sets of parsers with hugely overlapping work. When we looked at the cost to the company for time spent doubling our work it was a hefty penny that we want to avoid in the future.

So now I'm looking at different revision control systems or different tooling solutions for git.

This is really interesting for me to think about. Thanks for writing it up. Here are some random thoughts:

The Facebook solution to the privileges problem was "build your whole organization around everyone having commit privileges." So...that's not very helpful to anyone else. But I think one of the ways they do it is worth mentioning: Sometimes it's possible to automate away the role of the Project Expert. Big test suites can make sure nothing breaks, and custom linters can enforce pretty elaborate coding guidelines. Sometimes these systems do a better job than human reviewers. You could specify certain files or regexes that would trigger an email, so for example when I wrote some code that called HTML_literal_dangerous_dont_use_this(), the guy who knew why that function was terrible got looped in automatically.
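
As a rough illustration of the "certain files or regexes trigger an email" idea: the sketch below is not Facebook's tooling (that was custom), and the rule format, the owners_for name, and the example.com addresses are all made up. It only matches regexes against changed file paths and prints who should be looped in; matching on file contents and actually sending the mail are left out.

```shell
#!/bin/bash

# Hypothetical rule table: "<extended regex> <address to notify>", one per line.
RULES='^payments/ payments-owner@example.com
\.sql$ dba@example.com'

# Print the address of every rule whose regex matches the changed path ($1).
owners_for() {
  printf '%s\n' "$RULES" | while read -r pattern address; do
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
      printf '%s\n' "$address"
    fi
  done
}

# A path under payments/ that ends in .sql matches both rules.
owners_for payments/schema.sql
```

A commit hook or CI job could run something like this over each changed path and mail the de-duplicated list of addresses.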

That doesn't work by itself for open source projects, where there's potentially no trust, and you don't want some asshat to delete all your tests. But inside a company, where the problem is more about preventing mistakes, I think it can be a good way to let the set of committers get big. I don't have a sense of how good the tools for this sort of thing are though, since everything I'm familiar with was custom.
- I'm also wondering whether anyone's tried to do something clever with a git pre-receive hook, together with some kind of webapp for specifying who has commit access to which part of a repository?
Good point about forcing people to write modular code. I don't have any ideas for that one. Question about microservices: How do you version all the different components? Do different services deploy independently of each other, or does everything get updated all at once?
Is there any reason you can't have more than one deployment system working within a single repository? I'm imagining some repo with a JavaAPIServer directory and an AndroidApp directory. Both of them use the LogParser library from a third directory. Even though the first two use totally different build systems, they're both able to build with the code from the third (or maybe share a LogParser JAR built with some third system).
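
For the pre-receive hook idea above, the core check might look something like the sketch below. This is only a guess at how it could work, not an existing tool: the ACL table format, the may_write name, and the user names are hypothetical. A real hook would read "old new ref" lines on standard input, list the changed paths with something like git diff --name-only, and would need some server-specific way to learn who is pushing.

```shell
#!/bin/bash

# Hypothetical access table: "<user> <path prefix>", one per line.
ACL='alice parsers/
bob deploy/
alice docs/'

# Succeed if user ($1) may change path ($2), i.e. some ACL entry for that
# user has a prefix that the path starts with.
may_write() {
  printf '%s\n' "$ACL" | (
    while read -r user prefix; do
      case "$1:$2" in
        "$user:$prefix"*) exit 0 ;;
      esac
    done
    exit 1
  )
}

may_write alice parsers/log.py && echo allowed   # prints allowed
may_write bob parsers/log.py || echo denied      # prints denied
```

The hook would reject the push as soon as may_write fails for any changed path. The table itself is the part a webapp could manage.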

@rawrgulmuffins I am very curious as to what you ended up doing. We have a remarkably similar problem in terms of wanting separate repos, and we would also like to be able to make changes both in a repo cloned by peru and in a full copy of the repo located somewhere else.

@oconnor663 Would you consider adding this functionality on a separate branch? I work at NASA on a pretty large mission and we are having the same struggle as Alex regarding how we represent a complex project with both nearly-static vendor dependencies as well as very dynamic in-house dependencies.

I'd actually like to chat with you about our (current/desired) build system from perhaps even a higher level, since it seems like you have a lot of experience with this. Feel free to shoot me an email at joseph.gibson@nasa.gov.

Unfortunately, hacking peru to lay out full repos would amount to a complete rewrite. The main reason is that peru merges all of the modules you import into a single git tree, and most of peru's filesystem operations boil down to git status and git checkout on that tree. Although you could include an entire hg repository in a git tree in theory, git trees are forbidden from representing the .git folder of another git repo, for security reasons. See https://github.com/buildinspace/peru/issues/127 for more.

Add the ability to git clone a module. #131