Closed rawrgulmuffins closed 8 years ago
This is a common feature request for peru, but it's unlikely to fit in with the way peru is designed. See the discussion at https://github.com/buildinspace/peru/issues/127.
Believe it or not, I tend to be a pretty strong advocate for grouping related projects into a single repo. I'd like to understand better why that approach doesn't work for you, if you have time.
If you just want to clone the master branch of a bunch of git repos inside your project, you might be able to get 80% of what you need with a short bash script. Something like:
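#! /bin/bash
# Clone the repo at $1 into directory $2 if it isn't there yet,
# then fast-forward it to the latest upstream commit.
update() {
    if [ ! -d "$2" ] ; then
        git clone --recursive "$1" "$2"
    fi
    git -C "$2" pull --ff-only
}
update https://github.com/foo/bar mybar
update https://github.com/foo/baz mybaz
update https://github.com/foo/bam mybam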
Ah, sorry to beat a dead horse. Bad cursory glance is bad.
I can move this into the other thread if that's more productive.
So here's what I'm looking for. Potentially Peru isn't the right tool for this but I think it might be. The other tool I'm considering is Ansible but the concept of being able to keep development dependencies in sync with a single command is very appealing.
I have several teams (currently 7) that are all independently working on several different projects. These teams all have a lot of crossover in what they're doing but not enough to consider consolidating them. Often one team will use a component in a different part of their technology stack or they'll use it to push / pull data for very different reasons.
Two example use cases: we have teams that read cluster logs looking for failures on customer clusters and a team that's looking for failures on development builds of clusters. These use cases are very similar but we found a single repository to be incredibly bloated.
Most teams just siloed themselves rather than dealing with the hassle of having to have all common cause in a single repository. This led to a lot of redundant work (calculated in developer-years). After the above teams split ways and developed everything to suit their needs, the split led to tens of thousands of redundant lines of code. It was a hot mess.
Now we have two parser projects, two translation layer projects, several different provisioning projects, shell clients, analytics layers, database inserters, alert systems, etc.
Some projects just overlap on one component. Some projects use almost all of them. Almost all of these projects are under active development. That's really the key behind my ask, the fact that everything is worked on.
Git submodules and subtrees don't work very well with our use case.
Submodules turn super project logs into this:
d61f574 Update submodule to increment git describe
6b2ea1f updating submodule
eb04ed4 Updating submodules after checking out the right branch in each folder
59b93da Updating submodule pointers
d648aaf Updating pointers for submodules.
16db200 Updating submodule references in master project.
37a1ea0 Add missing submodules
3d4f244 Update packaging submodule
83c2d9c Update packaging submodules.
612da3b Remove old submodules
da9ed6a Add Debian/Ubuntu packaging submodules.
bc07065 Use absolute URL for submodules
42c0c17 Add submodule for RPM packages
Submodules are fundamentally broken when you have to change the commit you're targeting weekly, if not daily. This leads to a nightmare when you're trying to track down what changes happened and when.
Subtrees have a different problem. Merge conflicts kill us, and often we want to work in a development project, in a subproject, to see how that change affects the super project. More often than not our development will be happening in branches. It's also a common case that we have to pull another team's super project if our change affects them in a breaking way.
Potentially git just doesn't support this workflow and I need to look into different revision control systems.
I just saw your original tweet by the way, apologies for being terrible at keeping track of Twitter.
I picked up most of my VCS habits from Facebook, and in that school of thought yours would be a clear use case for a One Repo to Rule Them All.
> Most teams just siloed themselves rather than dealing with the hassle of having to have all common cause in a single repository.
I'm curious to hear a little more about the hassles here. Are you talking more about the bloat/hassle of trying to make one project satisfy all of your company's different needs? I think you can make good use of the Monster Repo without going that far, by letting mostly-unrelated projects live in their own subdirectories and just sharing common dependencies as appropriate. Separate Makefiles and Visual Studio solutions and all that stuff, but one shared commit history.
In the Monster, your git logs contain actual commits. Also changes across projects can be atomic, and they don't automatically become unresolvable merge conflicts. That's huge.
The potential downside (depending on who you ask) is that changes in a dependency project immediately affect all the callers. This can make it harder to work on your libraries without breaking everyone else. However, if you're updating dependencies every day, you already have this problem. Unifying the repos will actually make your life easier, because developers will have a chance at telling what callers they broke immediately, instead of waiting for the calling projects to sync and finding out the next day. You no longer have to do anything special to test changes in the caller and the library at the same time.
@oconnor663 I have the same considerations as the topic starter, and I appreciate the time you put in explaining this. My concern with a big repo is indeed, as you mention, "that changes in a dependency project immediately affect all the callers". Therefore it feels neat to be able to specify the dependencies cleanly in a separate file, specifying the sha1 in test/prod, but allow specifying a branch while developing.
In the referenced StackExchange thread you mention the option is thus "One big repo or many small repos with tooling" -- if peru is the tooling and peru.yaml is nicely checked into git, then I think the issue of a unique definition of the entire tree is technically addressed, right? (Assuming you use sha1 refs).
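To make the "sha1 in test/prod, branch while developing" idea concrete, here is a rough sketch written as a plain shell script rather than as peru configuration. The manifest format, the file name `deps.txt`, and the function name `sync_deps` are all assumptions made up for illustration; this is not how peru works internally.

```shell
#!/usr/bin/env bash
# Hypothetical manifest format, one dependency per line:
#   <url> <target-dir> <ref>
# <ref> is a full sha1 for test/prod, or a branch name while developing.

sync_deps() {
    local manifest="$1" url dir ref
    while read -r url dir ref; do
        # Skip blank lines and comment lines.
        case "$url" in ''|'#'*) continue ;; esac
        if [ ! -d "$dir/.git" ]; then
            git clone --quiet "$url" "$dir" || return 1
        fi
        git -C "$dir" fetch --quiet origin || return 1
        if git -C "$dir" show-ref --verify --quiet "refs/remotes/origin/$ref"; then
            # A branch: check it out and fast-forward to the remote tip.
            git -C "$dir" checkout --quiet "$ref" || return 1
            git -C "$dir" merge --quiet --ff-only "origin/$ref" || return 1
        else
            # Anything else (e.g. a sha1): detached checkout of that exact commit.
            git -C "$dir" checkout --quiet --detach "$ref" || return 1
        fi
    done < "$manifest"
}
```

With a checked-in `deps.txt`, running `sync_deps deps.txt` would give every developer the same tree as long as each line uses a sha1; swapping one line to a branch name lets that dependency float during development.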
So do I correctly understand that
Are these the actual reasons, are some of these less relevant, or do you have other reasons? Thanks for your thoughts!
@feliksik it seems like there are a couple questions here, but I might be missing something, so let me know.
Why wasn't peru designed to create real clones of your dependencies? The main problem peru sets out to solve is fetching relatively stable 3rd-party code. For that use case, there's certainly nothing wrong with creating real clones, but it turns out that there are more important features that are incompatible with doing that. See https://github.com/buildinspace/peru/issues/127 for the gory details.
Why prefer a large repo with many projects, over smaller repos with tooling? The tools that try to join small repos together run into problems when the repos change frequently, and these problems are difficult to solve:
git bisect in the parent repo can't tell you which dependency commit caused a bug.
In contrast, here are the problems you run into in a big unified repo, which I think have good solutions:
git operations get slow.
Thanks for your clarification @oconnor663 !
Hello @oconnor663, thanks for taking the time to explain things for me. There are a couple of other problems with large projects that I don't think there are good solutions to currently.
It's very hard to build structural trust into a large repository
Let's take an extreme example. There are many operating systems that use Python for user space programs. If Python had grown out of the Arch Linux community and was a directory in the Arch source code, then 1) it would probably be more work to pull the Python code base out of Arch for use in other projects and 2) the Arch core contributors and the Python core contributors would have to have full write access to both projects.
Now this might not be a bad thing if both projects have full access. Potentially all individuals involved are awesome individuals and no harm will come out of this arrangement. But the only trust system in this arrangement is your (or other owners') judgement of individuals. Your human trust in them.
Human trust severely limits who you can add to a project. Most projects keep write privileges close to their chest because of the damage write privileges can cause.
An example of this is youtube.com. There's no way to add individual users or teams to a channel. There's just one user and one password per channel. So if multiple people work on a channel everyone gets everything all the time.
The lack of logical partitioning makes it much harder to onboard new developers
We're not at the size of Facebook or Google but we're also not small. As of the time of this writing I'm an owner of a project that contains ~95,000 lines of Python and C code that's had ~500,000 additions and ~400,000 deletions in its lifetime. We act as the lowest unit for several projects built on top of us. I don't know how many LOC it would produce to combine all projects but it would be very large.
Combining all of our projects would be a major barrier to entry. It's already a lot of code to wrap your head around and the control flow isn't always the best as is (all projects acquire cruft).
Large repositories disincentivize creating modular code
My team has been moving more and more towards a microservice approach to software. This can create some performance headaches down the road since it heavily incentivizes message passing, but the benefits (as far as we see them) of creating components with well defined interfaces that are easy to plug and play are large enough that they outweigh the cons. Creating a large repository means that, rather than just git cloning a different component, you actually have to make configuration or code changes to use a different component.
It's also just plain easier to deal with well defined interfaces than it is to modify configurations.
One large repository becomes ungainly when you have different deployment scenarios
This one is the real killer for us. We sell storage clusters to customers and they manage the entire system without us being able to gain access to their network. That's one of our big selling points, they own everything.
The super project that uses the base project I own runs on our internal servers for logs that customers have voluntarily sent to us for diagnostic purposes. But we have another project that uses log parsers, bundles them into an executable, and has our customer run them with a rules engine as a diagnostic health check in their network.
The method by which both of these projects are deployed is hugely different and causes a massive logical split between the two. But they both use the same data source, cluster logs.
What this did was cause both teams to create two sets of parsers with hugely overlapping work. When we looked at the cost to the company for time spent doubling our work it was a hefty penny that we want to avoid in the future.
So now I'm looking at different revision control systems or different tooling solutions for git.
This is really interesting for me to think about. Thanks for writing it up. Here are some random thoughts:
The Facebook solution to the privileges problem was "build your whole organization around everyone having commit privileges." So...that's not very helpful to anyone else. But I think one of the ways they do it is worth mentioning: Sometimes it's possible to automate away the role of the Project Expert. Big test suites can make sure nothing breaks, and custom linters can enforce pretty elaborate coding guidelines. Sometimes these systems do a better job than human reviewers. You could specify certain files or regexes that would trigger an email, so for example when I wrote some code that called HTML_literal_dangerous_dont_use_this(), the guy who knew why that function was terrible got looped in automatically.
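For a sense of how small the "regex triggers an email" piece can be, here is a minimal sketch. It is not the Facebook tooling, which was custom; the rules-file format, the addresses, and the function name `reviewers_for_diff` are invented for illustration, and actually sending mail is left out.

```shell
#!/usr/bin/env bash
# Hypothetical rules file: one rule per line, "<extended-regex><TAB><address>".
# Reads a unified diff on stdin and prints each address whose regex
# matches at least one added line.
reviewers_for_diff() {
    local rules="$1" added regex address
    # Keep added lines only, dropping the "+++ b/file" headers.
    added=$(grep -E '^\+' | grep -Ev '^\+\+\+ ' || true)
    while IFS=$'\t' read -r regex address; do
        [ -n "$regex" ] || continue
        if printf '%s\n' "$added" | grep -Eq -- "$regex"; then
            printf '%s\n' "$address"
        fi
    done < "$rules" | sort -u
}
```

A pre-receive hook or CI step could feed it `git diff main...HEAD` and pass the resulting addresses to whatever notification system is in use.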
That doesn't work by itself for open source projects, where there's potentially no trust, and you don't want some asshat to delete all your tests. But inside a company, where the problem is more about preventing mistakes, I think it can be a good way to let the set of committers get big. I don't have a sense of how good the tools for this sort of thing are though, since everything I'm familiar with was custom.
JavaAPIServer directory and an AndroidApp directory. Both of them use the LogParser library from a third directory. Even though the first two use totally different build systems, they're both able to build with the code from the third (or maybe share a LogParser JAR built with some third system).
@rawrgulmuffins I am very curious as to what you ended up doing. We have a remarkably similar problem in terms of wanting separate repos, and we would also like to be able to make changes both in a repo cloned by peru and in a full copy of the repo located somewhere else.
@oconnor663 Would you consider adding this functionality on a separate branch? I work at NASA on a pretty large mission and we are having the same struggle as Alex regarding how we represent a complex project with both nearly-static vendor dependencies as well as very dynamic in-house dependencies.
I'd actually like to chat with you about our (current/desired) build system from perhaps even a higher level, since it seems like you have a lot of experience with this. Feel free to shoot me an email at joseph.gibson@nasa.gov.
Unfortunately, hacking peru to lay out full repos would amount to a complete rewrite. The main reason is that peru merges all of the modules you import into a single git tree, and most of peru's filesystem operations boil down to git status and git checkout on that tree. Although you could include an entire hg repository in a git tree in theory, git trees are forbidden from representing the .git folder of another git repo, for security reasons. See https://github.com/buildinspace/peru/issues/127 for more.
I love that I can build a yaml file like this
and run peru sync to grab this (and the many more submodules) I'd like to use for this project. More and more I find that I want to break projects into separate repositories with their own revision history.
I would love to have the option for peru to git clone the test_module into the target directory. This means I can add the target directory to my .gitignore, work on the super project while using the sub project (test_module in this case) and I have the option to work on the sub project if the super project needs a feature added / changed.
Basically this boils down to I want to decouple my sub projects from my super projects. Just because my super project needs to depend on all of its sub projects does not mean those sub projects need to be pulled into the super project's source tree (git subtree) or tied to a single commit of each sub project (git submodules).
In my mind I love what sync does but I'd like to change how it initializes things.
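Outside of peru, the requested behaviour (a real clone in the target directory, ignored by the super project) can be approximated with a few lines of shell. This is only a sketch of the workflow, not a peru feature; the function names `ignore_dir` and `clone_ignored` are made up, and it assumes it is run from the root of the super project.

```shell
#!/usr/bin/env bash
# Add /<dir>/ to the super project's .gitignore unless it is already listed.
ignore_dir() {
    local entry="/${1%/}/"
    touch .gitignore
    if ! grep -qxF -- "$entry" .gitignore; then
        # Make sure the existing file ends with a newline before appending.
        [ -n "$(tail -c1 .gitignore)" ] && echo >> .gitignore
        printf '%s\n' "$entry" >> .gitignore
    fi
}

# Clone <url> into <dir> as a full working copy (first run only), then
# make sure the super project ignores it.
clone_ignored() {
    local url="$1" dir="$2"
    if [ ! -d "$dir/.git" ]; then
        git clone --quiet "$url" "$dir" || return 1
    fi
    ignore_dir "$dir"
}
```

Calling `clone_ignored <url> test_module` leaves a normal clone in `test_module/` that can be branched and pushed independently, while the super project's history never records it.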