ScottArbeit / Grace

Grace Version Control System

Keeping forks on the same server is a security vulnerability and legal risk for the server host, and an operational risk for the maintainer of the fork. #18

Closed: MrJoy closed this issue 6 months ago

MrJoy commented 6 months ago

Scott,

You've got some interesting ideas here, and it's nice to see someone tackling the UX issues that git has. That said, as someone who's been the technical co-founder of a number of companies (including Cloudability, which directly informs my thoughts here), my first impressions upon reading through the introductory documentation are as follows:

  1. If I publish an open source project, and someone who wants to fork it does so on my server, then they have an easy way to forcibly increase my OpEx spend pretty arbitrarily. They can just make a fork, and start adding multi-gigabyte files by the boatload to inflate my S3 bill. Of course, if I can ban someone from doing that on my server, then as a fork maintainer I face the risk that my fork could be shut down at a moment's notice by someone outside of my organization.

  2. Similarly, the fork could be used to host illegal content by a bad actor. As the host of that content, it's entirely reasonable to expect that laws in many jurisdictions may make me liable for that. So, anyone hosting a server for the sake of hosting their OSS project may wind up with a de facto obligation to closely monitor everything being done by anyone forking their project.

  3. A common use-case for me in forking repositories is simply to ensure that they continue to exist if the original developer decides to delete them. Requiring that the fork be on the same server is only a viable strategy if all the projects I rely on are using the Grace equivalent of GitHub. The moment one of them has their own server is the moment when I am either forced to come up with a way to continuously pull updates from one Grace server to another, or carry the risk that a dependency may simply disappear from under me. That adds to my operational risks.

I have other concerns, as well.

One you might have a strategy for addressing, and it might be helpful if you address it in the FAQ: I routinely make use of git-lfs, and have a few repos that are simply too large to work with over an Internet connection -- I host a server for them in my home where I have robust, high-speed, local network access. Think 5.9GB .git folder with a 2GB working directory. I also used to do game development, where large files being updated frequently is a common occurrence. For example, in one project I have a Photoshop file that's 247MB. Having that much data get tossed around every time the artist hits save, if we happen to be working on related branches, could be pretty disruptive. Even if I have the bandwidth to handle it, the amount of time the transfers would take means the "instantly pick up [other] updates and automatically rebase" aspect would be gated on them. For scale: in that repository, there are dozens of .psd, .fbx, and .max files that are 10+MB. I suspect this is a case where my requirements are simply out of scope for what you're trying to achieve. Worth talking about in your FAQ either way, perhaps.

My largest other concern is one you've explicitly decided is out-of-scope (poor/intermittent network access), so I won't delve into that other than to point out that it's not as simple as "sometimes I'm on an airplane". In fact, I rarely fly. I am, however, routinely in areas where I'm on cellular Internet and access is marginal at best. Even if I can maintain a sufficiently stable, fast connection, cost could easily be a concern for me. Not that that should necessarily impact your thinking on this particular architectural decision. What might be worth considering, however, is the impact on my organization's productivity if the VCS server becomes unavailable. You can make it as fault tolerant as you want, but an operational error, a backhoe incident, etc., would mean my entire team has to keep working without the ability to at least incrementally snapshot their work and go back to previous iterations if needed. Local caching of saves/checkpoints might adequately address that concern, however, so perhaps I'm overly worried there.

Relatedly, I will note, another common use-case for me is repositories that are only ever local. I don't know if that's a common occurrence or if I'm just being idiosyncratic. So it may very well not be worth addressing. I suspect, though, that if your thinking is driven by your experience at GitHub, you might not really have a view on how common such a use-case is, as it wouldn't come up in discussions with development teams about their professional usage. It could be, hypothetically, that most developers do this and it simply never came up in conversation. Of course, I doubt it's most, but I simply want to note that it's very hard to have confidence in how common such a use-case is.

All of that said, I appreciate that someone is pushing forward with bold ideas to try and address the substantial UX issues git has and wish you the best of luck. However things turn out for Grace, I look forward to seeing learnings from it driving the state of the art forward in the future.

ScottArbeit commented 6 months ago

Hi @MrJoy! Thank you for the long comment, I appreciate the time that took.

Forks

This part is simple: Like Git, Grace doesn't have forks. (GitHub has forks, not Git.)

Unlike Git, Grace will allow you to have a private branch on a public repo, eliminating the need for forks. In Grace, authorization (which I haven't written yet, but...) will be not just at the repo level, but also at the branch level, probably at the reference level, and at the directory and file level too. Because you'll have been the one to create the private branch, Grace will allow a hoster to attribute storage use to you vs. the owner of the public repo.
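As a purely hypothetical sketch (none of this is designed or written yet, and every field name below is invented), a branch-level permission record might end up looking something like:

```json
{
  "repository": "some-org/some-public-repo",
  "branch": "mrjoy/private-experiments",
  "visibility": "private",
  "createdBy": "MrJoy",
  "permissions": {
    "read": ["MrJoy"],
    "write": ["MrJoy"]
  },
  "storageAttributedTo": "MrJoy"
}
```

The point is just that the private branch, rather than the whole repo, would be the unit that gets access-controlled and billed.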

GitHub has put a lot of engineering effort into supporting networks of forks without blowing out storage requirements; i.e. having 52k forks of https://github.com/torvalds/linux doesn't mean that there are 52k full copies of that repo on GitHub's servers. Grace is designed to minimize the effort of supporting lots of users who want to keep an eye on, or actually work on, public repos.

And being a version control hoster, like being a hoster for anything that allows for file uploads, means taking responsibility for monitoring for abusive behavior. Replacing Git with Grace doesn't change that at all. Grace's architecture and data structure might make it easier to handle and respond to abusive behavior, but only marginally.

forking repositories is simply to ensure that they continue to exist

When you connect with a Grace repo, you'll get the latest version of the code downloaded to your machine. In Git terms, Grace is always and only in (roughly) a partial clone state.
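For readers who want the Git analogy spelled out, a partial (blobless) clone is the closest existing equivalent: file contents are only fetched as they're needed for the current checkout. (The URL below is just a placeholder.)

```sh
# Partial clone: commits and trees come down, but file contents (blobs)
# are fetched on demand -- roughly the state a Grace connection keeps you in.
git clone --filter=blob:none https://example.com/some/repo.git
```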

If you want to save the code, what actually matters is that you have a full copy of the code on your machine that compiles, not that you have the entire history of the repo locally.

I have no idea if the hoster or the repo owner will allow a git bundle file download from Grace repos, but that's the export mechanism I intend to build if having the history locally somehow matters. By default, that bundle will be only the latest versions of each branch, but I'll have to support some sort of history export as well.
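For reference, this is what the equivalent round trip looks like with plain Git today; these are standard git commands, not Grace commands, and the Grace export itself doesn't exist yet:

```sh
# Inside an existing Git clone: pack every ref and its full history into one file.
git bundle create ../myrepo.bundle --all

# Check that the bundle is valid and self-contained.
git bundle verify ../myrepo.bundle

# Later, recreate a working repository from the bundle alone, no server needed.
git clone ../myrepo.bundle myrepo-restored
```

A Grace export that defaults to the latest version of each branch would simply carry less than --all does here, with full history as an opt-in.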

As I keep saying: the only reason the idea of having the entire history of a repo on our local box seems like a requirement is that we're used to Git, and Git is (pseudo-)distributed. None of that is necessary for the vast majority of individuals and teams working on software together.

large files + auto-save

grace watch is most obviously meant for programmers manipulating text files. Grace supports large files, and so I hope it becomes an offramp from Perforce, and grace watch will have to have options like "automatically upload in these directories, but not in those directories".
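None of that is designed yet, but as a purely hypothetical sketch (every key name below is invented), that kind of per-directory control might look something like this in a watch configuration file:

```json
{
  "autoUpload": {
    "include": ["src/**", "docs/**"],
    "exclude": ["art/**/*.psd", "art/**/*.fbx", "art/**/*.max"],
    "maxFileSizeMB": 50
  }
}
```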

It'll have to have lots of options, some of them I can imagine, some I'll be surprised by.

And Grace will run without grace watch, but doing so will make your experience using Grace slower and make some important features impossible.

Offline

Grace is deliberately and proudly centralized. Network conditions are better than ever, in general, in 2024, and they'll be even better in 2026 and 2028 and 2030. That's what I'm designing Grace for. It would be malpractice not to think about that, when this thing couldn't possibly hit production before 2026.

I could imagine enabling something like being able to queue up commands on your own branch in a local JSON file or whatever while you're not connected, but that's about all that could be done. Switching branches won't be possible, grace promote (like git push) won't be possible, grace rebase won't be possible, grace status won't work, grace refs won't work, etc. Offline is simply a seriously degraded experience in Grace, and that's OK.
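Purely as an illustration of that idea (no such file format exists), a locally queued-commands file might look something like:

```json
[
  { "command": "save",       "message": "WIP: refactor importer", "queuedAt": "2024-05-01T14:02:11Z" },
  { "command": "checkpoint", "message": "tests passing locally",  "queuedAt": "2024-05-01T15:30:42Z" }
]
```

When the connection comes back, those could be replayed against your branch in order.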

Probably all that could be enabled is grace save / checkpoint / commit / tag. Is that worth spending engineering effort on? Meh. If someone wants to contribute it, I wouldn't say no, but it's very low on my priority list, and would probably have to wait until after v1.0.

Local

Local version control means distributed version control, and that's what Git and Pijul and jj and lots of other distributed VCSs are for. I'm happy to cede that use case to them.

Supporting a local-only Grace repo would mean Grace is decentralized. But it's not, and it's never going to be. And that's OK.

I don't think it will matter to most users if, when they say grace repo create, it creates a private-by-default repo on a server vs. locally. If what you really want is a local-only repo, I'm not taking Git away from anyone.

MrJoy commented 6 months ago

If you want to save the code, what actually matters is that you have a full copy of the code on your machine that compiles, not that you have the entire history of the repo locally.

First off: Please don't tell me what matters to me. You cannot read my mind, and trying to do so is disrespectful.

What you describe is not the use-case I'm talking about.

I am talking about business continuity. The local machine is not relevant here. It is sometimes the case that I will maintain a fork of a dependency in order to ensure that the author can't pull the rug out from under me by deleting the repository. It is also often the case that I want to maintain a patched fork on an ongoing basis where trying to upstream the changes is not a goal.

In both cases I want to continue to have parity with changes from upstream on an ongoing basis.

To give you a couple concrete examples: For a long time, as part of my day job, we maintained a fork of ActiveAdmin that entirely removed the dependency on sprockets (in favor of the Webpacker pipeline). This was about reducing the number of dependencies that got installed in order to speed up deployments, and make them more reliable (we'd had recurring issues with sassc compilation problems).

A team I am involved with also maintains a fork of Packer to fix a problem with its DAG handling that presents a serious performance problem for us, in certain scenarios. Having seen a previous company I worked at try to get fixes upstreamed with HashiCorp in the past, we haven't bothered trying to upstream this fix -- it's just not worth the effort.

In a number of cases, when a dependency is relatively esoteric, or old, we maintain a fork primarily in case the author decides to delete the repository.

In all cases, we want access to the history on an ongoing basis to aid in debugging and to help us understand how best to implement patches in order to avoid going against the grain and winding up with challenges in pulling in upstream changes. Now, maybe Grace's AI features would make that last part less urgent, but being able to understand and review the history -- even if the original author deletes their repo -- is important to us.

Anyway, I've offered my two cents. Once again, I hope to see interesting ideas come out of Grace -- but as it doesn't meet my requirements it's probably best if I bow out of this conversation and stop wasting your time.