crystal-lang / shards

Dependency manager for the Crystal language

Add option to clone without git history. #523

Open syeopite opened 3 years ago

syeopite commented 3 years ago

I've been working on a few shards that could potentially get massive git histories. Would it be possible to add a configuration option to only fetch the latest revision?

straight-shoota commented 3 years ago

This topic has been touched before in #180 but it halted at the resolution of the most promising enhancements, global git cache and bare clone.

Cloning with --depth=1 could be feasible, but it's limited by the unavailability of tags. That makes it impossible to discover releases.
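To illustrate why missing tags matter, here is a minimal Python sketch of tag-based release discovery. It assumes releases are tagged as `v` followed by a version number. The function name `releases_from_tags` and the sample tag lists are hypothetical, and this is not the actual Shards implementation.

```python
import re

# Assumption: release tags look like "v1.2.3", optionally with a suffix such as "-rc1".
RELEASE_TAG = re.compile(r"^v(\d+(?:\.\d+)*(?:[-.+][0-9A-Za-z.]+)?)$")


def releases_from_tags(tags):
    """Return the version strings found in a list of git tag names."""
    versions = []
    for tag in tags:
        match = RELEASE_TAG.match(tag)
        if match:
            versions.append(match.group(1))
    return versions


# Hypothetical tag lists: a full clone sees every tag, while a
# --depth=1 clone typically carries few or no tags.
full_clone_tags = ["v0.1.0", "v0.2.0", "nightly", "v1.0.0-rc1"]
shallow_clone_tags = []

print(releases_from_tags(full_clone_tags))
print(releases_from_tags(shallow_clone_tags))
```

With an empty tag list the function finds no releases, which is the situation a resolver would face after a shallow clone.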

I believe it can work for shards install --frozen. That shouldn't require any release discovery. So it would only be a speedup for reproducible installs (such as in production environments). But I think that would definitely be worthwhile, even with moderately sized git histories.

Obviously, a shallow clone must not be placed in the shards cache directory.
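The frozen-install idea above could be sketched as follows. This Python snippet only builds the git command lines for fetching a single pinned commit without history. The function name `shallow_checkout_commands`, the example URL, the commit placeholder, and the destination path are all hypothetical. It also assumes the remote allows fetching a commit by its hash, which not every server permits.

```python
import shlex


def shallow_checkout_commands(repo_url, commit, dest):
    """Build git commands that fetch exactly one commit, without history.

    `dest` is assumed to be an install directory, not the shards cache.
    """
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo_url],
        # --depth=1 limits the fetch to the requested commit only.
        ["git", "-C", dest, "fetch", "--depth=1", "origin", commit],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]


# Hypothetical values, as they might be read from a lock file.
for command in shallow_checkout_commands(
    "https://example.com/owner/name.git", "abc123", "lib/name"
):
    print(shlex.join(command))
```

The commands are returned as argument lists so a caller could pass them to a process runner without shell quoting issues.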

For other use cases, I don't think we can easily work with shallow clones. Perhaps we could try to discover versions through an external API such as GitHub API or shardbox.org (#16). That adds quite some complexity, though. Since shards are cached, the impact is probably not that relevant for many practical applications (except a long initial download and occupation of file space, but both are usually negligible).

What kind of shards are you talking about that would result in massive git histories? And in what ball park would you expect the file size of such repositories?

syeopite commented 3 years ago

What kind of shards are you talking about that would result in massive git histories...

CLDR/ICU implementation. One full version of the data is ~340 MB, and that will keep growing with each release cycle. Although it's not multi-gigabyte sized, it's still far larger than most Crystal projects.

There have also been talks on splitting the Invidious repo into a frontend and a backend while also preserving history. Preserving the history would cost ~66 MB of space that could otherwise have been only ~7 MB or so.

Maybe I've exaggerated a bit in my original post and it's not massive massive, but sizes like the ones I mentioned above are still rather large. Cloning without the git history would help a lot in that regard.

straight-shoota commented 3 years ago

If a full version is already 340 MB, you obviously can't make that any smaller by checking out only a single revision. I'd assume that such localization content mostly just grows with updates (new content is mostly appended instead of replacing old content). So later versions are just going to get bigger. I reckon you probably won't save a lot by downloading only a later version rather than the entire history.

straight-shoota commented 3 years ago

For this use case I'd suggest to consider other options for making the data available. Maybe it can be retrieved through other means more efficiently?

I'd figure for many applications you'd only need a fraction of the entire database, for example.

syeopite commented 3 years ago

Hmm. You're right. Embedding the full 340 MB doesn't seem like a good idea in hindsight. I'll go ahead and split the architecture to use multiple (optional) extension shards instead.

Thanks for the help!

straight-shoota commented 3 years ago

Great!

I'd still keep this issue open for the use case of shards install --frozen.

robacarp commented 2 years ago

More generally, this can be dangerous behavior. Homebrew ran shallow clones for a long time but recently un-shallowed at the specific request of GitHub. See here for their discussion thread on the topic, but it boils down to: shallow clones are harder on all systems involved.

Blacksmoke16 commented 2 years ago

What about downloading/extracting a tarball/zipball of the content versus dealing with git at all (at least when installing from shard.lock)? GitHub exposes /tarball and /zipball endpoints that allow getting an archive based on a branch/tag/commit. E.g. https://github.com/owner/name/tarball/abc123.

Another benefit of this is it would allow leveraging export-ignore git attribute, such as for not shipping spec/ when installing.

EDIT: Composer allows controlling which is used, see https://getcomposer.org/doc/06-config.md#preferred-install.
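The tarball approach suggested above can be sketched as a small URL builder. This Python snippet follows the https://github.com/owner/name/tarball/abc123 pattern from the comment. The function name `github_tarball_url` is hypothetical, and downloading and extracting the archive are deliberately left out.

```python
from urllib.parse import quote


def github_tarball_url(owner, name, ref):
    """Build the archive URL for a branch, tag, or commit of a GitHub repo."""
    # Slashes are kept in `ref` so branch names like "feature/x" stay readable.
    return "https://github.com/{}/{}/tarball/{}".format(
        quote(owner, safe=""), quote(name, safe=""), quote(ref, safe="/")
    )


# The owner, name, and commit are the placeholders used in the comment above.
print(github_tarball_url("owner", "name", "abc123"))
```

An installer working from shard.lock would presumably pass the locked commit as `ref`, so no tag discovery would be needed.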

robacarp commented 2 years ago

I'm pretty sure I'm not in the majority here but I'm all for downloading shards as packaged, signed code. 100%. I'd much prefer to have a way to install a shard by downloading from arbitrary https endpoints rather than requiring me to clone down a repository I have no intention of using as such.

Shards' commitment to VCS-based distribution has been convenient in many ways, but it also means that for N version control systems, shards needs (at least) N methods for pulling dependencies.

We currently have:

The trend here is to add code to the shards repo every time there's a shiny new version control system -- Fossil is pretty great, but hasn't made the list yet. All of these provide compressed package endpoints.

Adding a packaged code installer would also allow for sites like crystal-shards.whatever to host signed distributables so that GitHub doesn't ~start~ get to continue to act as a large failure nexus for Crystal dependencies, and the community can decide what that means.

I guess what I really want to say is: contributions are welcome to add new shard sources, but what if they weren't even necessary?

straight-shoota commented 2 years ago

@robacarp Not sure what you're suggesting here. Some parts sound more like a completely different package manager than an evolution of shards. Shards won't move away from being based on VCS. If we add alternative download methods, they must fit into the existing framework.

Since you mentioned signed code, I think that's something we would lose when installing from a GitHub tarball instead of checking out the repository. Commits can be signed. Tarballs are AFAIK not signed.