JuliaLang / Pkg.jl

Pkg - Package manager for the Julia programming language
https://pkgdocs.julialang.org

Packages without Git: plain directories or Mercurial #1506

Closed ghost closed 4 years ago

ghost commented 4 years ago

I refuse to shoot my own feet off by touching Git, but would like to create (local) packages to better manage parts of my code that are generic, and parts that are specific to certain projects (scientific articles). Please allow Pkg.add("directory") with directory being either a) a plain directory without version control, or b) a Mercurial repository.

If public Mercurial repositories were supported, I could contribute packages; without that, I just publish my work on Zenodo for archival. I will not be a masochist and touch Git.

fredrikekre commented 4 years ago

> Please allow Pkg.add("directory") with directory being either a) a plain directory without version control

That already works.

ghost commented 4 years ago

> > Please allow Pkg.add("directory") with directory being either a) a plain directory without version control

> That already works.

Tried it: ERROR: Git repository not found at 'path'

fredrikekre commented 4 years ago

Right, sorry, you need Pkg.develop(PackageSpec(path="directory")).
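As a minimal sketch of what a plain directory needs before `Pkg.develop` will accept it, assuming the usual package layout (a Project.toml with a name and uuid, plus src/<Name>.jl): the helper `make_local_package` and the package name `MyLocalTools` are hypothetical, made up for illustration only.

```julia
using UUIDs

# Write the smallest directory layout that Pkg treats as a package:
# a Project.toml with name/uuid and src/<Name>.jl defining module <Name>.
# No version control of any kind is involved.
function make_local_package(parent::AbstractString, name::AbstractString)
    pkgdir = joinpath(parent, name)
    mkpath(joinpath(pkgdir, "src"))
    write(joinpath(pkgdir, "Project.toml"),
          "name = \"$(name)\"\nuuid = \"$(uuid4())\"\nversion = \"0.1.0\"\n")
    write(joinpath(pkgdir, "src", "$(name).jl"),
          "module $(name)\n\ngreet() = \"hello from $(name)\"\n\nend\n")
    return pkgdir
end

pkgdir = make_local_package(mktempdir(), "MyLocalTools")  # hypothetical name
println("Package skeleton written to ", pkgdir)

# To use it from a project (not run here, because it modifies the active environment):
#   using Pkg
#   Pkg.develop(PackageSpec(path=pkgdir))
#   using MyLocalTools
```

The directory can then be kept under Mercurial, Darcs, or nothing at all; Pkg only records the path.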

ghost commented 4 years ago

Also:

(v1.2) pkg> ?add
  ...
  If a local path is used as an argument to add, the path needs to be a git
  repository. The project will then track that git repository just like it
  would track a remote repository online.

ghost commented 4 years ago

> Right, sorry, you need Pkg.develop(PackageSpec(path="directory")).

Ah, yes, that seems to work.

KristofferC commented 4 years ago

Yeah, add with a URL or path needs to be a git repo but you can get the code in whatever way you want and use develop and point that to a path.

KristofferC commented 4 years ago

I don't think that support for other version control systems will be added. But as has been said, you can already use develop on a path (which can then be version controlled however desired).

ghost commented 4 years ago

However, that will restrict easily installable package development to those masochistic enough to work with Git, seriously damaging the ecosystem and Julia. I for one will release nothing I develop as packages installable with Pkg.add. There has to exist an alternative to the hell that is working with Git, the worst and most poorly designed piece of software ever released. Pkg.develop is not one, as it is not an end-user solution.

ghost commented 4 years ago

Git aficionados are the scum of the earth—forcing it upon everyone they run across, everywhere they can.

StefanKarpinski commented 4 years ago

If you want to work on adding support for mercurial as an opt-in plugin, that would be great.

StefanKarpinski commented 4 years ago

At a high level, since Julia 1.4, neither installing nor developing Julia packages is in principle tied to git anymore, although the tooling is, of course, much better developed if you are using git.

Since Julia 1.0, a package version is associated with a particular source tree hash, which is content-addressed using git's tree hashing algorithm, but that can be used to hash any source tree, regardless of whether git is used for development or not.

Since Julia 1.4, with the introduction of the Pkg Protocol, it's possible to install packages without using git since packages can be installed using the protocol, which simply serves registries, packages and artifacts as content-addressed tarballs.
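The content-addressed layout can be sketched as follows, assuming the resource paths described in the Pkg Protocol design issue (https://github.com/JuliaLang/Pkg.jl/issues/1377). The helper function names are hypothetical, and the UUID and hash are placeholder values, not a real package.

```julia
# Resource paths as described for the Pkg Protocol; the helper names are made up.
const SERVER = "https://pkg.julialang.org"

registries_url(; server=SERVER) = "$(server)/registries"
registry_url(uuid, hash; server=SERVER) = "$(server)/registry/$(uuid)/$(hash)"
package_url(uuid, hash; server=SERVER) = "$(server)/package/$(uuid)/$(hash)"
artifact_url(hash; server=SERVER) = "$(server)/artifact/$(hash)"

# Placeholder identifiers (assumptions, for illustration only).
uuid = "01234567-89ab-cdef-0123-456789abcdef"
tree = "0123456789abcdef0123456789abcdef01234567"

println(registries_url())
println(package_url(uuid, tree))
```

Because each URL names its content by hash, a client that knows what to ask for also knows what the response must hash to.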

From Julia 1.5 onward, using the Pkg Protocol is the default so installing packages using git will be strictly a fallback for unregistered packages that are only available via git repos.

So the bones are there, but someone who cares about this issue needs to drive it and make sure that things actually work for people not using git, otherwise it will never work well. Are you that person, @vomout? Can we count on you to drive this and make sure it's a good experience?

ghost commented 4 years ago

@StefanKarpinski Does this mean that:

I don't have time to write servers and JSON communications etc. (that a quick glance of https://github.com/JuliaPackaging/PkgServer.jl seems to imply), but I could spend some time writing tarball-generation tools from Mercurial repositories (should be pretty straightforward), or tools for generating directory structures to serve from a static web server (that a quick glance of https://github.com/JuliaLang/Pkg.jl/issues/1377 seems to imply).

StefanKarpinski commented 4 years ago

If you add a local path it does have to be a git repository, since otherwise how do you know what the git tree hash is? We could compute it from the source tree, but it's hard to know what should and should not be included in the tree. You can, however, dev a path whether it's a git repo or not.

We will not support using Mercurial changeset hashes because git-tree-sha1 is the specific source tree hashing algorithm we use (and a changeset is not a tree anyway, it's more like a commit). We may change source tree hashing algorithms in the future, but it will still be for a tree. What will be necessary is the mechanism for acquiring a tarball from a Mercurial repo URL. That acquisition code will be factored out in the near future, at which point adding support for other acquisition methods should be straightforward. That same code will be used by the Pkg client itself and by Pkg servers that serve those tarballs. Stay tuned.

ghost commented 4 years ago

So this is for verification of contents instead of identification? So something like find . -type f|xargs cat|shasum? No problem calculating it from a local tree; the only problem is extra files that shouldn't be there. Why not use something standard like GPG signatures of a tarball then?

If it's not used for verification of contents, then it can be anything; a changeset id. Mercurial changeset ids depend on previous changesets to my knowledge. In Darcs (the most beautiful DVCS) they don't.

Also what is the UUID in Project.toml then for? Its change surely affects the verification signature.

Basically any hosting provider provides tarball links; even basic hgweb (e.g., https://www.mercurial-scm.org/repo/hg/archive/tip.tar.gz is the link to the current version of Mercurial itself). Also hg archive will provide a tarball.

ghost commented 4 years ago

“A changeset ID is a 160-bit identifier that uniquely describes a changeset and its position in the change history of a repository, regardless of which machine or repository it's on. This is represented to the user as a 40 digit hexadecimal number. Technically, a changeset ID is a nodeid.“ https://www.mercurial-scm.org/wiki/ChangeSetID

StefanKarpinski commented 4 years ago

> So this is for verification of contents instead of identification?

Both—they are content-addressed: the hash of the contents is the identity.

> So something like find . -type f|xargs cat|shasum?

That's a very rough idea of it, but you need to canonicalize the ordering, handle weird names, and capture (all and only) significant metadata about each file. You can use Pkg.GitTools.tree_hash to compute the hash of an arbitrary tree. This is an independent implementation that does not require git. Think of it as a tree hashing algorithm with the right properties. The fact that git happens to implement it and use it is a tangential fact.
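A small sketch of the property being described, using the `Pkg.GitTools.tree_hash` function named above on directories that have never been touched by git. The helper `make_tree` and the file contents are made up for illustration.

```julia
using Pkg

# Build a small source tree in a fresh temporary directory
# from (relative path => contents) pairs.
function make_tree(files)
    root = mktempdir()
    for (rel, contents) in files
        path = joinpath(root, rel)
        mkpath(dirname(path))
        write(path, contents)
    end
    return root
end

files = ["README.md" => "demo\n", "src/Demo.jl" => "module Demo\nend\n"]

a = make_tree(files)
b = make_tree(reverse(files))   # same contents, created in a different order
c = make_tree(["README.md" => "changed\n", "src/Demo.jl" => "module Demo\nend\n"])

hash_a = bytes2hex(Pkg.GitTools.tree_hash(a))
hash_b = bytes2hex(Pkg.GitTools.tree_hash(b))
hash_c = bytes2hex(Pkg.GitTools.tree_hash(c))

println("a == b: ", hash_a == hash_b)   # identical trees
println("a == c: ", hash_a == hash_c)   # one file differs
```

Identical trees hash identically regardless of where or in which order they were created, while any change in content yields a different hash.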

> Why not use something standard like GPG signatures of a tarball then?

That may verify a particular tarball, but the same tree can be turned into a tarball many different ways. You need to specify which aspects of the tree are significant, how to order the contents in the tarball and precisely how to generate the tarball so that equivalent trees produce the same tarball. GPG signatures require a PKI to verify (which we don't want to require). Each signature is different, so if you sign the same tree again, you would get a different result. We don't want any of that. What we want is something just like the way git hashes source trees. So that's what we do.

> If it's not used for verification of contents, then it can be anything; a changeset id.

It is used for verification, it cannot be arbitrary.

> Mercurial changeset ids depend on previous changesets to my knowledge.

Yes, we used git commits in Pkg1/2. It was a mistake: it means you need to preserve history and clone an entire repository in order to verifiably install a version of a package.

> Also what is the UUID in Project.toml then for? Its change surely affects the verification signature.

Have you read the documentation at all? This comment doesn't really make sense.

> Basically any hosting provider provides tarball links; even basic hgweb (e.g., https://www.mercurial-scm.org/repo/hg/archive/tip.tar.gz is the link to the current version of Mercurial itself). Also hg archive will provide a tarball.

Yes, that works for versions for which such tarballs exist. However, manifest files can include trees for unregistered versions of packages so long as they can be acquired via the version control system. Knowledge of how to figure out that URL also needs to exist somewhere.

ghost commented 4 years ago

> You can use Pkg.GitTools.tree_hash to compute the hash of an arbitrary tree. This is an independent implementation that does not require git.

So no problem installing local tarballs or (clean) trees. The clean tree issue could easily be fixed by supporting a FileManifest.txt listing the files that should actually be in there—but local or any tarballs are really enough.

> Have you read the documentation at all? This comment doesn't really make sense.

None of the packages (DelimitedFiles and Printf, and as their dependencies Unicode and Mmap) that one of my projects depends on has a git-tree-sha1 (or version) in the Manifest.toml, only uuid. In another project various packages are also lacking it. So apparently the tree hash isn't even needed.

> Yes, works for versions for which such tarballs exist. Manifest files can include trees for unregistered versions of packages, however, so long as they can be acquired via the version control system. Knowledge of how to figure out that URL also needs to exist somewhere.

You can get a tarball for any changeset: randomly picked from Mercurial's own repository: https://www.mercurial-scm.org/repo/hg/archive/eb9026a84e83.tar.gz . (Same with Heptapod, a Gitlab fork.) The base name is just the short-form (first 12 characters) changeset id.

So all you need is a mapping from the tree hashes (used as a versioning scheme) to URLs, in no way tied to any particular version control system. Basically, to make a release (in the centralised or another registry), there should be a system to submit the URL of a tarball, generated in any way one wants (automatically by a VCS hosting platform, through hg archive, or manually). The submission system then calculates the tree hash from the tarball contents and puts it in the registry with the URL. Mirrors could of course also be supported. For verifying corruption in transit, possibly the tree hash should also be submitted (calculated using provided tools locally from the tarball).

StefanKarpinski commented 4 years ago

> None of the packages (DelimitedFiles and Printf, and as their dependencies Unicode and Mmap) that one of my projects depends on has a git-tree-sha1 (or version) in the Manifest.toml,

That's because they are stdlibs and you cannot choose what version you use—you get whatever Julia ships with.
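To see the distinction in a manifest, here is a rough sketch that parses a Manifest.toml fragment in the format of that era and reports which entries carry a tree hash. It assumes the `TOML` standard library of newer Julia versions (at the time of this thread the parser lived in `Pkg.TOML`). `SomeRegisteredPkg`, the UUIDs and the hash are illustrative values only.

```julia
using TOML

# A manifest fragment: two stdlib entries and one hypothetical registered package.
manifest_text = """
[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[SomeRegisteredPkg]]
git-tree-sha1 = "0123456789abcdef0123456789abcdef01234567"
uuid = "01234567-89ab-cdef-0123-456789abcdef"
version = "1.2.3"
"""

manifest = TOML.parse(manifest_text)

# Each top-level key maps to an array of tables; inspect its first entry.
for name in sort(collect(keys(manifest)))
    entry = first(manifest[name])
    kind = haskey(entry, "git-tree-sha1") ? "pinned by tree hash" : "no tree hash (ships with Julia)"
    println(rpad(name, 20), kind)
end
```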

> So all you need is a mapping from the tree hashes (used as a versioning scheme) to URLs, in no way tied to any particular version control system.

Roughly, yes.

> For verifying corruption in transit, possibly the tree hash should also be submitted (calculated using provided tools locally from the tarball).

Everything served over the Pkg protocol is content addressed, so if you can ask for it, you already know how to verify it.

ghost commented 4 years ago

> > For verifying corruption in transit, possibly the tree hash should also be submitted (calculated using provided tools locally from the tarball).

> Everything served over the Pkg protocol is content addressed, so if you can ask for it, you already know how to verify it.

This would be to make sure the tree hash in the registry matches what the author intended to submit: that the tarball wasn't corrupted when downloaded by the registry. One can imagine the web server serving the tarball being compromised, and serving dangerous packages. The author submitting the tree hash from the local machine helps to avoid such things.

StefanKarpinski commented 4 years ago

We already verify content at every stage of the process. The client, the Pkg server, etc. all verify that anything they install has the correct content hash before doing anything with it (we also verify tarball hashes when those are available before trying to unpack tarballs).
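The kind of check described can be approximated as below. This is only a sketch of the idea, not Pkg's actual verification code; `tree_hash_hex` and `verify_tree` are hypothetical helpers built on `Pkg.GitTools.tree_hash`.

```julia
using Pkg

# Hex form of the git-tree-sha1 of a directory.
tree_hash_hex(path) = bytes2hex(Pkg.GitTools.tree_hash(path))

# Accept an unpacked tree only if it hashes to the value we asked for.
verify_tree(path, expected_hex) = tree_hash_hex(path) == lowercase(expected_hex)

root = mktempdir()
write(joinpath(root, "Project.toml"), "name = \"Demo\"\n")
expected = tree_hash_hex(root)   # stands in for the hash recorded in a registry

ok_before = verify_tree(root, expected)

# Simulate corruption or tampering after the hash was recorded.
write(joinpath(root, "Project.toml"), "name = \"Tampered\"\n")
ok_after = verify_tree(root, expected)

println("untouched tree accepted: ", ok_before)
println("modified tree accepted:  ", ok_after)
```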

ghost commented 4 years ago

> Since Julia 1.4, with the introduction of the Pkg Protocol, it's possible to install packages without using git since packages can be installed using the protocol, which simply serves registries, packages and artifacts as content-addressed tarballs.

Is this actually supposed to work and how? I've tried to ] add both local and online tarballs, but it just tries to clone it as a Git repo.

Or (reading the bit about 1.5), is it only supposed to work for registered packages and unregistered packages are still… eugh… git-only? It should be possible to just install a tarball from a URL instead of forcing heavy (semi-)centralised mechanisms.

ghost commented 4 years ago

I tried to create the following dummy PkgServer that just converts a directory of tarballs into the apparently correct structure, which is then to be rsynced to a static web server. But even the whole JULIA_PKG_SERVER=pkg.julialang.org doesn't seem to do anything in Julia 1.4. It still tries Git. So I didn't get to check whether it works at all. (Certainly the diff responses will be wrong, and some things could be improved by automated .htaccess generation to refer to the original tarballs etc.)

Anyway, it's not the right solution. Requiring users to add random registries is dangerous. For things outside the centralised system (FOSS people really have a fetish for centralised distribution systems… did it long before AppStore and GooglePlay etc.… makes the state-communism comparisons seem valid), it's much better to just be able to install a tarball from a trusted URL. (Pretty practical for trying out the codes for a scientific article straight from Zenodo: Just ]add https://zenodo.org/record/...tarball!)

#!/usr/bin/env julia
#
# Arguments: TARBALLDIR TARGETDIR
# Copies each tarball in TARBALLDIR to TARGETDIR/registry/<uuid>/<tree hash>
# and writes an index of those paths to TARGETDIR/registries.

using Pkg
import Pkg.TOML

# `gzip -dc` is the portable spelling of `gzcat`.
const decompress = `gzip -dc`

function process_tarball(targetdir, tarball, dirname)
    local hash, uuid

    mktempdir() do tmpdir
        run(pipeline(`$decompress $tarball`, `tar -C $tmpdir -x`))
        # Hash the extracted tree (this includes the tarball's top-level directory).
        hash = bytes2hex(Pkg.GitTools.tree_hash(tmpdir))
        # Make everything writable so the temporary directory can be cleaned up.
        chmod(tmpdir, 0o777, recursive=true)
        # The tarball is expected to unpack into a directory named after itself.
        pkginfo = TOML.parsefile(joinpath(tmpdir, dirname, "Project.toml"))
        uuid = pkginfo["uuid"]
    end
    entry = (uuid, hash)
    basepath = joinpath(targetdir, "registry", uuid)
    targetpath = joinpath(basepath, hash)
    mkpath(basepath)
    cp(tarball, targetpath; force=true)
    return entry
end

tarballdir = ARGS[1]
targetdir = ARGS[2]
web_prefix = ""

registries = []

foreach(readdir(tarballdir)) do tarball
    m = match(r"^(.*)\.(tar\.gz|tgz)$", tarball)
    if isnothing(m)
        @warn "Skipping $tarball: does not appear to be a tarball"
    else
        println("Processing $tarball")
        entry = process_tarball(targetdir, joinpath(tarballdir, tarball), m.captures[1])
        push!(registries, entry)
    end
end

# Ensure the target directory exists even if no tarball was processed.
mkpath(targetdir)
open(joinpath(targetdir, "registries"), write=true) do io
    for (uuid, hash) in sort!(collect(registries))
        println(io, "$web_prefix/registry/$uuid/$hash")
    end
end

StefanKarpinski commented 4 years ago

I don't really care for the attitude here. Nothing personal, but I've just hit my limit on putting up with people on the internet who want to complain and/or tell me how we're doing things wrong. I'm happy to work towards supporting other ways to develop and host packages, but I don't think this conversation is worthwhile to continue from my perspective. I hope you can figure out how to get the Pkg protocol thing working—you may want to ask for help on discourse.

ghost commented 4 years ago

Well you are doing things… fundamentally wrong, although not completely wrong. For example, the uuid+tree hash based Pkg Server thing is a good idea as a persistent cache, but it's not for everything. I'm not going to put the one-off codes for my scientific articles in a centralised system, although I might consider putting well-developed libraries that they depend on there (and then again, maybe not, due to my very bad past experiences with the FOSS herd… the licenses I use due to those experiences might even get my code thrown out from the centralised system).

Much better than saying please ]add MyArticle is to please ]add https://zenodo/…/myarticle.tar.gz… because you want a specific version (which a tarball is a pointer to), and MyArticle might not even end up pointing to my stuff; it might point to someone else's stuff. Sure, you could tell people to install a UUID, but those are cryptic, and … why put random one-off stuff in a distribution system? The centralised system could surely cache that thing, to ensure persistency (which Zenodo also does), but the primary pointers should be more individualistic, not this FOSS herd stuff of a central system.

StefanKarpinski commented 4 years ago

Being able to add packages via URL to a tarball is certainly a reasonable feature request. You don't really need to insist that we're doing things "fundamentally wrong" and that we are "scum of the earth" in order to request that. This whole interaction hasn't really had the effect of moving that feature higher up on my list of priorities, but I'm sure we'll get to it at some point.

ghost commented 4 years ago

And I don't really care if it's ]add https://tarball or ]develop https://tarball, but making it consciously difficult to install something that is not in a byzantine distribution system or git is just… fundamentally wrong.