JuliaLang / Pkg.jl

Pkg - Package manager for the Julia programming language
https://pkgdocs.julialang.org

Allow optional commit and tag metadata in Manifests and registries #3718

Open simonbyrne opened 8 months ago

simonbyrne commented 8 months ago

Currently we only identify versions by their git-tree-sha1. However this is sub-optimal when looking up git histories: GitHub doesn't provide a convenient way to find commits of a given tree, which means that e.g. TagBot has to jump through all sorts of funny hoops to try to link the tag back to a given commit.

I propose the following:

cc: @IanButterworth

Current PRs:

simonbyrne commented 8 months ago

Another use case would be rewriting file paths in CI stacktraces so that we can provide an HTML link to the GitHub URL.
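As a rough illustration of that use case, the sketch below turns a file path from a stack trace into a GitHub permalink once the commit is known. The function name and its parameters are hypothetical and not part of Pkg or TagBot; only the `blob/<ref>/<path>#L<line>` URL layout is GitHub's.

```python
from posixpath import join as join_path
from urllib.parse import quote


def github_source_link(repo_url, commit, path, line, subdir=""):
    """Build a permalink to `path` (relative to the package root) at `commit`.

    `subdir` is the package's location inside the repository, if any.
    Hypothetical helper: the name and signature are illustrative only.
    """
    repo_url = repo_url.rstrip("/")
    if repo_url.endswith(".git"):
        repo_url = repo_url[: -len(".git")]
    full_path = join_path(subdir, path) if subdir else path
    # quote() keeps "/" intact and escapes characters such as spaces.
    return f"{repo_url}/blob/{commit}/{quote(full_path)}#L{line}"
```

With only a tree hash there is no equivalent URL to build, which is why the commit (or a tag) would need to be recorded somewhere.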

DilumAluthge commented 8 months ago

Can we split this into two separate issues, one for the General registry (Versions.toml), and a separate one for local Manifest.toml files? It seems to me that those two can be implemented independently of each other.

simonbyrne commented 8 months ago

I think you would need the registry one first, no?

simonbyrne commented 8 months ago

What would actually need to be done for the registry changes?

From what I can tell, it seems that: https://github.com/JuliaRegistries/RegistryTools.jl/blob/77e2a02e62185ce865653bdae95203a3a40510f0/src/register.jl#L326 would need to be updated, along with Registrator.jl?

DilumAluthge commented 8 months ago

I think you would need the registry one first, no?

Oh, I was thinking that the manifest would be getting its info from a local Git clone. But yeah, if the plan is for the manifest to get the info from the registry, then we first need to implement this in the registry.

KristofferC commented 8 months ago

Makes sense to me to have optional metadata tied to versions that can be used to improve various tooling. You would have to verify that the commit metadata resolves to the correct tree, right?
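A minimal sketch of such a check, assuming a local clone of the package repository is available. The helper names are hypothetical; the git revision expressions `<commit>^{tree}` and `<commit>:<path>` are standard git syntax. The resolver is passed in as a parameter so the logic can be exercised without a real repository.

```python
import subprocess


def tree_spec(commit, subdir=""):
    """Git revision expression naming the tree of `subdir` at `commit`."""
    subdir = subdir.strip("/")
    return f"{commit}:{subdir}" if subdir else f"{commit}^{{tree}}"


def git_rev_parse(repo_dir, spec):
    """Resolve `spec` to an object hash using the local clone at `repo_dir`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify", spec],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def commit_matches_tree(repo_dir, commit, expected_tree, subdir="", resolve=git_rev_parse):
    """True if `commit` (optionally at `subdir`) has tree hash `expected_tree`."""
    try:
        actual = resolve(repo_dir, tree_spec(commit, subdir))
    except subprocess.CalledProcessError:
        return False  # unknown commit or missing path
    return actual == expected_tree
```

Because the tree hash stays authoritative, a failed check only means the optional metadata is stale, not that the registered version is invalid.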

simonbyrne commented 8 months ago

The commit info we can probably get from Registrator. But I was thinking we could have a cron job that periodically queries the repos and updates the registry as required.

ericphanson commented 7 months ago

In the case of subpackages, we can have an optional git-tree-path giving the path in the commit/tag to the corresponding tree.

We do have subdir that is similar but that's one subdir per package, rather than per version

simonbyrne commented 7 months ago

We do have subdir that is similar but that's one subdir per package, rather than per version

Perhaps I should rename it for consistency? Should I just call it subdir as well?

ericphanson commented 7 months ago

Yeah, maybe subdir should just move to be per version? Or have both for a while until all supported Pkg versions know the new location.

simonbyrne commented 7 months ago

Yeah, maybe subdir should just move to be per version? Or have both for a while until all supported Pkg versions know the new location.

We probably need to keep a global one for non-released versions (e.g. a specific git commit)
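One way to picture keeping both locations is a lookup that prefers the per-version value and falls back to the global one. This is only a sketch: the dictionaries stand in for parsed registry data, `git-tree-path` is the field name proposed in this thread, and the function name is hypothetical.

```python
def source_subdir(package_info, version_info):
    """Directory of the package inside its repository for one version.

    Prefers the per-version `git-tree-path` proposed in this thread and falls
    back to the existing per-package `subdir`; "" means the repository root.
    """
    per_version = version_info.get("git-tree-path")
    if per_version is not None:
        return per_version
    return package_info.get("subdir", "")
```

A non-released version, such as a specific git commit, has no per-version entry, so it would always take the fallback branch.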

GunnarFarneback commented 7 months ago

I'd like to better understand the use cases for the different pieces of information.

GunnarFarneback commented 7 months ago

Trick question: What is the tooling supposed to do if the same package is found in multiple registries, with diverging values for the optional fields?

simonbyrne commented 7 months ago

  • git-tree-path: Presumably needed to create source code links, accounting for the possibility that packages are moved around in the repository so that the global subdir is insufficient?

Yes, exactly.

  • git-tag-name: For what purpose is it useful to have this information in the registry?

Two reasons

Trick question: What is the tooling supposed to do if the same package is found in multiple registries, with diverging values for the optional fields?

Pick the first one? In general, it shouldn't matter, only the tree hashes should, the optional fields are just there to help find the trees.
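A sketch of what "pick the first one" could look like, under two assumptions that go beyond what is said above: a disagreement on the tree hash is treated as an error, and each optional field is taken from the first registry that provides it. The function name is hypothetical and the entries are plain dictionaries standing in for parsed registry data.

```python
def merge_version_entries(entries, optional_fields):
    """Combine entries for one package version from several registries.

    `entries` is ordered by registry priority. The tree hash must agree
    everywhere; each optional field comes from the first registry that has it.
    """
    if not entries:
        raise ValueError("no registry provides this version")
    trees = {entry["git-tree-sha1"] for entry in entries}
    if len(trees) != 1:
        raise ValueError(f"registries disagree on the tree hash: {sorted(trees)}")
    merged = {"git-tree-sha1": entries[0]["git-tree-sha1"]}
    for field in optional_fields:
        for entry in entries:
            if field in entry:
                merged[field] = entry[field]
                break
    return merged
```

Diverging optional values never change which source tree is installed; they only change which commit or tag the tooling would link to.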

GunnarFarneback commented 7 months ago

tags can have information that the commit does not (e.g. annotated tags can contain release notes or signatures)

What is the intended workflow to get the tag names into the General registry? I can see two possibilities:

  1. Registrator directly writes a tag name, which may or may not materialize in the package repository depending on whether TagBot is activated and runs successfully.
  2. Registrator doesn't write the tag name, instead TagBot makes a new PR to General to add this information, once it has made the tag in the package repository.

Neither of those options seems great, so I hope I've missed some better approach.

GitHub lets you link to revisions via tags, which gives "nicer" URLs

A shorter URL is certainly nicer than a longer one, but it seems like a marginal win compared to the increased size of the registry, the logistics around syncing registry tag information and package repository tags, and the possibility that the nicer link suddenly breaks if someone mistakenly deletes a non-annotated tag.

the optional fields are just there to help find the trees.

This sounds contrary to the use case of investigating annotated tags.

simonbyrne commented 7 months ago

  • How would the tooling make use of signatures?

  • If we want to have tooling around release notes, digging them out from annotated tags seems like the wrong way. Much better would be to have them in a file in the repository, in some standardized format.

I don't have too many thoughts on what sort of tooling this could be useful for, but some other reasons it is useful to have tags:

What is the intended workflow to get the tag names into the General registry? I can see two possibilities:

  1. Registrator directly writes a tag name, which may or may not materialize in the package repository depending on whether TagBot is activated and runs successfully.

  2. Registrator doesn't write the tag name, instead TagBot makes a new PR to General to add this information, once it has made the tag in the package repository.

Neither of those options seems great, so I hope I've missed some better approach.

This I don't have a good answer to yet. One other option would be to have a semi-regular job (say weekly), which goes through and verifies:

  1. the commit hashes exist and point to the appropriate tree
  2. the tags exist and point to the appropriate commits

and if any updates are required, open a PR against the registry.
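The two checks above can be sketched as a single pass over a package's versions. This is an illustration under assumptions: `git-commit-sha1` is a hypothetical field name (the thread does not fix one), `git-tag-name` and `git-tree-path` are the names used in this discussion, and `resolve` is a caller-supplied function that maps a git revision expression to an object hash, returning None when it cannot be resolved.

```python
def audit_versions(versions, resolve):
    """Return (version, problem) pairs for metadata that no longer checks out.

    `versions` maps a version string to its registry fields.
    `resolve(spec)` returns the object hash for a git revision expression,
    or None if it cannot be resolved (for example a deleted tag).
    """
    problems = []
    for version, fields in sorted(versions.items()):
        commit = fields.get("git-commit-sha1")  # hypothetical field name
        tag = fields.get("git-tag-name")
        if commit is not None:
            path = fields.get("git-tree-path", "")
            spec = f"{commit}:{path}" if path else f"{commit}^{{tree}}"
            if resolve(spec) != fields["git-tree-sha1"]:
                problems.append((version, "commit does not point to the registered tree"))
        if tag is not None and commit is not None:
            # ^{commit} peels annotated tags down to the commit they reference.
            if resolve(f"refs/tags/{tag}^{{commit}}") != commit:
                problems.append((version, "tag does not point to the registered commit"))
    return problems
```

A non-empty result would be the trigger for opening a PR against the registry; how the replacement values are found is left out of this sketch.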

A shorter URL is certainly nicer than a longer one, but it seems like a marginal win compared to the increased size of the registry, the logistics around syncing registry tag information and package repository tags, and the possibility that the nicer link suddenly breaks if someone mistakenly deletes a non-annotated tag.

I don't think the size will increase too much: it's 1 extra field per version, this is dwarfed by the compat information per version.

As for breaking things: my suspicion is that tags are likely to be more stable than commit hashes (e.g. if you rewrite history to remove an intermediate commit, you can still keep the same tag names, but commit hashes will change). It is up to users what they want to use it for, but they shouldn't expect either commits or tags to be completely immutable over time.

GunnarFarneback commented 7 months ago

but some other reasons it is useful to have tags:

Fair enough, those sound like decent arguments.

I don't think the size will increase too much: it's 1 extra field per version, this is dwarfed by the compat information per version.

That depends on the number of dependencies and changes in dependencies, but a stronger argument is that the tag name info can be expected to compress really well. Luckily this is a testable hypothesis.

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Unpacking it and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553. This is slightly worse than on the package servers, but I won't bother trying to find exactly how those are computed. For this purpose gzip -9 should be good enough to estimate the relative size increase.

Now adding random commit hashes to all Versions.toml files increases the compressed tarball size to 9798614 bytes. Additionally adding tag names (constructed as v0.5.6 etc.) increases the size to 10069487.

In summary, adding commit hashes to all packages increases the registry size by 38%, and also adding tag names increases it by another 3%.
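The percentages follow directly from the three gzip -9 sizes quoted above. The short calculation below reproduces them; the variable and function names are illustrative only.

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


baseline = 7093553       # gzip -9 of the unmodified registry
with_commits = 9798614   # plus a commit hash per version
with_tags = 10069487     # plus a tag name per version

commits_vs_baseline = percent_increase(baseline, with_commits)
tags_vs_commits = percent_increase(with_commits, with_tags)
total_vs_baseline = percent_increase(baseline, with_tags)

print(f"{commits_vs_baseline:.1f}% {tags_vs_commits:.1f}% {total_vs_baseline:.1f}%")
# prints: 38.1% 2.8% 42.0%
```

The "another 3%" is measured against the size that already includes commit hashes; measured against the unmodified registry, both fields together come to roughly 42%.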

simonbyrne commented 7 months ago

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Unpacking it and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553.

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

In summary, adding commit hashes to all packages increases the registry size by 38%, and also adding tag names increases it by another 3%.

Thanks for trying this out: I guess ~40% increase in size is a reason to be hesitant. Personally, I feel it's worth it, but would understand if others feel otherwise.
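To make the point about hex digits concrete, the sketch below compresses a block of random 40-digit hex strings with Python's standard gzip and lzma modules. It does not reproduce the registry measurement above; it only shows that hex-encoded random hashes shrink to roughly half their text size and no further, since each character carries 4 bits of information in an 8-bit byte. The function name and the sample size are arbitrary choices.

```python
import gzip
import lzma
import random


def hex_hash_compression(n_hashes=2000, seed=0):
    """Compress `n_hashes` random 40-digit hex strings; return sizes in bytes."""
    rng = random.Random(seed)
    lines = [f"{rng.getrandbits(160):040x}" for _ in range(n_hashes)]
    raw = "\n".join(lines).encode("ascii")
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw, compresslevel=9)),
        "xz": len(lzma.compress(raw)),
    }
```

Real commit hashes behave like the random values here, so a per-version commit hash costs about 20 bytes after compression regardless of the compressor, whereas tag names such as v0.5.6 are highly repetitive and compress far better.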

simonbyrne commented 7 months ago

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

Actually that doesn't seem right:

➜  reg ls -l -h
total 127512
-rw-r--r--@ 1 simon  staff    55M Jan  5 09:51 793278ad7a09a821cfac38e86fc150f6c9a00f7f.tar
-rw-r--r--@ 1 simon  staff   7.0M Jan  5 09:51 793278ad7a09a821cfac38e86fc150f6c9a00f7f.tar.gz
drwxr-xr-x@ 3 simon  staff    96B Jan  5 09:51 General
➜  reg du -sh General
166M    General
➜  reg du -sh -B 1 -A General
 18M    General

simonbyrne commented 7 months ago

Honestly, we may want to consider some sort of lightweight database to store this information: the disk usage of all these small files is getting pretty big.

simonbyrne commented 7 months ago

Or switch to xz: it gives a 4.5MB file.

GunnarFarneback commented 7 months ago

Starting from General registry tree hash 793278ad7a09a821cfac38e86fc150f6c9a00f7f (current about an hour ago), this has a size of 7063293 bytes from the package servers. Unpacking it and repacking it with Tar.create produces a tar file of the same size as the uncompressed package server tarball. Compressing it with gzip -9 gives a size of 7093553.

Wait, the gzip is larger? I know that hashes should not be compressible, but we store them in hex digits which should leave plenty of redundancy.

No, what I'm saying is that gzip -9 gives a (slightly) worse compression result than whatever is used to compress the tarballs on the package server. The uncompressed original tar file is 58 MB.

GunnarFarneback commented 7 months ago

I don't have a strong opinion whether this information is worth the size increase. Or rather, I do have concerns about the size, and I have in the past had timeout issues with the General registry on a company internal package server. But I also see a value in the added information.

KristofferC commented 7 months ago

Honestly, we may want to consider some sort of lightweight database to store this information: the disk usage of all these small files is getting pretty big.

But we only decompress it in memory so?

GunnarFarneback commented 7 months ago

You probably mean decompress.

The compressed tarball size is what matters for disk storage per installation, registry download size, and the registry part of the package server load. The decompressed tar file size matters for the in memory handling of the registry. The unpacked disk size only matters for those of us who like to look manually at the registry files or grep through them, or do other non-standard operations.
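The three sizes can be told apart on a toy example. The sketch below builds an in-memory tar of many small files and reports the compressed tarball size, the decompressed tar size, and an estimate of the unpacked disk usage. The 4096-byte filesystem block size is an assumption, the function name is hypothetical, and the numbers it produces are not registry measurements.

```python
import gzip
import io
import tarfile


def registry_like_sizes(files, block_size=4096):
    """Return the three sizes for `files`, a dict of path -> bytes content."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    tar_bytes = buffer.getvalue()
    # Each file occupies a whole number of filesystem blocks once unpacked.
    on_disk = sum(-(-len(data) // block_size) * block_size for data in files.values())
    return {
        "compressed_tarball": len(gzip.compress(tar_bytes, compresslevel=9)),
        "decompressed_tar": len(tar_bytes),
        "unpacked_on_disk": on_disk,
        "content_bytes": sum(len(data) for data in files.values()),
    }
```

For many files that are much smaller than a block, the unpacked figure is dominated by per-file rounding, which is the same effect as the gap between the two du results (166M versus 18M) shown earlier in the thread.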