JuliaRegistries / TagBot

Creates tags, releases, and changelogs for your Julia packages when they're registered
https://github.com/marketplace/actions/julia-tagbot
MIT License
91 stars 18 forks source link

Subdirectory support broken due to wrong tree hash verification #281

Closed sloede closed 11 months ago

sloede commented 12 months ago

TL;DR

If I am not mistaken, the current implementation for supporting subdirectory tagging is broken due to an apple-with-oranges-style comparison of the commit tree hash and the package tree hash, which are not identical for subdirectory tags.

Long version

The setup

After #279 had been merged, we tried to use TagBot to tag a new release of LibTrixi.jl, which lives in the LibTrixi.jl subdirectory of https://github.com/trixi-framework/libtrixi. Correspondingly, we have configured TagBot with subdir: LibTrixi.jl.

The error message

However, during its run, TagBot quit early with the following messages:

Warning: Tree SHA of commit from registry PR does not match
Warning: No matching commit was found for version v0.1.2 (7911434c8b808550e3bd6c321dd50da91b128a9e)

(see this log message).

The origin

The "Tree SHA ... does not match" error comes from the following snippet: https://github.com/JuliaRegistries/TagBot/blob/7ac587806eabe2831c819161afa2eb81834c9c43/tagbot/action/repo.py#L258-L263 where m[1] holds the first 32 characters of the commit hash that has been extracted from the PR body of the corresponding PR to https://github.com/JuliaRegistries/General. In our case, it got the commit hash 6e615c91364595e8ad1feca67ebfbb39 from https://github.com/JuliaRegistries/General/pull/91262#issue-1892562735.

As an additional safety measure (or maybe it is required to catch ambiguities, but I cannot see where they might come from), in line 259 the tree hash of the registered commit (commit.commit.tree.sha) is compared against the tree hash stored in the Julia registry for the corresponding package version. In our case, the version is v0.1.2, and TagBot extracted the package tree hash 7911434c8b808550e3bd6c321dd50da91b128a9e from this file.

The bug (?)

And this is where the problem is located: The package tree hash 7911... is obtained by computing the git tree hash of the package directory, which in our case is the subdirectory LibTrixi.jl. The commit tree hash, against it is compared here, however, is computed on the entire repository.

To quickly verify that they are different, you can check the different strings from the libtrixi repo:

git clone git@github.com:trixi-framework/libtrixi.git
cd libtrixi
git show --name-only 6e615c91364595e8ad1feca67ebfbb391c4ede00 --pretty=raw | grep ^tree | awk '{print "commit tree hash: ", $2}'
echo "package tree hash: $(git rev-parse HEAD:LibTrixi.jl)"

which will output something like

commit tree hash:  26cb3d6465884c915ee97f20e4d6c978338782f5
package tree hash: 7911434c8b808550e3bd6c321dd50da91b128a9e

Obviously, this comparison of the commit tree hash and the package tree hash is bound to fail whenever the package lives in a subdirectory of the repository.

The solution

Currently, I see no obvious way out of this: Relying on the tree hash to determine the appropriate commit hash works only if the tree hash was computed for the entire repository, as otherwise there is no unique tree hash -> commit hash relation anymore. Since we are not interested in the content here (in which case that would be fine) but in the unique commit, this mechanism does fundamentally not work for subdirectory tags.

The only viable option I see atm is to remove the fallback mechanism for the commit identification and solely rely on the registry PR information. If this is not an option, one could consider doing this only for subdirectory tags and to leave the existing machinery in place for full-repo tags. Removing the (currently usuable) subdirectory support is imho not an option.

What are your thoughts on this @christopher-dG @IanButterworth (and maybe @hannahilea)?

cc @bgeihe

christopher-dG commented 12 months ago

Right, that makes sense and that was an oversight when support was initially added.

It looks like if we want to do this right, we'd need to implement tree hash computation from scratch, as I don't think the Git CLI provides a way to do so.

sloede commented 12 months ago

It looks like if we want to do this right, we'd need to implement tree hash computation from scratch

I am not sure that this is a solution though:

The problem is in _filter_map_versions, https://github.com/JuliaRegistries/TagBot/blob/7ac587806eabe2831c819161afa2eb81834c9c43/tagbot/action/repo.py#L311 which is called from new_versions and which takes a dict of "version" -> "tree hash" with all Julia package versions that have been registered within the last lookback days. _filter_map_versions is then supposed to convert the tree hash to the corresponding commit hash.

But that's where I see the issue that we cannot easily work around: This conversion only works if there is a unique tree hash -> commit hash mapping. However, for a subdirectory package, the tree hash of the subdirectory does not change if the subdirectory contents are left untouched. That is, there are potentially a lot of git commits that have a common tree hash for (some of) their subdirectories.

To summarize, unless I am missing something (which might very well be the case), there is just no way to uniquely identify a git commit through a tree hash of a subdirectory.

Therefore, I think to retain the ability for TagBot to tag subdirectory packages (actually: to actually make it work in the first place), the whole version identification mechanism has to be rethought (and reimplemented). Since the Julia registry does not record commit hashes for its versions in the repository (only tree hashes), literally the only place where we have a unique connection between a Julia package version and its commit hash is currently the JuliaRegistrator PR.

Thus, we would either need to convince the Julia registry to start recording commit hashes for new package versions together with the tree hash. Alternatively, we could limit the subdirectory tagging (or TagBot in its entirety) to work only with versions that have been registered through the JuliaRegistrator, which creates the PRs with the commit hash information in an expected format.

ericphanson commented 12 months ago

To double check: is this check a consistency check (verify TagBot is running on the right commit, which it should be doing bc of some other mechanism), OR is this to figure out what commit to run on at all (which commit to tag)?

If it is a consistency check, then for subdirs my thought would be: as long as the commit in question indeed contains the right subdir hash (even if there are other commits that also contain that subdir), then we are consistent and the check can pass.

sloede commented 12 months ago

is this check a consistency check (verify TagBot is running on the right commit, which it should be doing bc of some other mechanism), OR is this to figure out what commit to run on at all (which commit to tag)?

In general, a little bit of both. In _filter_map_versions, https://github.com/JuliaRegistries/TagBot/blob/7ac587806eabe2831c819161afa2eb81834c9c43/tagbot/action/repo.py#L311-L334 TagBot first tries to get the commit hash from the registry PR using _commit_sha_from_registry_pr in line 316. It currently fails here because of the tree hash mismatch (as described above), but otherwise would be able to get the commit hash we need.

However, if the PR approach fails, TagBot falls back to using _commit_sha_of_tree in line 318. This actually tries to get the commit hash by comparing the tree hash against the tree hashes of previous commits. It is here where only using a tree hash to find the corresponding commit fundamentally fails, since it is not a unique relationship anymore.

However, we could follow your idea,

as long as the commit in question indeed contains the right subdir hash (even if there are other commits that also contain that subdir), then we are consistent and the check can pass.

and do the following: If a relevant PR was found and and a commit hash was successfully extracted and subdir is non-empty, we compare the subdir tree hash for the found commit hash against the package version tree hash. If they match, we can be reasonably sure that everything is OK and proceed with tagging.

This would make the subdir feature more restrictive (it would only work for JuliaRegistrator-triggered version registrations, not for manual PRs), but that should cover >95% (my personal estimate) of all use cases in practice. The only remaining difficulty would be here to actually obtain the tree hash, but this could probably be done through _git.

sloede commented 12 months ago

I've created https://github.com/JuliaRegistries/TagBot/pull/282 which implements the suggested solution.