fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0
6.9k stars 1.08k forks source link

🎂 `fluxctl sync` fails when more than 1 commit shares the first 7 letters of commit hash #3517

Closed dtinth closed 2 years ago

dtinth commented 3 years ago

We at @taskworld got this error while running fluxctl sync.

+ fluxctl sync --k8s-fwd-ns=[redacted]
Synchronizing with ssh://git@github.com/[redacted]/[redacted].git
Revision of master to apply is f1965f6

Waiting for ff965f6 to be applied ...
== Error ==

Error: short SHA1 ff965f6 is ambiguous, full output:
 error: short SHA1 ff965f6 is ambiguous
hint: The candidates are:
hint:   ff965f65e commit 2021-07-29 - [redacted]
hint:   ff965f6d5 commit 2020-11-18 - Auto-release [redacted]/[redacted]:[redacted]
fatal: bad revision 'ff965f65e8b8f689527b5557cd4c14[redacted]..ff965f6'

We don't have a specific help message for the error above.

It would help us remedy this if you log an issue at

    https://github.com/fluxcd/flux/issues

saying what you were doing when you saw this, and quoting the message
at the top.

It seems like fluxctl has been hardcoded to use only the first 7 letters of the commit name.

https://github.com/fluxcd/flux/blob/2fd24cc5d909930c7745ec884d9060f1239b7291/cmd/fluxctl/sync_cmd.go#L75-L82

Over 1.5 year our FluxCD-managed git repository has 10000+ commits. Using the 🎂 birthday paradox formula, there is ~17% chance that at least one 7-character-long prefix matches more than one commit. I would suggest making this a bit longer, or just use the full commit hash.

kingdonb commented 3 years ago

Thank you for the report. This looks like a duplicate of #3322 (where I suggested going out and buying a lottery ticket.)

It looks like you've identified the code that needs to change. I would welcome a PR, but also am obligated to recommend you consider upgrading to Flux v2 as the FAQ mentions scaling issues as a primary reason for moving on to a new design with breaking changes: https://fluxcd.io/legacy/flux/faq/#migrate-to-flux-v2

This is not the thing we typically consider when we think of scaling issues, but it is certainly a scaling issue in a way.

I don't think increasing the length of the slice used from the Revision string would be considered a breaking change, but I will have to run it by the maintainers. We are meeting soon. I think this can be addressed before Flux 1.24.0 is released.

Thanks again for bringing this to our attention and adding some further color and detail to the existing report.

kingdonb commented 3 years ago

I will push a tag with a change from 7 characters to 12, that you can use to test if we can resolve this without creating other issues (regardless of whether it goes in a PATCH release or a MINOR release, you probably want to fix this right away...)

Actually, I see the issue you've identified is in fluxctl itself... have you been able to compile a new version already, (or will this be of any help if I can provide it as a pre-release?)

There is some time, probably about a month, before Flux 1.24.0 release is scheduled to come out, according to the milestones I set in this repo here, primarily chosen to be realistic and allow enough time for other potentially important issues to resurface from Flux v1 users and their reports from Flux instances in the wild. Thanks again for your report!

kingdonb commented 3 years ago

Besides fluxctl, the [:7] also surfaces in ShortRevision and in func (e Event) String() string, in the event package

I see event is used in daemon and sync.go, also a part of daemon. This is not surprising.

It will probably require both, a new Flux image and a new fluxctl binary, to resolve this. (I'm guessing they should match.)

I believe that means this is necessarily a MINOR change, since fluxd and fluxctl releases in the same minor series would most likely be expected to work together without issue, even if there was patch skew.

kingdonb commented 3 years ago

There is #3519 (https://github.com/fluxcd/flux/tree/fix-too-short-revision-hash) which you can build from, or I'll provide.

Once this gets merged into master, a prerelease image will be published in one of the fluxcd repos, for flux prereleases. I do not have a repo with a collision to use for testing against, although I'm sure it wouldn't be too difficult to write a test that covers this issue, ... I'm assuming since you've redacted parts of the report that you can't necessarily share your test repo, but maybe you can adopt this image and fluxctl binary, to provide confirmation that this fixes the issue for you.

I would suggest waiting until e2e tests have all passed on #3519 before giving it a try.

docker.io/kingdonb/flux:fix-too-short-revision-hash-cb71e6a3 is built and pushed now.

I'm not sure what architecture and OS you need for fluxctl, but I've just went ahead and built them all:

$ make build-fluxctl
for arch in amd64; do \
        for os in linux darwin windows; do
...
$ ls -l build/fluxctl_*
-rwxr-xr-x  1 kingdonb  staff  42264496 Jul 29 14:29 build/fluxctl_darwin_amd64
-rwxr-xr-x  1 kingdonb  staff  42602282 Jul 29 14:29 build/fluxctl_linux_amd64
-rwxr-xr-x  1 kingdonb  staff  36552868 Jul 29 14:31 build/fluxctl_linux_arm
-rwxr-xr-x  1 kingdonb  staff  40317705 Jul 29 14:31 build/fluxctl_linux_arm64
-rwxr-xr-x  1 kingdonb  staff  43205632 Jul 29 14:30 build/fluxctl_windows_amd64

Please let me know how else I can help!

kingdonb commented 3 years ago

I've got to report, I'm afraid I've been nerd-sniped.

I did the birthday problem in University calculus and approximations, but I'm not sure how to do this math problem, to see how many commits result in what likelihood of collision ... I got to the point where I said "if there are 7 character width and 16 hex digits, that makes 57,657,600 possible birthdays in each 'year' ... " and I'm afraid Wolfram Alpha won't take me through a calculation with digit counts that high.

If you're willing to retrace your math here to come up with a figure of ~17%, and help me establish that 12 digits provides a more reasonable confidence interval / and up to what scale it starts breaking down again, I'd be very glad to not have to figure out again exactly how to make the calculation for myself. (I've been having nightmares about going back to college and this frankly is not helping, LOL.)

The tests on #3519 appear to have all passed. You can reach me on #flux in the CNCF slack if you want any kind of help to get the matching fluxctl binary.

dtinth commented 3 years ago

Actually, I see the issue you've identified is in fluxctl itself... have you been able to compile a new version already, (or will this be of any help if I can provide it as a pre-release?)

I currently do not have access to the FluxCD installation, and given that we are using this in production and realized that the problem can be easily worked around by pushing a new commit to the repo (so that fluxctl sync sees a different, non-colliding hash), I'm afraid having this case fixed is probably not a big priority for us, and thus I may not be able to help with testing this. Sorry.

I would welcome a PR, but also am obligated to recommend you consider upgrading to Flux v2 as the FAQ mentions scaling issues as a primary reason for moving on to a new design with breaking changes

I will let my colleague know about this. Thank you.

If you're willing to retrace your math here to come up with a figure of ~17%, and help me establish that 12 digits provides a more reasonable confidence interval / and up to what scale it starts breaking down again, I'd be very glad to not have to figure out again exactly how to make the calculation for myself.

This is the calculation script:

from decimal import *

prefix_length = 7
commits_count = 10000

n = 16 ** prefix_length
c = n
pr = Decimal(1)
for _ in range(commits_count):
    pr *= Decimal(c) / Decimal(n)
    c -= 1

print(str((1 - pr) * 100) + '%')

Output is:

16.99324542926372482290564190%

Here’s the probability chart for 10000, 100000, and 1000000 commits.

image

I would say that your suggested prefix size of 12 is fine. Given a new commit hash every five minutes, it would take about 22 years until the probability of a collision exceeds 1%.

kingdonb commented 3 years ago

This is awesome. I don't know whether we will fix this issue, since the workaround seems pretty trivial (thanks for sharing that as well.) But the fix is in the PR, and it will be considered for 1.24.0, which would be the next MINOR release of Flux v1.

Thanks for the clarification on the math and also for explaining the workaround that can be used.

kingdonb commented 3 years ago

It has just been pointed out to me, that Flux v1 is formally superseded now, since the Flux v2 APIs have been declared as stable: https://fluxcd.io/docs/migration/timetable/

There may not be a Flux v1.24.0. I will keep this open until I have further clarity, but thanks again for communicating about the issue!

pjbgf commented 2 years ago

This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend users to migrate to Flux 2 at their earliest convenience.

More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.