Closed dtinth closed 2 years ago
Thank you for the report. This looks like a duplicate of #3322 (where I suggested going out and buying a lottery ticket.)
It looks like you've identified the code that needs to change. I would welcome a PR, but also am obligated to recommend you consider upgrading to Flux v2 as the FAQ mentions scaling issues as a primary reason for moving on to a new design with breaking changes: https://fluxcd.io/legacy/flux/faq/#migrate-to-flux-v2
This is not the thing we typically consider when we think of scaling issues, but it is certainly a scaling issue in a way.
I don't think increasing the length of the slice used from the Revision string would be considered a breaking change, but I will have to run it by the maintainers. We are meeting soon. I think this can be addressed before Flux 1.24.0 is released.
Thanks again for bringing this to our attention and adding some further color and detail to the existing report.
I will push a tag with a change from 7 characters to 12, that you can use to test if we can resolve this without creating other issues (regardless of whether it goes in a PATCH release or a MINOR release, you probably want to fix this right away...)
Actually, I see the issue you've identified is in fluxctl itself... have you been able to compile a new version already, (or will this be of any help if I can provide it as a pre-release?)
There is some time, probably about a month, before Flux 1.24.0 release is scheduled to come out, according to the milestones I set in this repo here, primarily chosen to be realistic and allow enough time for other potentially important issues to resurface from Flux v1 users and their reports from Flux instances in the wild. Thanks again for your report!
Besides fluxctl
, the [:7] also surfaces in ShortRevision
and in func (e Event) String() string
, in the event
package
I see event is used in daemon
and sync.go
, also a part of daemon
. This is not surprising.
It will probably require both, a new Flux image and a new fluxctl
binary, to resolve this. (I'm guessing they should match.)
I believe that means this is necessarily a MINOR change, since fluxd
and fluxctl
releases in the same minor series would most likely be expected to work together without issue, even if there was patch skew.
There is #3519 (https://github.com/fluxcd/flux/tree/fix-too-short-revision-hash) which you can build from, or I'll provide.
Once this gets merged into master, a prerelease image will be published in one of the fluxcd
repos, for flux prereleases. I do not have a repo with a collision to use for testing against, although I'm sure it wouldn't be too difficult to write a test that covers this issue, ... I'm assuming since you've redacted parts of the report that you can't necessarily share your test repo, but maybe you can adopt this image and fluxctl binary, to provide confirmation that this fixes the issue for you.
I would suggest waiting until e2e tests have all passed on #3519 before giving it a try.
docker.io/kingdonb/flux:fix-too-short-revision-hash-cb71e6a3
is built and pushed now.
I'm not sure what architecture and OS you need for fluxctl
, but I've just went ahead and built them all:
$ make build-fluxctl
for arch in amd64; do \
for os in linux darwin windows; do
...
$ ls -l build/fluxctl_*
-rwxr-xr-x 1 kingdonb staff 42264496 Jul 29 14:29 build/fluxctl_darwin_amd64
-rwxr-xr-x 1 kingdonb staff 42602282 Jul 29 14:29 build/fluxctl_linux_amd64
-rwxr-xr-x 1 kingdonb staff 36552868 Jul 29 14:31 build/fluxctl_linux_arm
-rwxr-xr-x 1 kingdonb staff 40317705 Jul 29 14:31 build/fluxctl_linux_arm64
-rwxr-xr-x 1 kingdonb staff 43205632 Jul 29 14:30 build/fluxctl_windows_amd64
Please let me know how else I can help!
I've got to report, I'm afraid I've been nerd-sniped.
I did the birthday problem in University calculus and approximations, but I'm not sure how to do this math problem, to see how many commits result in what likelihood of collision ... I got to the point where I said "if there are 7 character width and 16 hex digits, that makes 57,657,600 possible birthdays in each 'year' ... " and I'm afraid Wolfram Alpha won't take me through a calculation with digit counts that high.
If you're willing to retrace your math here to come up with a figure of ~17%, and help me establish that 12 digits provides a more reasonable confidence interval / and up to what scale it starts breaking down again, I'd be very glad to not have to figure out again exactly how to make the calculation for myself. (I've been having nightmares about going back to college and this frankly is not helping, LOL.)
The tests on #3519 appear to have all passed. You can reach me on #flux in the CNCF slack if you want any kind of help to get the matching fluxctl
binary.
Actually, I see the issue you've identified is in fluxctl itself... have you been able to compile a new version already, (or will this be of any help if I can provide it as a pre-release?)
I currently do not have access to the FluxCD installation, and given that we are using this in production and realized that the problem can be easily worked around by pushing a new commit to the repo (so that fluxctl sync
sees a different, non-colliding hash), I'm afraid having this case fixed is probably not a big priority for us, and thus I may not be able to help with testing this. Sorry.
I would welcome a PR, but also am obligated to recommend you consider upgrading to Flux v2 as the FAQ mentions scaling issues as a primary reason for moving on to a new design with breaking changes
I will let my colleague know about this. Thank you.
If you're willing to retrace your math here to come up with a figure of ~17%, and help me establish that 12 digits provides a more reasonable confidence interval / and up to what scale it starts breaking down again, I'd be very glad to not have to figure out again exactly how to make the calculation for myself.
This is the calculation script:
from decimal import *
prefix_length = 7
commits_count = 10000
n = 16 ** prefix_length
c = n
pr = Decimal(1)
for _ in range(commits_count):
pr *= Decimal(c) / Decimal(n)
c -= 1
print(str((1 - pr) * 100) + '%')
Output is:
16.99324542926372482290564190%
Here’s the probability chart for 10000, 100000, and 1000000 commits.
I would say that your suggested prefix size of 12 is fine. Given a new commit hash every five minutes, it would take about 22 years until the probability of a collision exceeds 1%.
This is awesome. I don't know whether we will fix this issue, since the workaround seems pretty trivial (thanks for sharing that as well.) But the fix is in the PR, and it will be considered for 1.24.0, which would be the next MINOR release of Flux v1.
Thanks for the clarification on the math and also for explaining the workaround that can be used.
It has just been pointed out to me, that Flux v1 is formally superseded now, since the Flux v2 APIs have been declared as stable: https://fluxcd.io/docs/migration/timetable/
There may not be a Flux v1.24.0. I will keep this open until I have further clarity, but thanks again for communicating about the issue!
This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend users to migrate to Flux 2 at their earliest convenience.
More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.
We at @taskworld got this error while running
fluxctl sync
.It seems like
fluxctl
has been hardcoded to use only the first 7 letters of the commit name.https://github.com/fluxcd/flux/blob/2fd24cc5d909930c7745ec884d9060f1239b7291/cmd/fluxctl/sync_cmd.go#L75-L82
Over 1.5 year our FluxCD-managed git repository has 10000+ commits. Using the 🎂 birthday paradox formula, there is ~17% chance that at least one 7-character-long prefix matches more than one commit. I would suggest making this a bit longer, or just use the full commit hash.