dotnet-eng-status[bot] closed this issue 1 year ago
Taking a look
This is just fallout from https://github.com/dotnet/arcade/issues/11898. Specifically, the signing certificates and provisioning profiles of the Apple TV devices expired today. Unfortunately this alert will stay active until this is sorted, since following the instructions did not solve the problem.
:green_heart: Metric state changed to ok
Description and instructions for this alert
Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.
Actually, I believe it DID solve the problem. I just didn't realize that it fetches the mobile device profile files at build time, so replaying existing jobs didn't work. Closing as fixed.
Re "Specifically, the signing certificates and provisioning profiles of the Apple TV devices expired today": were we caught by surprise by this, or do we have a process to renew the certificates preemptively? /cc @premun
We had a fake secret set up so that we would get notified in advance about the expiration but we never did. We think the secret might have gotten removed accidentally (maybe during the secret sweep'n'clean epic?).
@MattGal can we set up a secret for next year?
I remember us having the discussion about this when we were closing the mobile space epic, so I'd really like to understand how exactly we missed this in order to improve the process. Do we have this described somewhere (incl. the exact dummy secret used)? If not, what would be the best place to describe it?
I think the best place is to put an instruction in the guide for secret renewal: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki/862/Apple-(iOS-tvOS)-signing-certificates?anchor=generating-a-new-certificate-%26-provisioning-profiles
Last step could be "Create a secret XY in keyvault Z" or similar.
I cannot find the issue for this; it's somewhere in core-eng, I think. Or possibly it was moved to arcade but is not tied to the epic anymore.
I don't think fake secrets for the sake of alerting are a good practice. Have we considered implementing something to help with the cycling that doesn't require manual sending of helix jobs for example? Can it be something we deploy to the machines as part of the work ddfun does when setting them up?
> We had a fake secret set up so that we would get notified in advance about the expiration but we never did. We think the secret might have gotten removed accidentally (maybe during the secret sweep'n'clean epic?).
> @MattGal can we set up a secret for next year?
The secret sweep and clean epic hasn't removed any secrets. If there was a fake unused secret and someone found it and disabled it, that's yet another reason not to rely on dummy secrets.
For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling. I would even prefer that approach to the dummy secret.
> Have we considered implementing something to help with the cycling that doesn't require manual sending of helix jobs for example? Can it be something we deploy to the machines as part of the work ddfun does when setting them up?
@riarenas as far as I know we don't have any mechanism to update artifacts on OnPrem machines (outside of asking DDFUN to manually perform them).
These device queues are usually empty, so this has been an easy enough task so far. We use the same approach for Xcode updates. It doesn't take much, and it's easy to verify which machines have received the updates, but I realize it's not ideal and it would be nice to have a different mechanism for that.
> For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling.
Yup, that sounds good.
> For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling. I would even prefer that approach to the dummy secret.
I wasn't aware of this implementation. It will probably work well, but I see some drawbacks with this approach:
We might have another need to address periodic renewal in dotnet/arcade-services#2355 if ESRP is not able to take care of the GPG keys, so it seems we need to think about a systemic solution for these cases.
Agreed. I don't like that approach either, but I prefer it over the dummy secret that shouldn't be in a key vault in the first place.
Ideally, everything related to secret management should be automated via secret-manager. Any secrets that need renewal would be in a secret manifest, and would have clear instructions of how that type of secret is cycled whenever it's not possible to do it with automation. That's why I'm being hard on folks trying to add new secret types without a proper cycling plan such as the PGP keys.
This component is not new, though, and I recognize that we weren't as strict with our secrets while the mobile devices epic was underway, so I'd rather offer the current alternatives than ask for a full-on rework of how these certificates are handled.
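To make the "notify in advance" idea from this thread concrete, here is a minimal sketch of an expiry check over a list of tracked items. Everything in it is an assumption for illustration: the manifest layout, the entry names, the dates, and the 30-day warning window are hypothetical and do not reflect secret-manager's actual manifest format or the real certificate dates.

```python
from datetime import date, timedelta

# Hypothetical manifest: names and dates are illustrative only,
# not the real certificates or secret-manager's manifest format.
MANIFEST = [
    {"name": "apple-signing-certificate", "expires_on": date(2025, 3, 1)},
    {"name": "apple-provisioning-profile", "expires_on": date(2025, 9, 1)},
]


def expiring_soon(entries, today, warn_days=30):
    """Return names of entries that expire within warn_days of today.

    Entries that have already expired are included as well.
    """
    cutoff = today + timedelta(days=warn_days)
    return [e["name"] for e in entries if e["expires_on"] <= cutoff]


if __name__ == "__main__":
    # Print a reminder for every entry inside the warning window.
    for name in expiring_soon(MANIFEST, date.today()):
        print(f"Renewal needed soon: {name}")
```

Something like this could run on a schedule and open an issue or send a reminder, which would avoid both the dummy secret and the manual meeting invite. Where it would run and how it would notify are left open here.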
:broken_heart: Metric state changed to alerting
Go to rule
@dotnet/dnceng, please investigate
Automation information below, do not change
Grafana-Automated-Alert-Id-d70761f3c7e84a6380e44943a2e583e6