dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License
671 stars 347 forks

Production - [Alerting] Apple device failure rate alert #11904

Closed dotnet-eng-status[bot] closed 1 year ago

dotnet-eng-status[bot] commented 1 year ago

:broken_heart: Metric state changed to alerting

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-d70761f3c7e84a6380e44943a2e583e6

ulisesh commented 1 year ago

Taking a look

MattGal commented 1 year ago

This is just fallout from https://github.com/dotnet/arcade/issues/11898. Specifically, the signing certificates and provisioning profiles of the Apple TV devices expired today. Unfortunately this alert will stay active until this is sorted, since following the instructions did not solve the problem.

dotnet-eng-status[bot] commented 1 year ago

:green_heart: Metric state changed to ok

Description and instructions for this alert

Please note that this alert will fire every 12 hours as the list of machines can change while the alert is alive. So please keep an eye on the list of machines in the comment.

MattGal commented 1 year ago

Actually, I believe it DID solve the problem. I just didn't realize that the mobile device profile files are fetched at build time, so replaying existing jobs didn't work. Closing as fixed.

tkapin commented 1 year ago

Re "Specifically, the signing certificates and provisioning profiles of the Apple TV devices expired today": were we caught by surprise by this, or do we have a process to renew the certificates preemptively? /cc @premun

premun commented 1 year ago

We had a fake secret set up so that we would get notified in advance about the expiration but we never did. We think the secret might have gotten removed accidentally (maybe during the secret sweep'n'clean epic?).

@MattGal can we set up a secret for next year?

tkapin commented 1 year ago

I remember us discussing this when we were closing the mobile space epic, so I'd really like to understand exactly how we missed this in order to improve the process. Is this described somewhere (incl. the exact dummy secret used)? If not, what would be the best place to describe it?

premun commented 1 year ago

I think the best place is to put an instruction in the guide for secret renewal: https://dev.azure.com/dnceng/internal/_wiki/wikis/DNCEng%20Services%20Wiki/862/Apple-(iOS-tvOS)-signing-certificates?anchor=generating-a-new-certificate-%26-provisioning-profiles

Last step could be "Create a secret XY in keyvault Z" or similar.

I cannot find the issue for this; it's somewhere in core-eng, I think. Or it was possibly moved to arcade but is no longer tied to the epic.

riarenas commented 1 year ago

I don't think fake secrets for the sake of alerting are good practice. Have we considered implementing something to help with the cycling that doesn't require manual sending of helix jobs for example? Can it be something we deploy to the machines as part of the work ddfun does when setting them up?

riarenas commented 1 year ago

> We had a fake secret set up so that we would get notified in advance about the expiration but we never did. We think the secret might have gotten removed accidentally (maybe during the secret sweep'n'clean epic?).
>
> @MattGal can we set up a secret for next year?

The secret sweep and clean epic hasn't removed any secrets. If there was a fake unused secret and someone found it and disabled it, that's yet another reason not to rely on dummy secrets.

riarenas commented 1 year ago

For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling. I would even prefer that approach to the dummy secret.

premun commented 1 year ago

> Have we considered implementing something to help with the cycling that doesn't require manual sending of helix jobs for example? Can it be something we deploy to the machines as part of the work ddfun does when setting them up?

@riarenas as far as I know we don't have any mechanism to update artifacts on OnPrem machines (outside of asking DDFUN to manually perform them).

These device queues are usually empty, so this has been an easy enough task so far. We use the same approach for Xcode updates, though it is not ideal. It doesn't take much effort, and it's easy to verify which machines have received the updates. Still, I realize it would be nice to have a different mechanism for that.

> For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling.

Yup, that sounds good.

tkapin commented 1 year ago

> For some secrets like service connections where we haven't hooked up any automated way to perform secret cycling, we have started creating meeting invites to remind when the PATs that back them would need cycling. I would even prefer that approach to the dummy secret.

I wasn't aware of this implementation. It will probably work well, but I see some drawbacks with this approach:

We might have another need to address periodic renewal in dotnet/arcade-services#2355 if ESRP is not able to take care of the GPG keys, so it seems we need to think about a systemic solution for these cases.

riarenas commented 1 year ago

Agreed. I don't like that approach either, but I prefer it over the dummy secret that shouldn't be in a key vault in the first place.

Ideally, everything related to secret management should be automated via secret-manager. Any secrets that need renewal would be in a secret manifest, with clear instructions for how that type of secret is cycled whenever it's not possible to do it with automation. That's why I'm being hard on folks trying to add new secret types, such as the PGP keys, without a proper cycling plan.

This component is not new, though, and I recognize that we weren't as strict with our secrets while the mobile devices epic was underway, so I'd rather offer the current alternatives than ask for a full-on rework of how these certificates are handled.
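
For illustration only: the reminder options raised in this thread (a dummy secret, meeting invites, a secret manifest) all come down to comparing a known expiry date against a warning window. The Python sketch below shows that comparison in isolation. It is not how secret-manager or the Grafana alert works; the item names, dates, and 30-day window are hypothetical placeholders, not values from this issue.

```python
from datetime import date


def days_until_expiry(expires_on: date, today: date) -> int:
    """Number of days left before the item expires (negative if already expired)."""
    return (expires_on - today).days


def needs_renewal(expires_on: date, today: date, warn_days: int = 30) -> bool:
    """True when the item expires within the warning window or has already expired."""
    return days_until_expiry(expires_on, today) <= warn_days


def expiring_items(inventory: dict, today: date, warn_days: int = 30) -> list:
    """Names of all inventory items that should trigger a renewal reminder."""
    return sorted(
        name
        for name, expires_on in inventory.items()
        if needs_renewal(expires_on, today, warn_days)
    )


# Hypothetical inventory: names and dates are made up for this example.
inventory = {
    "apple-tv-signing-certificate": date(2023, 12, 1),
    "apple-tv-provisioning-profile": date(2024, 6, 1),
}

if __name__ == "__main__":
    for name in expiring_items(inventory, today=date(2023, 11, 15)):
        print(f"Renewal reminder: {name}")
```

Keeping the expiry dates in an explicit inventory, rather than inferring them from a placeholder secret, means that deleting an unused-looking entry elsewhere cannot silently disable the reminder.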