getsentry / craft

The universal Sentry release CLI 🚀
MIT License
133 stars 15 forks source link

GH Artifact lookup times out #482

Open mydea opened 1 year ago

mydea commented 1 year ago

It happens repeatedly for us that publishes time out when looking for artifacts uploaded to github. One example: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767

This happens basically on every sentry-javascript release if we approve it while the checks are still running, so it seems to me something is not right with the waiting. Maybe we need to extend the ARTIFACTS_POLLING_INTERVAL (currently 10s), or extend the MAX_TRIES (currently 3)?

asottile-sentry commented 1 year ago

from the logs it looks like craft sees your CI as successful before the artifact upload job has started -- once it sees success it moves on to trying to locate artifacts from the successful CI

you can configure the necessary jobs here

mydea commented 1 year ago

Ah, I see, so basically this is completely wrong the way it is set up currently 😅

So I need to add this https://github.com/getsentry/sentry-javascript/actions/runs/5715683859/job/15486522728 as the required context, so:

statusProvider:
  name: github
  config:
    contexts:
      - All required tests passed or skipped
Lms24 commented 1 year ago

@asottile-sentry can you clarify what contexts should be set to? We tried the name property of the job but that's not it as craft tells us it can't find the context.

For reference, here's the job we want to wait on: https://github.com/getsentry/sentry-javascript/blob/754b08279c171ab730224d2432e2e89327fc5932/.github/workflows/build.yml#L858-L873

Lms24 commented 1 year ago

Or is it enough to just set without any context?

statusProvider:
  name: github
asottile-sentry commented 1 year ago

I am not sure -- I'm just going by the docs :S

BYK commented 1 year ago

@Lms24 newest versions of Craft use github as the default status provider. Let me look into your runs.

BYK commented 1 year ago

Very first step would be to change this line: https://github.com/getsentry/sentry-javascript/blob/99e347907812873cc0c832aa25968c3db6587af1/.craft.yml#L1

To say 1.4.2 instead of 0.23.1. I don't remember when we change the defaults for the context but the more recent version the better.

Moreover, the issue you originally reported doesn't seem to be related to contexts. I'll dig a bit more and report back here.

BYK commented 1 year ago

If you look at the logs starting from here: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767#step:10:80

It actually just waits for all the checks to pass. The context parameter is to give it an explicit list of contexts to pass (so you don't have to wait for all).

BYK commented 1 year ago

So here in the artifact upload job for the failed publish, it says it successfully uploaded the artifacts at 23 past the hour: https://github.com/getsentry/sentry-javascript/actions/runs/5715683859/job/15485977459#step:6:970

The publish job fails at 27 minutes past the hour: https://github.com/getsentry/publish/actions/runs/5715907847/job/15486244767#step:10:133

That means for some reason GitHub API did not make the uploaded artifacts available for about 4 minutes. That's quite long. It maybe because the artifacts seem quite large.

I googled a bit to see if there's any mention of an expected delay for artifacts to be available but failed to find one. You may wanna raise this issue with GitHub support for investigation. In the meantime, either increasing the number of tries or extending the delay as you originally suggested makes sense for a workaround.

Lms24 commented 1 year ago

@BYK thanks for looking into this!

To say 1.4.2 instead of 0.23.1. I don't remember when we change the defaults for the context but the more recent version the better.

I'm confused now. I thought we just always use the latest craft version when starting a release via getsentry/publish? We only recently were blocked by a bug in the latest craft version. So why would it take an older version just for the status provider?

It actually just waits for all the checks to pass. The context parameter is to give it an explicit list of contexts to pass (so you don't have to wait for all).

Ok, so this, plus your last reply, plus the fact that GH is already the default status provider, suggests to me that our config doesn't need adjustments but it's a GH problem 🤔 The weird thing here is that our initial publish attempts fail often. Really often. I'm wondering how often that's overall because of test flakes vs. this artifact retrieval problem.

In the meantime, either increasing the number of tries or extending the delay as you originally suggested makes sense for a workaround.

Still seems like a good idea to me.

BYK commented 1 year ago

My pleasure!

I'm confused now. I thought we just always use the latest craft version when starting a release via getsentry/publish? We only recently were blocked by a bug in the latest craft version. So why would it take an older version just for the status provider?

Sorry for the confusion I caused. Some years ago, we did not default to GitHub as the status provider. That was changed around version 0.21.0 and we added a check: if the config file mentioned a version earlier than 0.21.0, don't default to anything. For anything else, use GitHub. So this file was already mentioning 0.23.1, hence past that version, hence nothing to change or worry about. Moreover, that epoch check was removed from the code a long time ago too 😅

Ok, so this, plus your last reply, plus the fact that GH is already the default status provider, suggests to me that our config doesn't need adjustments but it's a GH problem 🤔

Exactly! If you had an issue with your config, you'd have known about it a very long time ago. .craft.yml does not seem to get many updates.

The weird thing here is that our initial publish attempts fail often. Really often. I'm wondering how often that's overall because of test flakes vs. this artifact retrieval problem.

That is very easy to check and quantify. Just go check the logs and see why they fail. This one you linked was due to GitHub not making artifacts available quickly enough. If this is frequent, I'd blame that as if your tests fail I think you get another fail email for the failed workflow on the release branch.

Lms24 commented 1 year ago

The same thing happened again. Opened #484 to fix this. I spent too much time this week trying to fix releases. This needs to be improved.

BYK commented 1 year ago

The artifacts however are available here: https://github.com/getsentry/sentry-javascript/actions/runs/5752023602

I'd definitely raise this with GitHub Support. Again, I suspect it may have todo something with the number of files or the size of the unzipped artifacts.

Lms24 commented 1 year ago

Hmm I'm not sure if we're missing something here. I think there might still be some problem in the status provider. Looking at the run I linked, we can see the following time stamps:

So we started downloading artifacts before the action even completed.

Shouldn't the status provider in Craft only start downloading artifacts once the entire action completed (by default)? I have a feeling that the artifacts might only be available once the action completed. Does this make sense?

BYK commented 1 year ago

Shouldn't the status provider in Craft only start downloading artifacts once the entire action completed (by default)? I have a feeling that the artifacts might only be available once the action completed. Does this make sense?

Aha, that makes sense! Then I guess calculating the combined status of the commit step may some issues. That part was always a bit tricky. Relevant code is here:

https://github.com/getsentry/craft/blob/58d5b3c10fde734b60eb7752712352a2993af86b/src/status_providers/github.ts#L87-L114

We may wanna do some digging into the API and modernize that code, especially due to the "legacy checks" part.

asottile-sentry commented 1 year ago

@Lms24 it seems in that case then a job would need to be added to your workflow which requires all others to finish -- then you'd use that as the indicator for craft's status

Lms24 commented 1 year ago

I guess it's worth trying as long as sentry-javascript is the only affected repo. But it means we're adapting to broken behaviour.

asottile-sentry commented 1 year ago

I think what's happening is there's a point where everything is passed (all statuses green, none pending) and craft can't know that there's more things to run

Lms24 commented 1 year ago

Soo, I just tried to set

statusProvider:
  name: github
  config:
    contexts:
      - job_required_jobs_passed

but the release failed again.

I'm going to try jobs.job_required_jobs_passed next which should be a valid GH context but I have no idea if this is even working correctly.

When I query the API endpoint manually with

gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/getsentry/sentry-javascript/commits/583e0bd53968b56d0f63c132c32b989dbcb999cb/status

I get an empty array for statuses which should contain the context we're waiting on.

{
  "state": "pending",
  "statuses": [],
  // ...
}

Not sure if this endpoint even works correctly. For instance, it always returns pending and this was already reported here.

Lms24 commented 1 year ago

Update: I tried jobs.job_required_jobs_passed in dry run mode and it didn't work either. Not sure what's happening here but if I had to guess that endpoint seems broken.

Lms24 commented 1 year ago

Given that there's no way to make this work with our current status checks:

I'm wondering if we should use the Actions endpoint instead of the commit based endpoints: https://docs.github.com/en/rest/actions/workflow-jobs?apiVersion=2022-11-28#get-a-job-for-a-workflow-run

Seems like this is what we need to poll the status of a specific job. Now what's left to figure out is

BYK commented 1 year ago

Actions endpoint should it be used here as that's specific to GitHub actions but the GitHub status provider (and GitHub status checks) are independent from GitHub Actions.

You may wanna try graphql endpoints.

Lms24 commented 1 year ago

Alright folks, I'm done with trying to fix this. Given that there's a somewhat working but annoying workaround (namely, telling our managers to wait ~30min with adding the accept label in getsentry/publish) I can't justify working on this any longer. My gut feeling tells me that we can't be the only repo that's affected by this but what can I do...

To whoever picks this up: If we can't easily change the current github status provider, maybe the path of least friction/resistance is to create a second Github (Action) status provider that uses working endpoints. Individual repos could opt-in to use this provider instead and we wouldn't risk breaking existing publishing configs who rely on the older endpoints (if there are any).