Strider-CD / strider

Open Source Continuous Integration & Deployment Server
http://strider-cd.github.io/

Long prepare phase, refresh brings the server to a halt #919

Open knownasilya opened 8 years ago

knownasilya commented 8 years ago

I've noticed this happening sometimes:

  1. Multiple commits come in
  2. I cancel the first ones so I don't have to wait twice as long
  3. The remaining job sometimes gets stuck in "prepare" phase.
  4. If I refresh the page now, it never loads and I get a gateway timeout instead. I have to restart the server.

I'm not sure why it halts in "prepare", but that seems to be the culprit here. Maybe an error isn't being handled correctly.

SimonKaluza commented 7 years ago

My Strider instance HTTP server also starts timing out if multiple commit hooks come in at the same time and multiple projects are built. After a couple minutes it eventually seems to come back online (without a restart), but I'll see several 502s from my reverse proxy in the meantime if I try to load the Strider dashboard or send another commit webhook within those several minutes.

I'm not sure where the delay is, but I'm especially curious what's blocking the HTTP thread. I thought most of the prepare phase would be delegated to the workers?

knownasilya commented 7 years ago

Did you enable concurrent builds? That should help with multiple projects.

SimonKaluza commented 7 years ago

@knownasilya yeah, I'm at CONCURRENT_JOBS=4. Would that affect the Strider HTTP server though? I don't mind waiting for the jobs to complete. The problem is that some of the GitHub webhooks are being dropped due to timeouts.

knownasilya commented 7 years ago

That's weird. Maybe the timeout isn't sufficient for your proxy? The webhooks respond back to GitHub almost instantly, once the job has been scheduled.

SimonKaluza commented 7 years ago

I verified it's not a problem with my reverse proxy by running curl localhost:3000 immediately after a project begins the test/deploy cycle. I can actually reproduce it just by manually triggering one job via Test and Deploy in the UI and then immediately running curl localhost:3000. The request takes considerably longer if even one job is being prepared: requests to the Strider index usually take approximately 1-2 seconds, but while a job is being prepared the request takes approximately 30 seconds.

The same curl localhost:3000 request takes 3-4 minutes if 3-4 jobs are being started (even with 8 concurrent workers), which is too long for GitHub/BitBucket webhooks.
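For anyone who wants to put numbers on this, here is a minimal sketch of the timing check. It assumes Strider is listening on localhost:3000 as in my setup. The time_request helper name, the sample count, and the pause between samples are only illustrative.

```shell
#!/bin/sh
# Print the total time in seconds that curl needed for one request.
# -s hides the progress meter, -o /dev/null discards the response body,
# --max-time caps the wait, and -w '%{time_total}' prints the elapsed time.
time_request() {
  curl -s -o /dev/null --max-time 300 -w '%{time_total}\n' "$1"
}

# Sample the Strider index a few times. Run this right after triggering
# a job so that at least one sample falls inside the prepare phase.
# (Three samples and a 1-second pause are arbitrary choices.)
for i in 1 2 3; do
  time_request "http://localhost:3000/" || true
  sleep 1
done
```

Running it once while the server is idle and once while a job is being prepared gives two sets of timings to compare.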

SimonKaluza commented 6 years ago

I downgraded our server back to a much older version of Strider, and the problem is resolved. I'm not sure which Strider commits introduced this problem, but the old version we're running again now (https://github.com/Strider-CD/strider/commit/84a6b878f0b1b3d3528d3f5f19251353f07b4ea7) works great.

knownasilya commented 6 years ago

I've updated the simple-runner with additional debug statements, so if you have time to investigate in the future, please do. Run with DEBUG=strider* to see if there is a runner error. You'll have to update the simple-runner in the plugins first.
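A rough sketch of setting that variable before starting the server. The start command is left as a comment because it depends on how you run Strider. Both `npm start` and the strider-debug.log file name are assumptions, so substitute your own.

```shell
# Turn on debug output for every namespace that begins with "strider".
# The quotes keep the * literal when the value is exported.
export DEBUG='strider*'

# Then start Strider as you normally would and capture the output, e.g.
# (assumed start command and log file name, adjust to your setup):
# npm start 2>&1 | tee strider-debug.log
```

With the output captured, look for runner errors logged around the time a job enters the prepare phase.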