build-canaries / nevergreen

:baby_chick: A build monitor with attitude
https://nevergreen.io
Eclipse Public License 1.0
215 stars 38 forks source link

Deadline timeout #293

Closed jainsahab closed 5 years ago

jainsahab commented 5 years ago

Fixes issue #254 Simulating deadline timeout by asynchronously executing the calls to ci-server and cancelling the call after a timeout of 1 minute occurs.

@build-canaries/owners

GentlemanHal commented 5 years ago

@jainsahab thanks for submitting this pull request!

I'll hopefully take a look and get this merged sometime this week (probably at the weekend!) 😄

GentlemanHal commented 5 years ago

I merged this locally and got some linting failures

/Users/cm/code/nevergreen/test/client/gateways/Gateway.test.ts
   6:15  error  Strings must use singlequote  quotes
   6:46  error  Extra semicolon               semi
   7:28  error  Extra semicolon               semi
  11:59  error  Extra semicolon               semi
  14:38  error  Strings must use singlequote  quotes
  15:45  error  Strings must use singlequote  quotes
  16:51  error  Strings must use singlequote  quotes
  17:7   error  Extra semicolon               semi
  20:58  error  Extra semicolon               semi
  23:38  error  Strings must use singlequote  quotes
  24:45  error  Strings must use singlequote  quotes
  25:51  error  Strings must use singlequote  quotes
  26:7   error  Extra semicolon               semi
  29:60  error  Extra semicolon               semi
  32:38  error  Strings must use singlequote  quotes
  33:45  error  Strings must use singlequote  quotes
  34:51  error  Strings must use singlequote  quotes
  35:7   error  Extra semicolon               semi
  38:58  error  Extra semicolon               semi
  41:38  error  Strings must use singlequote  quotes
  42:45  error  Strings must use singlequote  quotes
  43:7   error  Extra semicolon               semi

✖ 22 problems (22 errors, 0 warnings)

Was simple to fix so I've just done so locally, but just a reminder we have a pre-push.sh script that runs all the linting and tests mimicking what CI would do 😄

Note: I also realise linting style is a bit annoying and we should probably use something like prettier instead...

Note note: I did try running prettier over the code, but it made a lot of things less readable IMO. So I wasn't ready to let my stylistic choices go just yet! 😬

jainsahab commented 5 years ago

Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?

GentlemanHal commented 5 years ago

So the UI is timing out after 30 seconds which is the response timeout

req.timeout({response:ms}) sets maximum time to wait for the first byte to arrive from the server, but it does not limit how long the entire download can take.

which is because the server doesn't respond with anything until all the added CI servers have responded...

I'm not sure what to do other than make that a minute as well?

GentlemanHal commented 5 years ago

Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?

I have them fixed locally already, I just haven't pushed yet as I was doing a little manual QA and noticed the issue I commented on above.

Edit: I also found a bug on the tracking page that wasn't clearing errors correctly, which I've fixed.

jainsahab commented 5 years ago

So the UI is timing out after 30 seconds which is the response timeout

req.timeout({response:ms}) sets maximum time to wait for the first byte to arrive from the server, but it does not limit how long the entire download can take.

which is because the server doesn't respond with anything until all the added CI servers have responded...

I'm not sure what to do other than make that a minute as well?

Could you please elaborate in detail? Not sure what's the problem here?

jainsahab commented 5 years ago

Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?

I have them fixed locally already, I just haven't pushed yet as I was doing a little manual QA and noticed the issue I commented on above.

Apologies again, i was quick and pushed a fix.

GentlemanHal commented 5 years ago

The UI uses SuperAgent to make http calls and it has two timeouts, deadline and response. The response timeout is how long to wait for the first byte and the deadline is how long to wait for the entire request.

But the Nevergreen server doesn't stream any response bytes but instead waits for all the added CI servers to response before sending anything back to the UI.

This means SuperAgent is aborting the connection after 30 seconds instead of after the deadline timeout of 1 minute. This in turn means the deadline timeout you added to the Nevergreen server of 50 seconds is relevant as the connection has already been aborted.

Setting the UI response timeout to 1 minute also solves this issue, but I'm not sure if it's the correct thing to do.

Does that make sense?

jainsahab commented 5 years ago

This means SuperAgent is aborting the connection after 30 seconds instead of after the deadline timeout of 1 minute. This in turn means the deadline timeout you added to the Nevergreen server of 50 seconds is relevant as the connection has already been aborted.

If you are saying that UI timeout is solving this issue? I don't think so. UI timeout will only close the connection of UI -> nevergreen server, but the nevergreen server -> UI server connection is not closed yet and it'll keep the CI server engaged. That's the problem we are trying to solve to prevent the overload on already slow CI server.

which is because the server doesn't respond with anything until all the added CI servers have responded...

Yes, nevergreen server won't respond with anything until all the packets have arrived from CI server to nevergreen server. and if that takes more than 50 seconds, It'll close the connection to CI server.

jainsahab commented 5 years ago

Could you please check if this statement is logged when UI is timed out?

If we are getting that statement printed right after UI is timed out. then Yes, we don't need timeout on the nevergreen server

GentlemanHal commented 5 years ago

For now I'm going to commit this with the response timeout also set to 1 minute. The only real issue I can see with doing this is if the Nevergreen server itself is having issues, you'd have to wait a minute to find which isn't that big a deal.

But the advantage of allowing users who have added multiple CI servers to still get a response from working servers is huge.

jainsahab commented 5 years ago

Hi @GentlemanHal , I quickly tried out above things and made some changes to cctray mocks script.

image

With this change, even after UI is time(d)out after 30 seconds, nevergreen is closing the connection after 50 seconds. which is good since it's not waiting permanently for CI server to respond.

Only a UI timeout won't solve the problem, because nevergreen server never close the connection explicitly which results in increased load on CI.

GentlemanHal commented 5 years ago

I think we are both on the same page but perhaps slightly different paragraphs! 😆

The UI timing out after 30 seconds is bad, because it will automatically trigger a retry. Your changes of timing out the server after 50 seconds still helps to take some load of off the broken CI server but you'd basically still end up with 2 requests constantly.

The correct solution is to timeout the server before timing out the UI. This has the advantage of only ever sending one request at a time to the CI server and it also means if the user has added multiple CI servers, working servers will still return projects correctly (on the tracking page this doesn't matter but makes a huge difference to the monitor page!).

The simplest fix is to also set the UI response timeout to 1 minute, which I've already done locally and tested.

More complicated fixes would involve streaming data back as soon as it was available, but that is certainly out of scope for now but is perhaps something we could look into for the future.

jainsahab commented 5 years ago

Yes @GentlemanHal , agreed. that makes complete sense.

what about these conflicts, I am wondering if my last commit for lint fix is causing those conflicts or what else?

GentlemanHal commented 5 years ago

what about these conflicts, I am wondering if my last commit for lint fix is causing those conflicts or what else?

I've pushed all my changes to master including the lint fixes I made. I probably shouldn't have done that? Either way I think we are good in master so this pull request can be closed 😄

jainsahab commented 5 years ago

@GentlemanHal Just one last thing Should I push a fix to resolve these conflicts?

jainsahab commented 5 years ago

I've pushed all my changes to master including the lint fixes I made. I probably shouldn't have done that? Either way I think we are good in master so this pull request can be closed 😄

Ohh, I am not sure, if this way it will be counted as a valid PR for hacktoberfest. Anyways, no worries.

GentlemanHal commented 5 years ago

Ohh, I am not sure, if this way it will be counted as a valid PR for hacktoberfest. Anyways, no worries.

Opps! I did follow the command line instructions to merge the original commits and github does know you were the author, so hopefully it will?!

jainsahab commented 5 years ago

Opps! I did follow the command line instructions to merge the original commits and github does know you were the author, so hopefully it will?!

I hope so 🤞, Meanwhile, I'll look into other issues and make them count.

GentlemanHal commented 5 years ago

The build has failed on CI at the functional tests 😢

Trying to fetch https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml (which is the "CI server" the functional test adds on CI) also fails.

My first thought is the fact the URL is https and not http, as all the http URLs from the fake CI server work correctly.

Any ideas?

jainsahab commented 5 years ago

https and http shouldn't be causing this. Could you please tell me, how can i run these test locally against https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml?

GentlemanHal commented 5 years ago

You need to start the server manually (easiest way is using ./develop.sh) then run:

CYPRESS_TRAY_URL="https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml" CYPRESS_TRAY_USERNAME="" CYPRESS_TRAY_PASSWORD="" npm run cypress:open

(remove the :open if you prefer to run them headless)

You can also just manually add the URL on the tracking page.

jainsahab commented 5 years ago

Hi @GentlemanHal , Sorry for delay.

image

I found this error, and my doubt is, in cc_tray.xml this specific line is missing <?xml version="1.0" encoding="utf-8"?> as we have in go_cd.xml and it's happening only in case of remote url (probably https could be a reason) but not happening against local mocks (http ones).

GentlemanHal commented 5 years ago

I think I figured it out, since we are getting the raw response from the future we are no longer using all the default middleware.

The response from GitHub is gzip encoded, and the default middleware was previously decoding it for us.

https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/client.clj#L1094

Edit: doing some console logging I can see the raw response is a org.apache.http.nio.entity.ContentInputStream and if we allow the middleware to run we get a java.util.zip.GZIPInputStream so this is definitely the issue.

Not really sure the best way to fix, I don't really want to be running all the middleware manually and I assume we'd have to make the entire server async if we wanted to use the callbacks?!

jainsahab commented 5 years ago

But we are still using the old methods like clj-http.client/get, clj-http.client/post etc. If you trace the calls, you'll find middlewares are still being wrapped around request.

Yes, we are getting org.apache.http.nio.entity.ContentInputStream with my changes but earlier we were getting java.io.FileInputStream, So it shouldn't make any difference. It should be something else which is causing this problem.

GentlemanHal commented 5 years ago

We were getting a clj_http.core.proxy$java.io.FilterInputStream before, https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/core.clj#L461

I'm sure because we are waiting directly on the future we don't run this code https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/core.clj#L657 which is what is wrapping the request into a nice clojure request, we are just getting the raw Java request directly.

Reverting the changes, makes it work again correctly. Which I've done for now to fix the build pipeline.

jainsahab commented 5 years ago

Yeah @GentlemanHal , for now reverting is fine. I'll also dig in more and see what's going on in there?

jainsahab commented 5 years ago

Hi @GentlemanHal , I've pushed a fix to the same branch. Please review it and let me know if you have any feedbacks. I had to force push the code, since SHA ids were chagned in nevergreen's master branch.

jainsahab commented 5 years ago

I think I figured it out, since we are getting the raw response from the future we are no longer using all the default middleware.

The response from GitHub is gzip encoded, and the default middleware was previously decoding it for us.

You were right, we were using the raw response which caused this issue. At the same time, all middlewares were being used but they didn't run since we did not use respond and raise callbacks. I've verified it locally and have run the automation test using CYPRESS_TRAY_URL="https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml" CYPRESS_TRAY_USERNAME="" CYPRESS_TRAY_PASSWORD="" npm run cypress and have also verified pre-push.sh script before i pushed my changes.

GentlemanHal commented 5 years ago

Merged and passed CI successfully, thanks for working on this @jainsahab 👍