Closed jainsahab closed 5 years ago
@jainsahab thanks for submitting this pull request!
I'll hopefully take a look and get this merged sometime this week (probably at the weekend!) 😄
I merged this locally and got some linting failures
/Users/cm/code/nevergreen/test/client/gateways/Gateway.test.ts
6:15 error Strings must use singlequote quotes
6:46 error Extra semicolon semi
7:28 error Extra semicolon semi
11:59 error Extra semicolon semi
14:38 error Strings must use singlequote quotes
15:45 error Strings must use singlequote quotes
16:51 error Strings must use singlequote quotes
17:7 error Extra semicolon semi
20:58 error Extra semicolon semi
23:38 error Strings must use singlequote quotes
24:45 error Strings must use singlequote quotes
25:51 error Strings must use singlequote quotes
26:7 error Extra semicolon semi
29:60 error Extra semicolon semi
32:38 error Strings must use singlequote quotes
33:45 error Strings must use singlequote quotes
34:51 error Strings must use singlequote quotes
35:7 error Extra semicolon semi
38:58 error Extra semicolon semi
41:38 error Strings must use singlequote quotes
42:45 error Strings must use singlequote quotes
43:7 error Extra semicolon semi
✖ 22 problems (22 errors, 0 warnings)
Was simple to fix so I've just done so locally, but just a reminder we have a pre-push.sh
script that runs all the linting and tests mimicking what CI would do 😄
Note: I also realise linting style is a bit annoying and we should probably use something like prettier
instead...
Note note: I did try running prettier over the code, but it made a lot of things less readable IMO. So I wasn't ready to let my stylistic choices go just yet! 😬
Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?
So the UI is timing out after 30 seconds which is the response timeout
req.timeout({response:ms}) sets maximum time to wait for the first byte to arrive from the server, but it does not limit how long the entire download can take.
which is because the server doesn't respond with anything until all the added CI servers have responded...
I'm not sure what to do other than make that a minute as well?
Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?
I have them fixed locally already, I just haven't pushed yet as I was doing a little manual QA and noticed the issue I commented on above.
Edit: I also found a bug on the tracking page that wasn't clearing errors correctly, which I've fixed.
So the UI is timing out after 30 seconds which is the response timeout
req.timeout({response:ms}) sets maximum time to wait for the first byte to arrive from the server, but it does not limit how long the entire download can take.
which is because the server doesn't respond with anything until all the added CI servers have responded...
I'm not sure what to do other than make that a minute as well?
Could you please elaborate in detail? Not sure what's the problem here?
Hi @GentlemanHal , have you just fixed them locally, or want me to push a fix ?
I have them fixed locally already, I just haven't pushed yet as I was doing a little manual QA and noticed the issue I commented on above.
Apologies again, i was quick and pushed a fix.
The UI uses SuperAgent to make http calls and it has two timeouts, deadline and response. The response timeout is how long to wait for the first byte and the deadline is how long to wait for the entire request.
But the Nevergreen server doesn't stream any response bytes but instead waits for all the added CI servers to response before sending anything back to the UI.
This means SuperAgent is aborting the connection after 30 seconds instead of after the deadline timeout of 1 minute. This in turn means the deadline timeout you added to the Nevergreen server of 50 seconds is relevant as the connection has already been aborted.
Setting the UI response timeout to 1 minute also solves this issue, but I'm not sure if it's the correct thing to do.
Does that make sense?
This means SuperAgent is aborting the connection after 30 seconds instead of after the deadline timeout of 1 minute. This in turn means the deadline timeout you added to the Nevergreen server of 50 seconds is relevant as the connection has already been aborted.
If you are saying that UI timeout is solving this issue? I don't think so. UI timeout will only close the connection of UI -> nevergreen server, but the nevergreen server -> UI server connection is not closed yet and it'll keep the CI server engaged. That's the problem we are trying to solve to prevent the overload on already slow CI server.
which is because the server doesn't respond with anything until all the added CI servers have responded...
Yes, nevergreen server won't respond with anything until all the packets have arrived from CI server to nevergreen server. and if that takes more than 50 seconds, It'll close the connection to CI server.
Could you please check if this statement is logged when UI is timed out?
If we are getting that statement printed right after UI is timed out. then Yes, we don't need timeout on the nevergreen server
For now I'm going to commit this with the response timeout also set to 1 minute. The only real issue I can see with doing this is if the Nevergreen server itself is having issues, you'd have to wait a minute to find which isn't that big a deal.
But the advantage of allowing users who have added multiple CI servers to still get a response from working servers is huge.
Hi @GentlemanHal , I quickly tried out above things and made some changes to cctray mocks script.
With this change, even after UI is time(d)out after 30 seconds, nevergreen is closing the connection after 50 seconds. which is good since it's not waiting permanently for CI server to respond.
Only a UI timeout won't solve the problem, because nevergreen server never close the connection explicitly which results in increased load on CI.
I think we are both on the same page but perhaps slightly different paragraphs! 😆
The UI timing out after 30 seconds is bad, because it will automatically trigger a retry. Your changes of timing out the server after 50 seconds still helps to take some load of off the broken CI server but you'd basically still end up with 2 requests constantly.
The correct solution is to timeout the server before timing out the UI. This has the advantage of only ever sending one request at a time to the CI server and it also means if the user has added multiple CI servers, working servers will still return projects correctly (on the tracking page this doesn't matter but makes a huge difference to the monitor page!).
The simplest fix is to also set the UI response timeout to 1 minute, which I've already done locally and tested.
More complicated fixes would involve streaming data back as soon as it was available, but that is certainly out of scope for now but is perhaps something we could look into for the future.
Yes @GentlemanHal , agreed. that makes complete sense.
what about these conflicts, I am wondering if my last commit for lint fix
is causing those conflicts or what else?
what about these conflicts, I am wondering if my last commit for
lint fix
is causing those conflicts or what else?
I've pushed all my changes to master including the lint fixes I made. I probably shouldn't have done that? Either way I think we are good in master so this pull request can be closed 😄
@GentlemanHal Just one last thing Should I push a fix to resolve these conflicts?
I've pushed all my changes to master including the lint fixes I made. I probably shouldn't have done that? Either way I think we are good in master so this pull request can be closed 😄
Ohh, I am not sure, if this way it will be counted as a valid PR for hacktoberfest. Anyways, no worries.
Ohh, I am not sure, if this way it will be counted as a valid PR for hacktoberfest. Anyways, no worries.
Opps! I did follow the command line instructions to merge the original commits and github does know you were the author, so hopefully it will?!
Opps! I did follow the command line instructions to merge the original commits and github does know you were the author, so hopefully it will?!
I hope so 🤞, Meanwhile, I'll look into other issues and make them count.
The build has failed on CI at the functional tests 😢
Trying to fetch https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml
(which is the "CI server" the functional test adds on CI) also fails.
My first thought is the fact the URL is https and not http, as all the http URLs from the fake CI server work correctly.
Any ideas?
https
and http
shouldn't be causing this.
Could you please tell me, how can i run these test locally against https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml?
You need to start the server manually (easiest way is using ./develop.sh
) then run:
CYPRESS_TRAY_URL="https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml" CYPRESS_TRAY_USERNAME="" CYPRESS_TRAY_PASSWORD="" npm run cypress:open
(remove the :open
if you prefer to run them headless)
You can also just manually add the URL on the tracking page.
Hi @GentlemanHal , Sorry for delay.
I found this error, and my doubt is, in cc_tray.xml this specific line is missing <?xml version="1.0" encoding="utf-8"?>
as we have in go_cd.xml and it's happening only in case of remote url (probably https could be a reason) but not happening against local mocks (http ones).
I think I figured it out, since we are getting the raw response from the future we are no longer using all the default middleware.
The response from GitHub is gzip encoded, and the default middleware was previously decoding it for us.
https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/client.clj#L1094
Edit: doing some console logging I can see the raw response is a org.apache.http.nio.entity.ContentInputStream
and if we allow the middleware to run we get a java.util.zip.GZIPInputStream
so this is definitely the issue.
Not really sure the best way to fix, I don't really want to be running all the middleware manually and I assume we'd have to make the entire server async if we wanted to use the callbacks?!
But we are still using the old methods like clj-http.client/get
, clj-http.client/post
etc. If you trace the calls, you'll find middlewares are still being wrapped around request.
Yes, we are getting org.apache.http.nio.entity.ContentInputStream
with my changes but earlier we were getting java.io.FileInputStream
, So it shouldn't make any difference. It should be something else which is causing this problem.
We were getting a clj_http.core.proxy$java.io.FilterInputStream
before, https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/core.clj#L461
I'm sure because we are waiting directly on the future we don't run this code https://github.com/dakrone/clj-http/blob/3.x/src/clj_http/core.clj#L657 which is what is wrapping the request into a nice clojure request, we are just getting the raw Java request directly.
Reverting the changes, makes it work again correctly. Which I've done for now to fix the build pipeline.
Yeah @GentlemanHal , for now reverting is fine. I'll also dig in more and see what's going on in there?
Hi @GentlemanHal , I've pushed a fix to the same branch. Please review it and let me know if you have any feedbacks. I had to force push the code, since SHA ids were chagned in nevergreen's master branch.
I think I figured it out, since we are getting the raw response from the future we are no longer using all the default middleware.
The response from GitHub is gzip encoded, and the default middleware was previously decoding it for us.
You were right, we were using the raw response which caused this issue. At the same time, all middlewares were being used but they didn't run since we did not use respond
and raise
callbacks. I've verified it locally and have run the automation test using CYPRESS_TRAY_URL="https://raw.githubusercontent.com/build-canaries/nevergreen/master/cctray_xml_feed_mock/resources/cctray.xml" CYPRESS_TRAY_USERNAME="" CYPRESS_TRAY_PASSWORD="" npm run cypress
and have also verified pre-push.sh
script before i pushed my changes.
Merged and passed CI successfully, thanks for working on this @jainsahab 👍
Fixes issue #254 Simulating deadline timeout by asynchronously executing the calls to
ci-server
and cancelling the call after a timeout of 1 minute occurs.@build-canaries/owners