Opening an issue to track when the secure messaging API returns internal server errors (code 500) on staging.
Link for convenience: https://staging.vets.gov/api/v0/messaging/health/folders.
Occurred again at the time of submitting this issue.
Thanks for opening this @U-DON . Right now from MHV when trying to get a session I get:
<Error><developerMessage>Severity[Error]:database.exception;</developerMessage><errorCode>901</errorCode><message>Authentication Service Error</message></Error>
I've sent an email and will follow up as soon as I hear back. I guess the Heroku API is the one to use in the meantime. Sorry about this; so frustrating!
This is working for me:
{"data":[{"id":"0","type":"va_healthcare_messaging_folders","attributes":{"folder_id":0,"name":"Inbox","count":2,"unread_count":0,"system_folder":true},"links":{"self":"https://staging.vets.gov/api/v0/messaging/health/folders/0"}},{"id":"-2","type":"va_healthcare_messaging_folders","attributes":{"folder_id":-2,"name":"Drafts","count":31,"unread_count":28,"system_folder":true},"links":{"self":"https://staging.vets.gov/api/v0/messaging/health/folders/-2"}}, ... and so on.
Occasionally, I'll get an internal server error on either a GET for folder messages or thread messages, while the other one works just fine, even though both requests are made at the same time while the page is loading. Don't know what to make of it. I refresh until they both work, but it usually doesn't take more than one refresh.
Going to keep this open but move to the backlog. Anyone please post here when the API starts returning 500s again so we can keep track of these!
@U-DON how has the SM API been this week? still flakey? more stable?
@ayaleloehr Looks like it didn't go down for any extended period of time where we would get 500s, but it has been giving a lot of intermittent 400s lately. I often have to do multiple refreshes to get all the requests on a single page load to return correctly.
{
  "errors": [
    {
      "code": "900",
      "detail": "Authentication service error",
      "status": "400",
      "title": "Operation failed"
    }
  ]
}
Very interesting. API documentation says 900 is "Mailbox Service Error" and 901 is the "Authentication Service Error", but clearly from the above 900 is giving "Authentication Service Error". Strange.
Will follow up with Ken about why this is mislabeled, which error it actually is, and why it is happening. Thanks @U-DON
@U-DON: the MHV API team just wrote back saying they think this was an issue with one of their instances behind the load balancer not functioning properly. They restarted the machines, which they hope fixes it. Can you please let me know if you get these intermittent errors again? And if so, having a timestamp of when it happened would be great so MHV can check their logs. Thanks!
@ayaleloehr that 900 is a code we're setting on our server side -- it references the title "Operation failed" and has nothing to do with what MHV uses for its codes; that's strictly coincidental. The detail is the only dynamic part of this particular error type, and it relays whatever the specific client response was that resulted in the failure.
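For illustration only, the envelope in the responses below could be assembled roughly like this (this is a sketch, not the actual vets-api code; `mhv_detail` is a placeholder for whatever MHV sent back):

```ruby
# Rough sketch: the code/title are fixed on our side, and only the detail
# is copied from whatever the MHV client response said.
def operation_failed_error(mhv_detail)
  {
    errors: [
      {
        title:  'Operation failed',
        detail: mhv_detail,   # e.g. "Authentication service error", relayed from MHV
        code:   '900',        # our code, unrelated to MHV's own error codes
        status: 400
      }
    ]
  }
end
```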
Got the error again just now.
GET /api/v0/messaging/health/folders/0/messages HTTP/1.1
Host: localhost:3000
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
x-key-inflection: camel
origin: http://localhost:3000
Referer: http://localhost:3000/messaging/folder/0
Connection: keep-alive
If-None-Match: W/"22d5f18eec4837c3ebc34e5b76de226a"
Cache-Control: max-age=0
HTTP/1.1 400 Bad Request
X-Powered-By: Express
Server: nginx/1.8.1
Date: Tue, 27 Sep 2016 15:24:05 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 107
Connection: close
access-control-allow-headers: Authorization
Cache-Control: no-cache
X-Git-SHA: 5ce0a67fc80dd02ea010fca0a433d940fbf854e0
X-GitHub-Repository: https://github.com/department-of-veterans-affairs/vets-api
X-Request-Id: 765f4c58-cd26-4d7c-a578-9f84f95eb843
X-Runtime: 0.458725
access-control-allow-credentials: true
{"errors":[{"title":"Operation failed","detail":"Authentication service error","code":"900","status":400}]}
Anything that is "operation failed" is coming directly from MHV more or less, so this could suggest that the service is temporarily down? I know they are doing QA on their end; it would be great if we could get on an email list that lets us know when they are doing deployments that could result in downtime on these systems.
It worked for one request but not for another, although both were made on the same page load. And both worked on the previous page load. Unless certain parts of the service can go down independently.
Here's another response from a failed request if it's helpful for them to inspect. This failed for getting threads, while getting all folders worked.
GET /api/v0/messaging/health/messages/621407/thread HTTP/1.1
Host: localhost:3000
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
x-key-inflection: camel
origin: http://localhost:3000
Referer: http://localhost:3000/messaging/thread/621407
Connection: keep-alive
If-None-Match: W/"01a01dc1e7565e64f2c718b7df43ba48"
HTTP/1.1 400 Bad Request
X-Powered-By: Express
Date: Tue, 27 Sep 2016 15:37:08 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 107
Connection: close
access-control-allow-headers: Authorization
Cache-Control: no-cache
X-Git-SHA: 5ce0a67fc80dd02ea010fca0a433d940fbf854e0
X-GitHub-Repository: https://github.com/department-of-veterans-affairs/vets-api
X-Request-Id: d48ecc1e-7ee1-4cab-bc6f-9bcef8aa33f3
X-Runtime: 0.389722
access-control-allow-credentials: true
Set-Cookie: TS018eb766=01c8917e48c10f4cc83d3ef3b9de1a13f486f4994dee727358122049d6109870261ecbb402; Max-Age=900; Path=/
{"errors":[{"title":"Operation failed","detail":"Authentication service error","code":"900","status":400}]}
This could be related to NTP as well. Maybe the time for authentication is out of sync and it causes intermittent issues? @jkassemi any thoughts?
Yeah, that's possible, but can't say for sure. Sounds like it's still one instance in their pool causing the trouble.
Can you guys run a script that attempts a valid request every 5 seconds or so and records the timestamp and response? A history of failures would help them debug.
Sending along a parameter with a unique string value with each request from that script would help with grepping logs, too.
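Something along these lines might work as that script (a rough sketch: the `probe_id` query parameter is just an arbitrary tag for grepping, not something the API expects, and the real requests would also need whatever session/auth headers the staging API requires):

```ruby
# Rough sketch of a poller: hit the staging endpoint every 5 seconds,
# tag each request with a unique string for log grepping, and record
# the timestamp, status, and start of the body of every response.
require 'net/http'
require 'securerandom'
require 'time'

BASE = URI('https://staging.vets.gov/api/v0/messaging/health/folders')

loop do
  probe_id = SecureRandom.hex(8)        # unique value to grep for in MHV/vets-api logs
  uri = BASE.dup
  uri.query = "probe_id=#{probe_id}"    # hypothetical tagging parameter

  response = Net::HTTP.get_response(uri)
  puts "#{Time.now.utc.iso8601} #{probe_id} #{response.code} #{response.body[0, 120]}"

  sleep 5
end
```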
@U-DON I didn't follow up with MHV on this yesterday (sorry about that!). Can you post the timestamps here for a couple times this happens today and I'll follow up with MHV so they can look at their logs for those times? Thank you!
Sure. Here's one:
GET /api/v0/messaging/health/folders/-2/messages HTTP/1.1
Date: Thu, 29 Sep 2016 15:47:00 GMT
After multiple email exchanges with MHV I believe we are actually creating a race condition in our API that is causing this. Let me try to write up what I think is happening.
The inbox page on vets.gov SM needs to make two calls on page load: /folders to get the folder list and /folders/:id/messages to get the messages in the inbox. Here is what I believe is happening when this fails:
1) /folders calls the MHV SM API to get a session token. MHV returns Session Token A.
2) /folders/:id/messages calls the MHV SM API to get a session token. MHV returns Session Token B, which invalidates Session Token A.
3) /folders uses Session Token A to fetch the folder list. Session Token A is invalid, so a "900 Authentication service error" is returned.
4) /folders/:id/messages uses Session Token B to fetch the messages in the inbox. Session Token B is valid and this call returns successfully.
I think if we cached session tokens, which it definitely seems like we should be doing, it might make this issue go away. Thoughts on what it would take to start doing that @saneshark ?
I'm struggling to replicate this with curl unfortunately, so it's possible my logic above isn't actually correct about the invalid session tokens (especially because those should return a 109 error, not a 900).
BUT I still think this is a race condition we're causing on our side, AND we should be caching session tokens in Redis and checking whether one already exists before requesting a new one anyway, so I think we should implement that as soon as we can to see if it fixes this problem.
Final thought is that caching session tokens should also reduce query time greatly. Right now we do 2 round trips to MHV for every API call (getting a session token and then making the real call). Reducing this to 1 round trip in the average case (we will usually have a cached token) will roughly halve query times, which would be huge! And the MHV team has also asked us to cache tokens so we don't overload their system, so that's a final reason we need to do so. :)
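For what it's worth, here is a minimal sketch of the kind of caching being proposed (the key name, TTL, and `fetch_new_session_token` are illustrative stand-ins, not the actual vets-api client):

```ruby
# Illustrative only: reuse one MHV session token per user instead of
# requesting a new one for every API call.
require 'redis'

class SessionTokenCache
  TTL_SECONDS = 15 * 60   # assumed token lifetime; the real value would come from MHV

  def initialize(redis: Redis.new)
    @redis = redis
  end

  # Returns a cached token if one exists, otherwise fetches a new one
  # and stores it so concurrent requests share the same session.
  def token_for(user_id)
    key = "mhv_sm_session:#{user_id}"
    cached = @redis.get(key)
    return cached if cached

    token = fetch_new_session_token(user_id)   # placeholder for the real MHV session call
    @redis.set(key, token, ex: TTL_SECONDS)
    token
  end
end
```

Even with a cache, two truly simultaneous first requests for the same user could still race to create the token, so some locking or single-flight behavior may also be worth considering.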
Found error in our logs:
ERROR -- : /srv/platform-api/src/lib/sm/client.rb:71:in `process_response_or_error'
/srv/platform-api/src/lib/sm/client.rb:56:in `perform'
/srv/platform-api/src/lib/sm/api/sessions.rb:6:in `get_session'
/srv/platform-api/src/lib/sm/client.rb:47:in `authenticate'
/srv/platform-api/src/app/controllers/sm_controller.rb:20:in `authenticate_client'
/srv/platform-api/src/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.7.1/lib/active_support/callbacks.rb:432:in `block in make_lambda'
I have a theory as to what might have caused this. Previously, for messages belonging to a folder, we were making two requests in rapid succession: one to fetch the folder's attributes to get the total message count, and a second to iterate through the pages if the count is > 250 (the maximum returned in a single request).
This PR addresses that with a single request, looping through further pages only if the call for the first page (and subsequent pages) returns the maximum allowable per page: https://github.com/department-of-veterans-affairs/vets-api/pull/193
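Roughly, the new fetch loop looks something like this (a sketch under the assumptions above; `fetch_page` and `MAX_PER_PAGE` are stand-ins, not the actual method names in the PR):

```ruby
# Sketch of fetching all messages in a folder with one initial request,
# continuing only while a page comes back completely full.
MAX_PER_PAGE = 250   # maximum MHV returns in a single request

def all_messages(folder_id)
  messages = []
  page = 1

  loop do
    batch = fetch_page(folder_id, page: page, per_page: MAX_PER_PAGE)  # stand-in for the real client call
    messages.concat(batch)

    # A short page means we've reached the end; no separate count request needed.
    break if batch.size < MAX_PER_PAGE

    page += 1
  end

  messages
end
```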
My theory is that MHV caches the session data on the first request for folders, but the second request possibly hits a different app server in a load balanced pool, and the second app server maybe doesn't have a chance to properly reference the newly created session. This is only a theory of course, but could explain the Authentication error.
Kam's PR referenced above was just deployed. Eugene, if (when) you get this error again, please post the timestamp here so I can check the logs again.
Been a while, but just got one today.
GET /api/v0/messaging/health/folders/0
Date: Thu, 06 Oct 2016 20:47:23 GMT
So these failures have slowed down a lot since @saneshark put in #193 where we started making fewer requests for the same user at the same time. I have a strong feeling that caching the session token will help alleviate the remaining issues. Going to leave open until we get caching working so that once it is, we can track if this happens again.
I've been getting this a lot more today. Here's another time stamp.
GET /api/v0/messaging/health/folders/-2/messages?page=2
Date: Tue, 11 Oct 2016 20:37:29 GMT
Error from that call is:
I, [2016-10-11T20:37:28.546118 #2007] INFO -- : Started GET "/v0/messaging/health/folders/-2/messages?page=2" at 2016-10-11 20:37:28 +0000
I, [2016-10-11T20:37:28.548733 #2007] INFO -- : Processing by V0::MessagesController#index as JSON
I, [2016-10-11T20:37:28.548780 #2007] INFO -- : Parameters: {"page"=>"2", "folder_id"=>"-2"}
E, [2016-10-11T20:37:28.915058 #2007] ERROR -- : Authentication Service Error.
E, [2016-10-11T20:37:28.915237 #2007] ERROR -- : /srv/platform-api/src/lib/sm/client.rb:85:in `process_other'
/srv/platform-api/src/lib/sm/client.rb:62:in `process_response_or_error'
/srv/platform-api/src/lib/sm/client.rb:58:in `perform'
/srv/platform-api/src/lib/sm/api/sessions.rb:6:in `get_session'
/srv/platform-api/src/lib/sm/client.rb:49:in `authenticate'
/srv/platform-api/src/app/controllers/sm_controller.rb:20:in `authenticate_client'
/srv/platform-api/src/vendor/bundle/ruby/2.3/gems/activesupport-4.2.7.1/lib/active_support/callbacks.rb:432:in `block in make_lambda'
It looks like there is also a call to /v0/messaging/health/folders/-2 at almost exactly the same time that succeeds:
I, [2016-10-11T20:37:28.445572 #1996] INFO -- : Started GET "/v0/messaging/health/folders/-2" at 2016-10-11 20:37:28 +0000
I, [2016-10-11T20:37:28.447061 #1996] INFO -- : Processing by V0::FoldersController#show as JSON
I, [2016-10-11T20:37:28.447104 #1996] INFO -- : Parameters: {"id"=>"-2"}
I, [2016-10-11T20:37:29.188828 #1996] INFO -- : [active_model_serializers] Rendered FolderSerializer with ActiveModelSerializers::Adapter::JsonApi (0.68ms)
I, [2016-10-11T20:37:29.188988 #1996] INFO -- : Completed 200 OK in 742ms (Views: 1.1ms | ActiveRecord: 0.0ms)
This second call pasted above comes later in the logs than the first one, but the timestamp for when it begins is actually earlier. Strange; maybe that's because it finishes after the first one pasted above.
@U-DON which page was this on and what was that second call for?
I'm still pretty convinced that this is happening because we're not caching on our side (which @saneshark has on his radar to work on this week) and that that is causing a race condition.
@ayaleloehr This came up when I was navigating through the pages of one of the folders. We make two calls when we access a folder: one for the folder itself and another for its messages.
@ayaleloehr What do you think about labeling this as veterans-day, so we can be sure to track nailing down the race condition issue before launch?
We should be able to get better testing around this now that we're doing session caching.
@ayaleloehr Still occasionally getting this. Much rarer than before, but it still shows up from time to time. Here's a time stamp:
/api/v0/messaging/health/folders/-1/messages
Date: Fri, 21 Oct 2016 00:33:16 GMT
Curious if this is still happening since the switch to using the userToken. Will leave this open for a few more days to see if it happens again.
Willing to bet that adding some minor caching as a future PR post-Veterans Day will finally nip this one in the bud. Depending on the number of messages in a folder, there's a lot of parsing / processing that needs to take place, and that's assuming only a single request is necessary (it could take longer with 250+ messages in a folder). #342
@saneshark @ayaleloehr Anything left to do here? (we are trying to clean up issues)
I think this is working better. Someone can reopen if they disagree.