department-of-veterans-affairs / vets-api

API powering VA.gov
https://api.va.gov

Secure Messaging API Returns Internal Server Error on Staging #103

Closed: U-DON closed this issue 7 years ago

U-DON commented 8 years ago

Opening an issue to track when the secure messaging API returns internal server errors (code 500) on staging.

Link for convenience: https://staging.vets.gov/api/v0/messaging/health/folders.

Occurred again at the time of submitting this issue.

ayaleloehr commented 8 years ago

Thanks for opening this, @U-DON. Right now, when trying to get a session from MHV, I get:

<Error><developerMessage>Severity[Error]:database.exception;</developerMessage><errorCode>901</errorCode><message>Authentication Service Error</message></Error>

Have sent an email and will follow up as soon as I hear back. I guess the Heroku API is the one to use in the meantime. Sorry about this; so frustrating!

mphprogrammer commented 8 years ago

This is working for me:

{"data":[{"id":"0","type":"va_healthcare_messaging_folders","attributes":{"folder_id":0,"name":"Inbox","count":2,"unread_count":0,"system_folder":true},"links":{"self":"https://staging.vets.gov/api/v0/messaging/health/folders/0"}},{"id":"-2","type":"va_healthcare_messaging_folders","attributes":{"folder_id":-2,"name":"Drafts","count":31,"unread_count":28,"system_folder":true},"links":{"self":"https://staging.vets.gov/api/v0/messaging/health/folders/-2"}}, ... and so on.

U-DON commented 8 years ago

Occasionally, I'll get an internal server error on either a GET for folder messages or a GET for thread messages, while the other works just fine, even though both requests are made on the same page load. Don't know what to make of it. I refresh until they both work, but it usually doesn't take more than one refresh.

ayaleloehr commented 8 years ago

Going to keep this open but move it to the backlog. If anyone sees the API returning 500s again, please post here so we can keep track of these!

ayaleloehr commented 8 years ago

@U-DON how has the SM API been this week? Still flaky? More stable?

U-DON commented 8 years ago

@ayaleloehr Looks like it didn't go down for any extended period of time where we would get 500s, but it has been giving a lot of intermittent 400s lately. I often have to do multiple refreshes to get all the requests on a single page load to return correctly.

{
    "errors": [
        {
            "code": "900",
            "detail": "Authentication service error",
            "status": "400",
            "title": "Operation failed"
        }
    ]
}
ayaleloehr commented 8 years ago

Very interesting. The API documentation says 900 is "Mailbox Service Error" and 901 is "Authentication Service Error", but from the above, 900 is clearly returning "Authentication Service Error". Strange.

Will follow up with Ken about why this is mislabeled, which error it actually is, and why it is happening. Thanks @U-DON

ayaleloehr commented 8 years ago

@U-DON: the MHV API team just wrote back saying they think this was an issue with one of their instances behind the load balancer not functioning properly. They restarted the machines, which they hope fixes it. Can you please let me know if you get these intermittent errors again? And if so, having a timestamp of when it happened would be great so MHV can check their logs. Thanks!

saneshark commented 8 years ago

@ayaleloehr that 900 is a code we're setting on our server side; it references the title "Operation failed" and has nothing to do with the codes MHV uses. The overlap is strictly coincidental. The detail is the only part that will be dynamic for this particular error type, and it relays back whatever specific client response resulted in the failure.

U-DON commented 8 years ago

Got the error again just now.

Request Headers

GET /api/v0/messaging/health/folders/0/messages HTTP/1.1
Host: localhost:3000
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
x-key-inflection: camel
origin: http://localhost:3000
Referer: http://localhost:3000/messaging/folder/0
Connection: keep-alive
If-None-Match: W/"22d5f18eec4837c3ebc34e5b76de226a"
Cache-Control: max-age=0

Response Headers

HTTP/1.1 400 Bad Request
X-Powered-By: Express
Server: nginx/1.8.1
Date: Tue, 27 Sep 2016 15:24:05 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 107
Connection: close
access-control-allow-headers: Authorization
Cache-Control: no-cache
X-Git-SHA: 5ce0a67fc80dd02ea010fca0a433d940fbf854e0
X-GitHub-Repository: https://github.com/department-of-veterans-affairs/vets-api
X-Request-Id: 765f4c58-cd26-4d7c-a578-9f84f95eb843
X-Runtime: 0.458725
access-control-allow-credentials: true

Response

{"errors":[{"title":"Operation failed","detail":"Authentication service error","code":"900","status":400}]}
saneshark commented 8 years ago

Anything that is "Operation failed" is coming more or less directly from MHV, so this could suggest that the service is temporarily down. I know they are doing QA on their end; it would be great if we could get on an email list that notifies us when they are doing deployments that could result in downtime on these systems.

U-DON commented 8 years ago

It worked for one request but not for another, although both were made on the same page load. And both worked on the previous page load. Unless certain parts of the service can go down independently.

Here's another response from a failed request if it's helpful for them to inspect. This failed for getting threads, while getting all folders worked.

Request

GET /api/v0/messaging/health/messages/621407/thread HTTP/1.1
Host: localhost:3000
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
x-key-inflection: camel
origin: http://localhost:3000
Referer: http://localhost:3000/messaging/thread/621407
Connection: keep-alive
If-None-Match: W/"01a01dc1e7565e64f2c718b7df43ba48"

Response

HTTP/1.1 400 Bad Request
X-Powered-By: Express
Date: Tue, 27 Sep 2016 15:37:08 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 107
Connection: close
access-control-allow-headers: Authorization
Cache-Control: no-cache
X-Git-SHA: 5ce0a67fc80dd02ea010fca0a433d940fbf854e0
X-GitHub-Repository: https://github.com/department-of-veterans-affairs/vets-api
X-Request-Id: d48ecc1e-7ee1-4cab-bc6f-9bcef8aa33f3
X-Runtime: 0.389722
access-control-allow-credentials: true
Set-Cookie: TS018eb766=01c8917e48c10f4cc83d3ef3b9de1a13f486f4994dee727358122049d6109870261ecbb402; Max-Age=900; Path=/
{"errors":[{"title":"Operation failed","detail":"Authentication service error","code":"900","status":400}]}
saneshark commented 8 years ago

This could be related to NTP as well. Maybe the time for authentication is out of sync and that causes intermittent issues? @jkassemi any thoughts?

jkassemi commented 8 years ago

Yeah, that's possible, but can't say for sure. Sounds like it's still one instance in their pool causing the trouble.

Can you guys run a script that attempts a valid request every 5 seconds or so and records the timestamp and response? A history of failures would help them debug.

jkassemi commented 8 years ago

Sending along a parameter with a unique string value with each request from that script would help with grepping logs, too.
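
A minimal sketch of what such a probe could look like, in Ruby. The endpoint is the one from this thread, but the probe_id parameter name is an illustrative assumption, and a real probe would also need a signed-in session:

# probe.rb: a rough sketch of the suggested probe, with no authentication.
# Hits the folders endpoint every 5 seconds and logs timestamp, probe ID,
# and HTTP status so failures can be correlated with MHV's server logs.
require 'net/http'
require 'securerandom'
require 'time'
require 'uri'

BASE = 'https://staging.vets.gov/api/v0/messaging/health/folders'

loop do
  probe_id = SecureRandom.uuid # unique value to grep for in server logs
  response = Net::HTTP.get_response(URI("#{BASE}?probe_id=#{probe_id}"))
  puts "#{Time.now.utc.iso8601} probe_id=#{probe_id} status=#{response.code}"
  puts response.body unless response.is_a?(Net::HTTPSuccess)
  sleep 5
end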

ayaleloehr commented 8 years ago

@U-DON I didn't follow up with MHV on this yesterday (sorry about that!). Can you post the timestamps here for a couple of the times this happens today, and I'll follow up with MHV so they can look at their logs for those times? Thank you!

U-DON commented 8 years ago

Sure. Here's one:

GET /api/v0/messaging/health/folders/-2/messages HTTP/1.1
Date: Thu, 29 Sep 2016 15:47:00 GMT
ayaleloehr commented 8 years ago

After multiple email exchanges with MHV I believe we are actually creating a race condition in our API that is causing this. Let me try to write up what I think is happening.

The inbox page on vets.gov SM needs to make two calls on page load: /folders to get the folder list and /folders/:id/messages to get the messages in the inbox. Here is what I believe is happening when this fails:

1. /folders calls the MHV SM API to get a session token. MHV returns Session Token A.
2. /folders/:id/messages calls the MHV SM API to get a session token. MHV returns Session Token B, which invalidates Session Token A.
3. /folders uses Session Token A to request the folder list. Session Token A is invalid, so a "900 Authentication service error" is returned.
4. /folders/:id/messages uses Session Token B to request the messages in the inbox. Session Token B is valid, and this call returns successfully.

I think caching session tokens, which it definitely seems like we should be doing anyway, might make this issue go away. Thoughts on what it would take to start doing that, @saneshark?

ayaleloehr commented 8 years ago

I'm struggling to replicate this with curl, unfortunately, so it's possible my logic above isn't actually correct about the invalid session tokens (especially because those should return a 109 error, not a 900).

BUT I still think this is a race condition we're causing on our side, AND we should be caching session tokens and checking Redis for an existing one before calling to get a new one anyway. I think we should implement that as soon as we can to see if it fixes this problem.

Final thought: caching session tokens should also greatly reduce query time. Right now we do 2 round trips to MHV for every API call (one to get a session token and one to make the real call). Reducing this to 1 round trip in the average case (we will usually have a cached token) means query times roughly drop in half, which would be huge! And the MHV team has also asked us to cache tokens so we don't overload their system, so that's a final reason we need to do so. :)
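
For illustration, a minimal sketch of the kind of Redis-backed token cache being proposed here. The class name, key format, and 15-minute TTL are assumptions, not the actual vets-api implementation:

require 'redis'

# Sketch of a per-user MHV session token cache. Assumed: tokens stay valid
# for roughly 15 minutes, and one token per user is sufficient.
class SessionTokenCache
  TTL_SECONDS = 15 * 60

  def initialize(redis: Redis.new)
    @redis = redis
  end

  # Return the cached token, or fetch a new one from MHV and cache it.
  # Average case becomes one round trip to MHV instead of two.
  def token_for(user_id)
    key = "mhv_sm_session:#{user_id}"
    @redis.get(key) || begin
      token = fetch_new_token(user_id) # the existing get_session call would go here
      @redis.setex(key, TTL_SECONDS, token)
      token
    end
  end

  private

  # Placeholder for the real call in lib/sm/api/sessions.rb.
  def fetch_new_token(user_id)
    raise NotImplementedError
  end
end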

ayaleloehr commented 8 years ago

Found error in our logs:

ERROR -- : /srv/platform-api/src/lib/sm/client.rb:71:in `process_response_or_error'
/srv/platform-api/src/lib/sm/client.rb:56:in `perform'
/srv/platform-api/src/lib/sm/api/sessions.rb:6:in `get_session'
/srv/platform-api/src/lib/sm/client.rb:47:in `authenticate'
/srv/platform-api/src/app/controllers/sm_controller.rb:20:in `authenticate_client'
/srv/platform-api/src/vendor/bundle/ruby/2.3.0/gems/activesupport-4.2.7.1/lib/active_support/callbacks.rb:432:in `block in make_lambda'
saneshark commented 8 years ago

I have a theory as to what might have caused this. Previously, for messages belonging to a folder, we were making two requests in rapid succession: one to fetch the folder's attributes to get the total message count, and a second to iterate through the pages if the count is > 250 (the maximum returned in a single request).

This PR addresses that by starting with a single request and looping through subsequent pages only while a page returns the maximum allowable per page: https://github.com/department-of-veterans-affairs/vets-api/pull/193

My theory is that MHV caches the session data on the first request for folders, but the second request possibly hits a different app server in a load-balanced pool, and that second app server may not have had a chance to properly reference the newly created session. This is only a theory, of course, but it could explain the authentication error.
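
For reference, a sketch of the single-pass pagination described above. The client method name get_folder_messages and the loop shape are illustrative, not the actual code from the PR:

MAX_PER_PAGE = 250 # the maximum number of messages MHV returns per request

# Fetch every message in a folder with one initial request, asking for the
# next page only while a full page comes back (a short page means we're done).
def all_messages(client, folder_id)
  messages = []
  page = 1
  loop do
    batch = client.get_folder_messages(folder_id, page: page) # illustrative call
    messages.concat(batch)
    break if batch.size < MAX_PER_PAGE
    page += 1
  end
  messages
end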

ayaleloehr commented 8 years ago

Kam's PR referenced above was just deployed. Eugene, if (when) you get this error again, please post the timestamp here so I can check the logs again.

U-DON commented 8 years ago

Been a while, but just got one today.

GET /api/v0/messaging/health/folders/0
Date: Thu, 06 Oct 2016 20:47:23 GMT
ayaleloehr commented 8 years ago

So these failures have slowed down a lot since @saneshark put in #193, where we started making fewer requests for the same user at the same time. I have a strong feeling that caching the session token will help alleviate the remaining issues. Going to leave this open until we get caching working so that, once it is, we can track whether this happens again.

U-DON commented 8 years ago

I've been getting this a lot more today. Here's another time stamp.

GET /api/v0/messaging/health/folders/-2/messages?page=2
Date: Tue, 11 Oct 2016 20:37:29 GMT
ayaleloehr commented 8 years ago

Error from that call is:

I, [2016-10-11T20:37:28.546118 #2007] INFO -- : Started GET "/v0/messaging/health/folders/-2/messages?page=2" at 2016-10-11 20:37:28 +0000
I, [2016-10-11T20:37:28.548733 #2007] INFO -- : Processing by V0::MessagesController#index as JSON
I, [2016-10-11T20:37:28.548780 #2007] INFO -- : Parameters: {"page"=>"2", "folder_id"=>"-2"}
E, [2016-10-11T20:37:28.915058 #2007] ERROR -- : Authentication Service Error.
E, [2016-10-11T20:37:28.915237 #2007] ERROR -- : /srv/platform-api/src/lib/sm/client.rb:85:in `process_other'
/srv/platform-api/src/lib/sm/client.rb:62:in `process_response_or_error'
/srv/platform-api/src/lib/sm/client.rb:58:in `perform'
/srv/platform-api/src/lib/sm/api/sessions.rb:6:in `get_session'
/srv/platform-api/src/lib/sm/client.rb:49:in `authenticate'
/srv/platform-api/src/app/controllers/sm_controller.rb:20:in `authenticate_client'
/srv/platform-api/src/vendor/bundle/ruby/2.3/gems/activesupport-4.2.7.1/lib/active_support/callbacks.rb:432:in `block in make_lambda'

It looks like there is also a call to /v0/messaging/health/folders/-2 at almost exactly the same time that succeeds:

I, [2016-10-11T20:37:28.445572 #1996] INFO -- : Started GET "/v0/messaging/health/folders/-2" at 2016-10-11 20:37:28 +0000
I, [2016-10-11T20:37:28.447061 #1996] INFO -- : Processing by V0::FoldersController#show as JSON
I, [2016-10-11T20:37:28.447104 #1996] INFO -- : Parameters: {"id"=>"-2"}
I, [2016-10-11T20:37:29.188828 #1996] INFO -- : [active_model_serializers] Rendered FolderSerializer with ActiveModelSerializers::Adapter::JsonApi (0.68ms)
I, [2016-10-11T20:37:29.188988 #1996] INFO -- : Completed 200 OK in 742ms (Views: 1.1ms | ActiveRecord: 0.0ms)

This second call pasted above comes later in the logs than the first one, but the timestamp for when it begins is actually before the first. Strange; presumably that's because it finishes after the first one pasted above.

@U-DON which page was this on and what was that second call for?

I'm still pretty convinced that this is happening because we're not caching on our side (which @saneshark has on his radar to work on this week) and that that is causing a race condition.

U-DON commented 8 years ago

@ayaleloehr This came up when I was navigating through the pages of one of the folders. We make two calls when we access a folder: one for the folder itself and another for its messages.

jcmeloni-zz commented 8 years ago

@ayaleloehr What do you think about labeling this as veterans-day, so we can be sure to track nailing down the race condition issue before launch?

saneshark commented 8 years ago

We should be able to get better testing around this now that we're doing session caching.

U-DON commented 8 years ago

@ayaleloehr Still occasionally getting this. Much rarer than before, but it still shows up from time to time. Here's a time stamp:

/api/v0/messaging/health/folders/-1/messages
Date: Fri, 21 Oct 2016 00:33:16 GMT
ayaleloehr commented 8 years ago

Curious whether this is still happening since the switch to using the userToken. Will leave this open for a few more days to see if it happens again.

saneshark commented 8 years ago

Willing to bet that adding some minor caching as a future PR post-Veterans Day will finally nip this one in the bud. Depending on the number of messages in a folder, there is a lot of parsing/processing that needs to take place, and that's assuming only a single request is necessary (it could take longer with 250+ messages in a folder). #342

jcmeloni-zz commented 7 years ago

@saneshark @ayaleloehr Anything left to do here? (we are trying to clean up issues)

ayaleloehr commented 7 years ago

I think this is working better. Someone can reopen if they disagree.