DoSomething / legacy-website

:moyai: The DoSomething.org legacy website.
https://www.dosomething.org/
MIT License
50 stars 22 forks source link

Some accounts are unable to log in due to inability to read auth code response. #7304

Closed DFurnes closed 7 years ago

DFurnes commented 7 years ago

BUG

Current Behavior

Some (hopefully small) subset of users are seeing a 404 page when Northstar redirects them back to Phoenix as part of the authorization code flow. Here's an example of a URL that 404s:

https://www.dosomething.org/us/openid-connect/northstar?code=U91hsBpSrlLVvr4afvUl1pJu98K8xhUwSrun3Jz5tj4GhW2ESODPR6Pnp2skIBeixd0QRwv8aTAWPIEn8a2hGD12wvE%2FqJdUENdhr1L%2B6Mbwhjb7HNepHSMrFmrsA6ISt5fHwMIK7hcjVrS42ZiNZOVarpO1g9Lz7ObZOfNLKnCfQF08ICsSPZG%2FyRgjkbsMKiArVvX0xzwvc61tjSQ1lVChjaKxc7yNSVM6ASwnVrx9rTf1Q7%2FosW75ei4L%2F98w8sPA0i2SbqnnDJJFEhdj8JftEF3do2Qp1rPq4fg%2BpxYN%2Fcg5uq1fze%2FEUmVv9k16np%2BFmc0bTrK0raMF0fXvwnLVWKGMImdmqLliIaBn%2FCN0CsMZcYeNHofC8jG5w10NDc%2FqHE71JvB24vYQJXwJt8kwLFf%2B4V2D35%2BVnhWQPcNLYG1hhNvfTKz1BmGPb59Ik5qE53HOmCDWMhKFEvrazACNcd05GFGeGFpfAe%2FFalRtbsFKigWqWafmJKcREAAcXxnOkzawhWGyOtN5%2BDdVtHuSBBD5Z6KwbD3E472JtgrPaF%2Bd%2F2laboxUbfrRyVTWEx0LRQGxAjn%2BZjxJyk2%2BX%2B1r5OdjshbIG5ycAkZJwDQFZ5Ghnn1RQqBenxu%2Bso8vo0vwOYWdfiaDLy4yUgkOEnLtbSRve8LDS2MZmeVa76s%3D&state=f0d97e3c28ee80e6bbc2940d4d3fc0af

For comparison, a working callback URL from my last login looks like this:

https://www.dosomething.org/us/openid-connect/northstar?code=U91hsBpSrlLVvr4afvUl1pJu98K8xhUwSrun3Jz5tj4GhW2ESODPR6Pnp2skIBeixd0QRwv8aTAWPIEn8a2hGD12wvE%2FqJdUENdhr1L%2B6Mbwhjb7HNepHSMrFmrsA6ISt5fHwMIK7hcjVrS42ZiNZOVarpO1g9Lz7ObZOfNLKnApPxopcJM7aUFlBXhyQPmEzRVZtzQs9J7MLSyb0VNhotsYf2Uulr4gtEtcxJr%2FpGqy8l9wWpDb0Nz%2FLbR8HxCG7TQ51%2FwooeuXHcLW40lwasIH5jVNTgYn07zR3JgG82GLVXRlkA26S0xP8bXZdotkBaO5pgh9poeMjV5qjZzAPp0uOW9jsTHgmEwiRZudhSQueTv3dhM6gRTXwQAN2vpOlrKB9pCFMfsHYhZtJuPY%2BvJ8%2Bcaz1jhqbnS9GovEtua8rx3qFC%2BUMLZeVZeAlYAJ%2Fz9jysElOcbyhGymI5WdI8bRvHHNXe6yuOPDd4dzcje6TOftodHTRdfskRxVIHzGXxnOkzawhWGyOtN5%2BDdVtHuSBBD5Z6KwbD3E472JtgrPaF%2Bd%2F2laboxUbfrRyVTWEx0LRQGxAjn%2BZjxJyk2%2BX%2B1r5OdjshbIG5ycAkZJwDQFZ5Ghnn1RQqBenxu%2Bso8vo0vwOYWdfiaDLy4yUgkOEnLtbSRve8LDS2MZmeVa76s%3D&state=d09a991bc59998851e027f321a1b2f9d

Desired Behavior

Users should be able to log in successfully, or a more useful error should be shown.

Steps to Replicate

@jessleenyc has been able to reliably replicate the problem when logging in to this intern's account on the production website: 584b130fa0bfad58c2231c02.

Why This Matters

Users can't log in, and are not provided with a useful error message.

Countries Affected (optional)

…

OS/Browser (optional)

…

Relevant Screenshots + Links

DFurnes commented 7 years ago

It could be refusing to allow the user to log in if their state token isn't correct and they fail the access check. Maybe there's an issue where this user's session is getting lost during the redirect? Perhaps if they get routed to a different webhead?

DFurnes commented 7 years ago

I just tried manually using an authorization code from a recent 404ing URL, and Northstar says the code is revoked. That seems to indicate that either the returned code never worked or was already used by Phoenix by the time the page 404ed... πŸ’£

DFurnes commented 7 years ago

Just confirmed that this is specifically a Phoenix issue, Sarah was able to log into Aurora just fine. πŸ’§

DFurnes commented 7 years ago

Looking at the newly added logging when trying to log in to Sarah's account… nothing looks out of place on the Northstar end & no unexpected exceptions are triggered:

[2017-02-08 19:50:36] production.DEBUG: created new auth code {"code":"36acf8165470dd3d20084e7cec535e59d0ce674f5d070f7c42b3c8f47b013d73e5f014fd40fa4029","user_id":"584b130fa0bfad58c2231c02","client_id":"phoenix-oauth"} 
[2017-02-08 19:50:36] production.DEBUG: checked auth code {"code":"36acf8165470dd3d20084e7cec535e59d0ce674f5d070f7c42b3c8f47b013d73e5f014fd40fa4029","exists":true} 
[2017-02-08 19:50:36] production.DEBUG: revoked auth code {"code":"36acf8165470dd3d20084e7cec535e59d0ce674f5d070f7c42b3c8f47b013d73e5f014fd40fa4029"} 
DFurnes commented 7 years ago

More context from the Trello card:

Sarah's account is fixed now, from changing the email on her old Phoenix UID and re-creating a new one via the normal OpenID Connect flow. I need to do a little more looking but I think that should help figure out the issue for the other affected accounts.

So I don't think it's related to a bad auth code… it seems like there was some sort of issue resolving the existing Drupal ID and saving Northstar fields to it (and hence why deleting the old user was the fix). Hopefully the updates in #7320 do the trick.

DFurnes commented 7 years ago

This is unfortunately still happening. I just did a audit of places where we'd end up at 404 page:

  1. openid_connect_redirect_access – if the state provided in query string doesn't match session
  2. openid_connet_redirect_page:34 – if client doesn't exist, or ?code is not provided in URL
  3. openid_connect_redirect_page:57 – if couldn't get access token, will call drupal_goto('user') which would just bring you to the "login" page...
  4. openid_connect_redirect_page:59 – If anything is wrong with user info response, it'll show the "Logging in with Northstar could not be completed due to an error." message.

It still seems like the most reasonable explanation is that the state token doesn't match, but I'm having a hard time figuring out how that'd ever happen to a well-intentioned user.

DFurnes commented 7 years ago

πŸ› As a member, I want to be able to log into DS so that I can do things that authenticated users can do πŸ›

DFurnes commented 7 years ago

I've spot checked each of the accounts that Hannah had flagged with this issue in Zendesk. All of them have active accounts & correctly synced data for email, northstar_id, and drupal_id in both Northstar and Phoenix.

A little interesting: one account shows user_registration_source as ` (empty!) and [another](https://www.dosomething.org/us/user/9033150) showsquasar-etl-node`... whaaaaaat? Super weird, but not sure it's relevant since the rest is okay.

DFurnes commented 7 years ago

I merged in a patch to the OpenID Connect module in #7341 which will allow us to see whether the issue is the state token (my current hunch!) – it'll log the expected/provided values & users will be given a flash message so we can identify the problem in support requests.

DFurnes commented 7 years ago

Some issues were reported with the "We couldn't verify this login request" message being displayed unexpectedly (when changing an image on a campaign node). This is suuuuper strange since that should only be running on the access check for the authentication page.

Noting here, but gonna assume this is a πŸ‘» for now...

DFurnes commented 7 years ago

The error message was also displayed at least once during a Ghost Inspector login test. Neither me nor @ngjo were able to reproduce, and the same issue hasn't occurred since...

ngjo commented 7 years ago

@DFurnes For the issues that were reported when changing the image on the campaign node, it continued to happen for certain folks but not me. I don't know if it's at all related, but here's the profiles:

Anyways, Chris's profile (he got the "we couldn't verify this login request"): https://www.dosomething.org/us/user/3984224/edit

Calvin (also got the message): https://www.dosomething.org/us/user/571235/edit

Mine (I didn't get that message and was able to change the image): https://www.dosomething.org/us/user/3912425/edit

ngjo commented 7 years ago

@DFurnes We got another person who said they can't log into their account: UID - 1249515

This is what they're seeing after they try to login

screen shot 2017-04-18 at 6 07 10 pm

ngjo commented 7 years ago

@DFurnes same user UID - 1249515 is trying to login and this is the URL they're getting redirected to:

https://www.dosomething.org/us/openid-connect/northstar?code=U91hsBpSrlLVvr4afvUl1pJu98K8xhUwSrun3Jz5tj4GhW2ESODPR6Pnp2skIBeixd0QRwv8aTAWPIEn8a2hGD12wvE%2FqJdUENdhr1L%2B6Mbwhjb7HNepHSMrFmrsA6ISt5fHwMIK7hcjVrS42ZiNZOVarpO1g9Lz7ObZOfNLKnAOfsi2kbvXtUNJYKXXOZQr95g9%2B%2BfIpg9grJz9tCB9P8M60FwF30tWU2cav45H22gtQzg0zWh2%2FCBvtYY5W56FdExgvgyl3u%2B7aGsMKq2FlBxpIPs2mQcZ%2F0%2FGfbbNwR4wDR5Yv%2Bu09pc4TOvl02HLbmSFrOuADGhMscGEqHVrAFhCe8RBbu2Iew77t4RPRu7QDzgXDTzLQj%2F%2BtBIxc5ha8mSDnx656VarUImuvwmWjDrXZH%2FTWIB9YA6CwcZy3pHKgiIjVgF%2B94TElNdserc0SzUTOt%2BrweBH4oQeHM%2Faj8hrFyDLBh0SjY2q%2FiEu%2FkfeQE9YMgxhDsU%2Bb4ZKgTveXxnOkzawhWGyOtN5%2BDdVtHuSBBD5Z6KwbD3E472JtgrPaF%2Bd%2F2laboxUbfrRyVTWEx0LRQGxAjn%2BZjxJyk2%2BX%2B1r5OdjshbIG5ycAkZJwDQFZ5Ghnn1RQqBenxu%2Bso8vo0vwOYWdfiaDLy4yUgkOEnLtbSRve8LDS2MZmeVa76s%3D&state=43d93f14583ad0620df7524cb09d00c0

with this error message in purple "We couldn't verify this login request. Please try again!"

The member went on to say that they couldn't even get to the login screen, they click the "login" button and then they immediately get that error message.

ngjo commented 7 years ago

And separately, documenting the issue w/ Gladiator login that I mentioned at sprint planning this morning.

Last week sometime, I was trying to log into Gladiator - I've previously been able to do this - and I couldn't login. Shae checked out my privileges in Aurora and it seemed like I wasn't an admin anymore. Not sure how/when this happened.

A separate issue that came up for Hillary as she was trying to log into Gladiator was that she had to log out of northstar and then log back in (Shae provided these instructions for Hillary: Hit https://northstar.dosomething.org/logout and then reload gladiator.dosomething.org and login again)

Shae also added that she usually run into it when logging in and out of different apps that use northstar login

DFurnes commented 7 years ago

Some stream-of-consciousness debugging for this issue:

same user UID - 1249515 is trying to login and this is the URL they're getting redirected to:

Interesting, looking at the log message that we added for the state error):

The state token did not match: , 43d93f14583ad0620df7524cb09d00c0

A couple interesting things to note:

  1. This is happening in the openid_connect_redirect_access check like hypothesized above.
  2. The @expected value from $_SESSION when we show the "404 page" to the user was blank.
  3. The auth code in this response is shown as granted & correctly "used" in the Northstar logs:
[2017-04-24 18:20:48] production.DEBUG: created new auth code {"code":"dbd60d66ee7bd61b8b5c14586458ed4e651492250e63eaa1ae4344e45a763e5c33539077f9e5f1ce","user_id":"55a171c1469c64ef3d8b9c28","client_id":"phoenix-oauth"} 
[2017-04-24 18:20:48] production.DEBUG: checked auth code {"code":"dbd60d66ee7bd61b8b5c14586458ed4e651492250e63eaa1ae4344e45a763e5c33539077f9e5f1ce","exists":true} 
[2017-04-24 18:20:48] production.DEBUG: revoked auth code {"code":"dbd60d66ee7bd61b8b5c14586458ed4e651492250e63eaa1ae4344e45a763e5c33539077f9e5f1ce"}

That last part is interesting since it seems like we shouldn't ever get to the "revoke auth code" step if we've already failed in the openid_connect_redirect_access access check for that page.

This makes me think we're somehow double-redirecting to this page... the first time we go through openid_connect_redirect_page and unset the state and check/revoke the code, but then we somehow just redirect to the same page without the state value in the session.

Also not sure I'm reading this right, but there's a surprising number of entries for that code in Logstash&_a=(columns:!(_source),index:'filebeat-',interval:auto,query:(query_string:(analyze_wildcard:!t,query:%22%3Fcode%3DU91hsBpSrlLVvr4afvUl1pJu98K8xhUwSrun3Jz5tj4GhW2ESODPR6Pnp2skIBeixd0QRwv8aTAWPIEn8a2hGD12wvE%252FqJdUENdhr1L%252B6Mbwhjb7HNepHSMrFmrsA6ISt5fHwMIK7hcjVrS42ZiNZOVarpO1g9Lz7ObZOfNLKnAOfsi2kbvXtUNJYKXXOZQr95g9%252B%252BfIpg9grJz9tCB9P8M60FwF30tWU2cav45H22gtQzg0zWh2%252FCBvtYY5W56FdExgvgyl3u%252B7aGsMKq2FlBxpIPs2mQcZ%252F0%252FGfbbNwR4wDR5Yv%252Bu09pc4TOvl02HLbmSFrOuADGhMscGEqHVrAFhCe8RBbu2Iew77t4RPRu7QDzgXDTzLQj%252F%252BtBIxc5ha8mSDnx656VarUImuvwmWjDrXZH%252FTWIB9YA6CwcZy3pHKgiIjVgF%252B94TElNdserc0SzUTOt%252BrweBH4oQeHM%252Faj8hrFyDLBh0SjY2q%252FiEu%252FkfeQE9YMgxhDsU%252Bb4ZKgTveXxnOkzawhWGyOtN5%252BDdVtHuSBBD5Z6KwbD3E472JtgrPaF%252Bd%252F2laboxUbfrRyVTWEx0LRQGxAjn%252BZjxJyk2%252BX%252B1r5OdjshbIG5ycAkZJwDQFZ5Ghnn1RQqBenxu%252Bso8vo0vwOYWdfiaDLy4yUgkOEnLtbSRve8LDS2MZmeVa76s%22)),sort:!('@timestamp',desc))), whereas a test login I just made only shows 2 entries&_a=(columns:!(_source),index:'filebeat-',interval:auto,query:(query_string:(analyze_wildcard:!t,query:%22U91hsBpSrlLVvr4afvUl1pJu98K8xhUwSrun3Jz5tj4GhW2ESODPR6Pnp2skIBeixd0QRwv8aTAWPIEn8a2hGD12wvE%252FqJdUENdhr1L%252B6Mbwhjb7HNepHSMrFmrsA6ISt5fHwMIK7hcjVrS42ZiNZOVarpO1g9Lz7ObZOfNLKnBI3sgF9MhJ2imK8FNCk2buhum2ScntBwojHJUb24UiFquN8LB9IfhNWiTi1caxBdT7WF3FaxaLG2If4t8jqhfhMIjcmOLGIHqb8l0f6IqjGib%252FEIX1NdIF3IBpJPaJkRNBwAlQvIDha8DYDW%252B6yD%252B0YfyPMmbjAC%252FgeIvXxbM1BXE3er4OL9DhUq%252FayIDzB2k381bRVZOLtdwXLHOe2dGBskvZHeeAeJsddhqIR4ZVheN1WBqZFziW5gSTr0vI5PpyYs53XrHLaw22FtsAbltIESchHzPy6VQpOb9tx0zsRi3P9iiI6Wm%252FNaK2ufaYiOjFSZgob%252FHC%252FBaGTWXHDmw9XxnOkzawhWGyOtN5%252BDdVtHuSBBD5Z6KwbD3E472JtgrPaF%252Bd%252F2laboxUbfrRyVTWEx0LRQGxAjn%252BZjxJyk2%252BX%252B1r5OdjshbIG5ycAkZJwDQFZ5Ghnn1RQqBenxu%252Bso8vo0vwOYWdfiaDLy4yUgkOEnLtbSRve8LDS2MZmeVa76s%253D%22)),sort:!('@timestamp',desc))) (one of type nginx-access and one syslog).

Putting this down for now, but feels like we're getting closer...πŸ•΅οΈ

DFurnes commented 7 years ago

A ha! There's an error for that same user when updating their local profile fields: Data too long for column 'field_address_country' at row 1. It turns out this user has the full string "United States" saved in their profile, but the column in Drupal is only 2 characters long.

A few things we can do here:

  1. Find a way to make sure unhandled exceptions like this are surfaced! I see the issue now in New Relic (105 occurrences over the last month), but it's a bummer it went unsolved for so long.
  2. Add better exception handling to the OpenID Connect module so that an error when "syncing" profile data into the Drupal database doesn't stop the user from logging in entirely. (#7361)
  3. Ensure that field is better formatted on the Northstar end. (DoSomething/northstar#563)
  4. (Short Term) Fix this manually for users who are affected.

I've checked out each of other accounts that submitted help tickets (1, 2, 3, 4, 5, 6) and all of them (except one!?) have entered longer text into their country field than that column would allow in Phoenix.

DFurnes commented 7 years ago

@hbghidey Both the Phoenix & Northstar code fixes have been deployed, so this should no longer be happening to anyone! Let me know if you receive any more reports. πŸ˜„

hbghidey commented 7 years ago

@DFurnes A beautiful day! Thanks for all of your work on this. I'll let the team know.