WikiEducationFoundation / WikiEduDashboard

Wiki Education Foundation's Wikipedia course dashboard system
https://dashboard.wikiedu.org
MIT License
392 stars 630 forks source link

Student 'sign up' flow sometimes errors, sometimes loses passcode #609

Closed ragesoss closed 8 years ago

ragesoss commented 8 years ago

If a student goes through the 'sign up' flow to create a new account, the enroll passcode may not be retained when they make it all the way back through onboarding. The 'sign in' flow also seems to be inconsistent.

One time when I tested it (with an existing account that hadn't been onboarded), I ended up getting redirected like this after onboarding: https://dashboard-testing.wikiedu.org/users/auth/mediawiki/callback?oauth_verifier=&oauth_token=

That causes this type of error: http://sentry.ragesoss.com:8080/wiki-ed/production/group/1225/

I'm not sure where to start with this.

thenickcox commented 8 years ago

cc @bmathews

bmathews commented 8 years ago

It kind of relates to onboarding. If the sign up process gets weird then the enroll passcode won't be tracked up to the end (and thus won't finish the enrollment).

That stack trace doesn't really mean anything to me though, and the oauth_verifier and token do look populated in the request in that particular sentry error. If we could find a way to consistently reproduce or some more clues, then I could take a look?

thenickcox commented 8 years ago

I went to look at this and got a 503 for Sentry. Any ideas, @ragesoss?

ragesoss commented 8 years ago

@thenickcox that sentry server isn't the one we use anymore. Here's the most related entry on the new one: https://sentry.wikiedu.org/wikiedu/production/issues/28/

ragesoss commented 8 years ago

@thenickcox Starting from the top on this...

There were some other issues with onboarding that were more serious, but I think independent, that were on the Wikipedia end and that have since resolved. Basically, new accounts were getting an OAuth error when they tried to Allow the dashboard, because of changes to Wikipedia's core auth system where CentralAuth accounts were no longer being created for new users.

The JWT decode error is one that we've seen occasionally, since forever. Although it might be Wikipedia's fault, it's something that we should build some handling for — such as instructing the user to try logging in again, and/or having them start from the same enroll link that they tried originally, if applicable.

I just restarted my home sentry server, so that first link is accessible again. It's the same issue.

thenickcox commented 8 years ago

Hm…the question will be how to stub that server response in a test so we can do that. I'll try to look into that with the remaining hours I have resourced for Wiki this week.

ragesoss commented 8 years ago

@thenickcox sounds good. This may or may not be relevant: https://github.com/WikiEducationFoundation/WikiEduDashboard/blob/master/spec/features/student_role_spec.rb#L147

thenickcox commented 8 years ago

That's super helpful! Thanks!

thenickcox commented 8 years ago

As I work through this, the difficult part is it doesn't seem that the OAuth fails, which would redirect to /auth/failure, which OAuth comes with a way to stub for. The JWT piece is getting tricky.

thenickcox commented 8 years ago

I'll talk to @mattfordham about this when we have a code review this afternoon.

thenickcox commented 8 years ago

@ragesoss You did some development on the mediawiki oauth gem. Is there any chance that the issue could be due to this? What gets me is how your oauth_token query param got lost on redirect.

Also, because we don't handle interaction with JWT, it seems like it might not be possible to catch those errors. I'm still investigating.

ragesoss commented 8 years ago

I can't think of why that would break things only intermittently. Also, the JWT error has been happening since before we updated to my fork of the gem (which is when that bit got added). I think it's been happening occasionally since forever. (The old sentry issue says 3 months, but I think there was a different one from before we upgraded the gem the previous time.)

thenickcox commented 8 years ago

Still looking. The only way I can immediately think of to handle this specially is to catch and rethrow the JWT::DecodeError from your fork, which seems hairy. What about just adding some copy to the 500 page?

thenickcox commented 8 years ago

@ragesoss In the immediate, and without digging super far into the JWT gem, I'm wondering if adding some suggestions of things users might try on the 500 page would be an acceptable solution. I was talking it over with @mattfordham, and we can definitely go down the path of making the mediawiki gem use a local version of jwt, which we can put debugging statements in, but if we handle the error separately, how different, in reality, would that page look from the current 500 page? In other words, what would that give us that adding some suggestions to the 500 page wouldn't?

thenickcox commented 8 years ago

For the record, this does happen more often than I'm comfortable with, but it'll take a lot more digging to get to the bottom of it. And I can't immediately see what wrapping this error specifically would get us, and how different our handling of that would be from adding some suggestions to the 500 page.

ragesoss commented 8 years ago

Can we catch the JWT error in our code and serve a different error message?

-Sage

On Monday, January 25, 2016, Nick Cox notifications@github.com wrote:

For the record, this does happen more often than I'm comfortable with, but it'll take a lot more digging to get to the bottom of it. And I can't immediately see what wrapping this error specifically would get us, and how different our handling of that would be from adding some suggestions to the 500 page.

— Reply to this email directly or view it on GitHub https://github.com/WikiEducationFoundation/WikiEduDashboard/issues/609#issuecomment-174750023 .

thenickcox commented 8 years ago

@ragesoss The JWT error doesn't come into our code, because it's called from the mediawiki gem. I don't think it's possible to handle it in our code. That's what I meant about catching it in your fork of the mediawiki gem and rethrowing, which would give our code the opportunity to handle it.

Edit

And what error message would you like to display? It's possible we can have some logic in the controller for it.

thenickcox commented 8 years ago

Also, do you happen to remember the steps to reproduce the one time you tested it and got redirected without a token? Did you just go through onboarding normally? And what happened when it caused the error? Did you get taken to the 500 page?

ragesoss commented 8 years ago

I created a new account on dashboard-testing, via the 'create account' flow starting from an enroll link. It did send me through onboarding, if I remember correctly.

ragesoss commented 8 years ago

I don't remember if there was a 500.

thenickcox commented 8 years ago

@ragesoss What I'm attempting now is to override the devise controller to try to wrap the error raised by ruby-jwt, but I'm not 100% sure it'll work. If we're able to do that (without adding code to your mediawiki fork to catch the JWT error), we'll have the opportunity to handle the error. How would you like to handle the error?

ragesoss commented 8 years ago

'What message to display':

It looks like there was an problem with the sign in process. (Sorry about that! It happens occasionally.) Please try logging in or enrolling again.

ragesoss commented 8 years ago

In terms of handling it, I'd like to redirect to a specific error page for that issue, and also capture the error and send it to sentry... ideally with the specific data that choked JWT.

Once we understand what exactly is going on that causes the problem, maybe we can build some better handling for it into the oauth mediawiki gem. If it represents a common failure mode with the login process (rather than a bug in our dashboard) then that's probably where it makes sense to account for it (by packaging up the info about what went wrong and raising a more helpful error, for example).

thenickcox commented 8 years ago

Reporting back, overriding the Devise controller doesn't work because it doesn't hit that until it returns from the JWT decoding and, failing that, redirects to /sign_in, which I should've known.

(Edit: What this means is that it confirmed by suspicion that we can't directly handle the JWT error in any way I know of, because it raises before it gets back to our code.)

So now our only option is to build the error handling into your mediawiki gem. Is that how you want to proceed?

ragesoss commented 8 years ago

Yeah, I think so. In the short term, could we monkey patch the oauth class from within the devise initializer, and add a rescue there?

thenickcox commented 8 years ago

@ragesoss I've been working on that, but I'm having trouble because although I'm able to catch the exception in the custom OAuth strategy class, ApplicationController isn't handling it as a special exception, because it's treating it as a 500. Still working.

ragesoss commented 8 years ago

With 978a66bca55617d4e1eda32ceb68679ead5e6c72, we handle this a little better from a user experience perspective. We have caught one instance since pushing that to production, but it didn't include the data that caused the JWT decode error because the data got scrubbed by Sentry.

I added a Rails.logger call to catch the next one: 1e1ada0f3cfca03201b3073910ca0d68587f5f39

ragesoss commented 8 years ago

Second one didn't get scrubbed:

{"error":"mwoauthdatastore-access-token-not-found","message":"No approved grant was found for that authorization token."}

That's the kind of expected error that should be handled in the strategy.

thenickcox commented 8 years ago

I'm not sure I understand what you mean.

ragesoss commented 8 years ago

@thenickcox mean by what? See the two Sentry entries since yesterday: https://sentry.wikiedu.org/wikiedu/production/issues/166/

Those two may be different failure modes, but at least the latter, which had the above error, looks like it's a Wikimedia bug: https://phabricator.wikimedia.org/T124942

The first one says "jwt_datat: [Filtered]", so it may be a different cause.

thenickcox commented 8 years ago

@ragesoss Right. Okay. Just to clarify: is there any further action that you need from me on this?

ragesoss commented 8 years ago

@thenickcox Nope. Sorry, didn't meant to signal that there was an immediate problem... I was just trying to pull together the various threads related to this, as we learn more based on yesterday's patch. (Looks like a possible fix is ready at the Wikimedia end, as well.)

ragesoss commented 8 years ago

The normal way to handle failures in OmniAuth strategies is to call the fail!; the request then gets redirected to /auth/failure. I got that kind of working — the auth failure message reaches the rails console — but then it just continues right on like nothing went wrong.

ragesoss commented 8 years ago

I adjusted the approach and moved it into the forked gem with #629

ragesoss commented 8 years ago

With 3c05b274af58a36f268c36ff706ade91cb5f2e74, I think any remaining problems are at the mediawiki end.

thenickcox commented 8 years ago

:tada: