Should get_revocation_key() use User.password as input?

aaugustin / django-sesame

"Magic Links" - URLs with authentication tokens for one-click login

https://django-sesame.readthedocs.org/

BSD 3-Clause "New" or "Revised" License

985 stars 57 forks source link

Should get_revocation_key() use User.password as input? #40

Closed tszynalski closed 4 years ago

tszynalski commented 5 years ago

Hi Aymeric,

First of all, thank you very much for creating and maintaining django-sesame and making it freely available. Fantastic piece of code that exactly solves my problem.

I've been investigating the way tokens are generated and I'm wondering if it wouldn't be more secure to use only the password hash as input to the multiple-iteration MD5 function.

https://github.com/aaugustin/django-sesame/blob/f21003da843c0ebfbb1601be9bf37994e54f9bfa/sesame/backends.py#L106

Currently get_revocation_key() uses User.password, which is of the form:

<algorithm>$<iterations>$<salt>$<hash>

Notice that the user's salt (a sensitive piece of data) is included there, unencrypted. Everything hinges on the security of PBKDF2/MD5. If that algorithm is ever reversed (I realize this is not very likely), an attacker could obtain both the salt and the hashed password. This would then make it fairly easy to dictionary-attack the password.

Now if get_revocation_key() only used the <hash> part, reversing PBKDF2/MD5 would only give you the password hash. And, of course, if you only know the password hash, but not the salt, the password is much much much harder to crack (10^20 times more work!). (In fact, it seems if you only used the hash part, you could get away without MD5 -- just include the hashed password in the token. The extra MD5-ing wouldn't make much of a difference compared to the extra 10^20 factor.)

Resistance to dictionary attacks would appear to be the same (assuming the attacker has an expired token and wants to guess the password), because the password hash is based on the password and the salt.

Of course, I know I can change MD5 to something else. But it seems so easy to eliminate an (admittedly theoretical) vulnerability. A salt is sensitive data and here it seems unnecessary to include it in the token.

Please don't hesitate to let me know if I'm missing something. I'm not a security expert, so it's quite possible (even likely) I screwed up somewhere :)

aaugustin commented 5 years ago

This is an excellent point.

The main issue with changing the hashing algorithm is that it invalidates all previously generated tokens... which is a big issue for existing users with no very good solution.

aaugustin commented 5 years ago

At least the comment should be updated to better reflect what's going on here.

aaugustin commented 5 years ago

One solution would be to add a version to the hashing schemes and let people stick with older schemes with a setting. Out of the box, they should get the newest scheme.

tszynalski commented 5 years ago

Glad you like my suggestion.

The "legacy" setting seems like a good fix. I suspect most users don't use long-lived tokens (personally, I'm only interested in short-lived login tokens), but those who do will be able to stick to the old method.

Alternatively, the token itself could indicate the version of algorithm to use, but I'm not sure the benefits would justify the added complexity.

aaugustin commented 4 years ago

To the best of my knowledge, MD5 is "broken" in the sense that reversing MD5("password chosen by a human") is doable with modern hardware. However, it's doable only because passwords chosen by humans have relatively low entropy, let's say 30 to 50 bytes.

Here, we're starting with a lot more entropy. With Django's current default hasher, PBKDF2PasswordHasher, User.password contains about 327 bits of entropy:

<algorithm> = 0 bits
<iterations> = 0 bits
<salt> = ~71 bits per the docstring of BasePasswordHasher.salt
<hash> = 256 bits

A search space of 2 ^ 327 possibilities feels large enough to be safe. Even if MD5 is very fast to compute with dedicated hardware, and thus isn't a good choice for PBKDF2 in general, I don't think it is 2 ^ 30 (1 billion) times faster than SHA256. And even if it was, we'd still have about 2 ^ 150 bytes of headroom. That's a multiple of the cost and time for a successful attack.

Thus, to the best of my understanding, and unless MD5 is more broken than I believe it is, from a practical perspective, we're safe.

For this reason, considering that the change would have a cost to users, I'm not going to make it.

I'll take this opportunity to document the SESAME_SALT, SESAME_DIGEST and SESAME_ITERATIONS settings, though.

aaugustin commented 4 years ago

PS - surely the current hashing scheme won't stand forever, so I'll need versioning at some point, but I don't think there's a compelling reason to do that yet.

tszynalski commented 4 years ago

Well, I don't think you can say it's 2^327, because if you guess the salt (71 bits), the hash only has as much entropy as the original password. You can't just add the entropies of the salt and the hash, because the hash is not independent of the salt. So I think it's closer to 71 bits + whatever the entropy of the password is (let's say 20 bits). Of course, that's still a lot of cracking, and you have to consider that you'd have to do PBKDF2/SHA256 there as well (to compute the hash), not just PBKDF2/MD5. That's about 6 times slower. All in all, plenty of work for a potential cracker... Even if it was just MD5 (not PBKDF2/MD5), a machine with 8 x GTX 1080 can only do 2^62 hashes a year.

However, all that assumes we're brute-forcing. I was talking about a theoretical scenario where someone actually finds a way to reverse MD5, so that it's no longer required to brute-force through all possible salt+hash combinations, but instead (by some mathematical trick) radically narrow down the search space. According to Wikipedia, the current best published algorithm can reduce the search space by 2^5, which is not too much. It's possible something better exists, but is kept secret.

So yeah, of course, I understand this is all quite hypothetical. I only made my suggestion because I thought it was easy to fix this theoretical vulnerability (since you don't have to use the salt as input to generate tokens, just don't use it, then it cannot be cracked).

I'm not sure I have a good grasp of what the "cost to users" is. If you only enabled the new token-generating algorithm on new installations via a config setting, it wouldn't be a problem for current users, would it? Sorry if I missed something...

Thanks again for your work on django-sesame.

aaugustin commented 4 years ago

Making the new algorithm opt-in means that no one (other than you) will use it in practice. It's interesting only if it it's opt-out, in which case most current users need a setting to keep the old algorithm.

However, since my last comment, I had an idea for changing the algorithm while preserving backwards compatibility — i.e. tell apart old and new tokens and verify them appropriately. I'm planning to work on this.

So, reopening :-)

tszynalski commented 4 years ago

Awesome! Recognizing token types is certainly the ideal solution here. Thank you for reconsidering. I'm looking forward to the update.

aaugustin commented 4 years ago

The v2 token format in #53 implements your suggestion. If you have time to take a look at the implementation, let me know your thoughts!

tszynalski commented 4 years ago

Fantastic. Thank you for releasing this. Now your package is basically perfect, as far as I'm concerned :)

As you requested, I've had a look at the new token code and I don't see any vulnerabilities. I've already installed the new release and am testing it now.

I've got a few questions:

In the documentation, you emphasize that the new tokens are shorter. In your experience, are longer tokens a problem? I ask because I plan to use some longer tokens in my app as well. Should I be worried?

Another thing I'm curious about is why last_login is in the revocation key. Couldn't one just check if the last login time is later than the timestamp of the token? If the user logged in after the token was produced, then reject the token. Am I missing something?

Finally, I came across this comment in the code:

Since we don't include the revocation key in the token, we need to fetch the user in the database before we can verify the signature.

I'm not sure I get it. I thought you always have to fetch the user from the database when verifying a token, because you have to calculate the revocation key from the password, in case the password changed since the token was produced. Sorry if I'm missing something, I'm new to Django.

Thanks again for implementing my suggestion. Great work.

aaugustin commented 4 years ago

In your experience, are longer tokens a problem?

Longer tokens in URLs are:

Vulnerable to truncation: I suspect line wrapping can truncate tokens, especially when forwarding email. Perhaps this is less of a concern than it used to be, since plain text emails are fading in favor of HTML emails, and email clients may also be getting better at not breaking links.
Ugly: you're welcome not to care about this :-)

Another thing I'm curious about is why last_login is in the revocation key.

This is only when SESAME_ONE_TIME is enabled. By default, it isn't, and the docs recommend against it. A short SESAME_MAX_AGE is better in many use cases.

Couldn't one just check if the last login time is later than the timestamp of the token?

Yes, that would work for timestamped tokens. To be honest, I didn't consider this possibility.

The current implementation uses the same code for timestamped and non-timestamped tokens (i.e. regardless of whether SESAME_MAX_AGE is set or not), which seems like a good reason for keeping it that way.

If the user logged in after the token was produced, then reject the token. Am I missing something?

I suspect you missed this only happens when SESAME_ONE_TIME = True.

I thought you always have to fetch the user from the database when verifying a token, because you have to calculate the revocation key from the password, in case the password changed since the token was produced.

Yes, I do, for valid tokens.

However, I could reject some invalid tokens earlier if I could perform the integrity check before fetching the user. Then I'd only fetch the user — a slightly expensive operation — after I know the token is legit. Tokens v1 worked that way.

For most Django applications, this isn't a concern because the login form has exactly the same behavior: if an attacker sends a request with a user identifier (PK for django-sesame, username for the login form), the app makes a database query to fetch this user, and if it's found, does some crypto with the password.

So, even though the behavior of tokens v2 is slightly worse than v1 in this regard, I think it's in line with the expectations of most Django apps in terms of exposed surface.

tszynalski commented 4 years ago

Thank you very much for your answers.

However, I could reject some invalid tokens earlier if I could perform the integrity check before fetching the user. Then I'd only fetch the user — a slightly expensive operation — after I know the token is legit. Tokens v1 worked that way.

I guess what I don't understand is how the presence of a revocation key in the token helps check token integrity. Doesn't the revocation key look like a random sequence of characters? Do you mean things like incorrect length of the revocation key?

The current implementation uses the same code for timestamped and non-timestamped tokens (i.e. regardless of whether SESAME_MAX_AGE is set or not), which seems like a good reason for keeping it that way.

I agree, if you want to optimize token length for installations with SESAME_ONE_TIME = False, then doing it your way is better because you don't have to include the timestamp in the token, which makes it shorter. I had thought SESAME_ONE_TIME = True is the usual setting -- in that case it's slightly better just to check the timestamp against the DB.

Speaking of SESAME_ONE_TIME, I'm having a bit of hard time believing that some mail providers click on URLs in emails in order to scan them for threats. The reason for my doubt is that some email newsletters include URLs that instantly unsubscribe you. I'm sure you've seen them. That would mean that those mail providers unsubscribe you from your mailing lists. Some sites that send out email updates also use those one-click unsubscribe links (Kickstarter is one example). There are also many sites that send you email verification URLs – those would also get clicked automatically.

Do you have experience of this automatic link-clicking happening, or do you advise against SESAME_ONE_TIME = True out of an abundance of caution? I'd be grateful for your input.

aaugustin commented 4 years ago

I guess what I don't understand is how the presence of a revocation key in the token helps check token integrity. Doesn't the revocation key look like a random sequence of characters? Do you mean things like incorrect length of the revocation key?

For tokens v1: django-sesame compares a hash of data found in the token (= primary key + optional timestamp + revocation key) with the signature of the token. If the signature is invalid, perhaps because an attacker is trying to forge tokens, django-sesame can reject the token without any information besides what's in the token.

For tokens v2: django-sesame compares a hash of data found in the token (= primary key + optional timestamp) + data found in the database (= the revocation key) with the signature of the token. If the signature is invalid, django-sesame needs to make a database query before it can reject the token.

I agree, if you want to optimize token length for installations with SESAME_ONE_TIME = False, then doing it your way is better because you don't have to include the timestamp in the token, which makes it shorter. I had thought SESAME_ONE_TIME = True is the usual setting -- in that case it's slightly better just to check the timestamp against the DB.

I don't believe SESAME_ONE_TIME = True to be the usual setting. It's too fragile. For example, users may click a link on mobile, the page will begin loading, then get stuck and the token will be invalidated...

Speaking of SESAME_ONE_TIME, I'm having a bit of hard time believing that some mail providers click on URLs in emails in order to scan them for threats. The reason for my doubt is that some email newsletters include URLs that instantly unsubscribe you. I'm sure you've seen them. That would mean that those mail providers unsubscribe you from your mailing lists. Some sites that send out email updates also use those one-click unsubscribe links (Kickstarter is one example). There are also many sites that send you email verification URLs – those would also get clicked automatically.

You're right. Maybe there are heuristics to avoid these problems.

Do you have experience of this automatic link-clicking happening, or do you advise against SESAME_ONE_TIME = True out of an abundance of caution? I'd be grateful for your input.

It's certainly happening. It was reported here and I was able to reproduce it reliably with my work email account: https://github.com/aaugustin/django-sesame/issues/45#issuecomment-626384640

tszynalski commented 4 years ago

For tokens v1: django-sesame compares a hash of data found in the token (= primary key + optional timestamp + revocation key) with the signature of the token.

Ah, yes, of course. In tokens v2, you can't calculate the signature from the payload. I missed that. Thanks.

I don't believe SESAME_ONE_TIME = True to be the usual setting. It's too fragile. For example, users may click a link on mobile, the page will begin loading, then get stuck and the token will be invalidated...

But the session will be created and as long as the Set-Cookie header gets correctly transmitted to the user (which is likely, since headers come first), on subsequent page loads, the user will be correctly logged in, won't they?

Still, I take your point about fragility, just not sure how big of a problem it is. Are you aware of any serious security vulnerabilities related to having multi-use login tokens? What's the attack vector? Someone taking a photo of your screen as you click the login link?

Also, I did some research and apparently Office 365 even has a mail scanner that follows links and executes JavaScript.

One idea to stop scanners from logging in (based on this post) is to check if the user has a cookie (for example, the csrf cookie, which I think is set by every response in django?). If yes, log them in immediately. If not, show a login button that has to be clicked. The idea here is that the user will have (typically) visited your site on the same browser, so they will have the cookie when they click the login link. But an email scanner won't have that cookie. At first glance, it should work for 99% of users, the other 1% will have to make an extra click.

aaugustin commented 4 years ago

But the session will be created and as long as the Set-Cookie header gets correctly transmitted to the user (which is likely, since headers come first), on subsequent page loads, the user will be correctly logged in, won't they?

On subsequent page loads in the same browser, yes.

However, if the user needs to switch to another device for any reason, or if the connection breaks before they get the response headers, they need a new token. This may or may not be something you care about, depending on what you're using django-sesame for.

Still, I take your point about fragility, just not sure how big of a problem it is. Are you aware of any serious security vulnerabilities related to having multi-use login tokens? What's the attack vector? Someone taking a photo of your screen as you click the login link?

Multi-use tokens are the default. I'm not aware of serious concerns with for typical use cases of django-sesame. That's why I recommend against single-use tokens. I suggest short lived tokens instead.

Of course, it depends on what the token protects, how it's transmitted to users, and most importantly how long it's valid — hence "short lived".

Also, I did some research and apparently Office 365 even has a mail scanner that follows links and executes JavaScript.

Exactly, that's the problem. Thanks for the link. Really useful information.

One idea to stop scanners from logging in (based on this post) is to check if the user has a cookie (for example, the csrf cookie, which I think is set by every response in django?). If yes, log them in immediately. If not, show a login button that has to be clicked. The idea here is that the user will have (typically) visited your site on the same browser, so they will have the cookie when they click the login link. But an email scanner won't have that cookie. At first glance, it should work for 99% of users, the other 1% will have to make an extra click.

This can work for some use cases e.g. a login-by-email process — every login is like a password recovery.

I don't think it generalizes well to all use cases. Quite often, the point of django-sesame is to authenticate users or devices that never interacted with the website. If they did, they'd be authenticated already. So I don't plan to build this into django-sesame.

You can easily do it with a custom middleware: if there's a sesame parameter in the query string and no csrf cookie, return a HTTP 400.

tszynalski commented 4 years ago

Quite often, the point of django-sesame is to authenticate users or devices that never interacted with the website.

That's a good point.

BTW, I've thought of another point in favor of adding a cookie check in a login-by-email scenario: Mail scanners clicking links will create pointless django sessions, which may clutter up your django_sessions table and your memcached (especially if you use long-lived sessions). The more I think about it, the more I like the cookie check + button fallback idea.

Thank you for tokens_v2 and for this discussion, that was a really good experience for me.