dropbox / zxcvbn

Low-Budget Password Strength Estimation
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler
MIT License

Computed guesses for user-input matches is oddly high #208

Open akuchling opened 7 years ago

akuchling commented 7 years ago

When I score a password with a user_inputs dictionary:

user_inputs = ['andrew', 'kuchling', 'phase', 'group']
result = zxcvbn('andrewkuchlingphasegroup', user_inputs)

This password is just four words from the user-inputs dictionary, so I'd expect the score to be fairly poor. But result.score ends up being 4, because guesses is 1000150000000 and guesses_log10 is 12.00006513928696. That seems like a really high estimate: if I multiply together the guesses values of the individual matches in result.sequence, the product is only 13625000, or log10 ~ 7.
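Roughly, the check I'm doing looks like this (a sketch against the JS API, assuming each match in result.sequence carries its own guesses estimate, which is what I see in the result object):

var result = zxcvbn('andrewkuchlingphasegroup', ['andrew', 'kuchling', 'phase', 'group']);
var product = 1;
for (var i = 0; i < result.sequence.length; i++) {
    product *= result.sequence[i].guesses;  // per-match estimate attached by the scorer
}
console.log(product, Math.log10(product));          // ~13625000, log10 ~ 7
console.log(result.guesses, result.guesses_log10);  // 1000150000000, 12.00006513928696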

Is there a bug somewhere in most_guessable_match_sequence that miscalculates the value of guesses?

akuchling commented 7 years ago

I investigated this a bit more and have begun to dimly understand the scoring function. There are two things making the score weirdly high.

1) For a dictionary match, the rank is used as the # of guesses. For a small user-inputs list the resulting ranks are small, likely smaller than MIN_SUBMATCH_GUESSES_MULTI_CHAR (currently 50), so estimate_guesses() will return MIN_SUBMATCH_GUESSES_MULTI_CHAR instead of a value between 1 and 4, which is what the ranks would be in my example.

Playing around with this, it might make sense to special-case the user-inputs dict and just use the size of the user-inputs dictionary as min_guesses. With this change, my example's score goes from 4 to 3.
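A rough sketch of what I mean, written as a standalone helper rather than a patch to estimate_guesses() (the 10/50 literals are the current MIN_SUBMATCH_GUESSES_SINGLE_CHAR / MIN_SUBMATCH_GUESSES_MULTI_CHAR values, and 'user_inputs' is the dictionary name the matcher appears to use for the per-request dictionary):

// Hypothetical min-guesses floor with the proposed user-inputs special case;
// this sketches the idea, it is not current zxcvbn behavior.
function min_guesses_for_match(match, password, user_inputs_size) {
    if (match.token.length >= password.length) return 1;
    if (match.pattern === 'dictionary' && match.dictionary_name === 'user_inputs') {
        // Floor at the size of the user-inputs dictionary (4 in my example)
        // instead of MIN_SUBMATCH_GUESSES_MULTI_CHAR.
        return user_inputs_size;
    }
    return match.token.length === 1 ? 10 : 50;
}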

2) The dominating factor then becomes MIN_GUESSES_BEFORE_GROWING_SEQUENCE, called 'the additive D penalty term' in the paper. This factor tries to model the shorter sequences an attacker will already have tried (from other password lists, brute force, or whatever), and is currently set to 10,000. This penalty is what causes the number of guesses for a password made of user-input strings to balloon. There's an _exclude_additive parameter on most_guessable_match_sequence that suppresses this penalty, but it isn't part of the public API and isn't directly controllable.
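To make the scale concrete, here's the back-of-the-envelope version of the scoring formula from the paper, guesses ~= l! * product(guesses_i) + D^(l-1), for a four-match sequence with every match floored at 50 (an assumption on my part about which sequence the optimizer picked, but it reproduces the reported total exactly, and the additive term dominates regardless of the exact product):

// D = MIN_GUESSES_BEFORE_GROWING_SEQUENCE, l = number of matches in the sequence.
var D = 10000, l = 4, floor = 50;
var factorial = 4 * 3 * 2 * 1;                          // l! = 24
var product_term = factorial * Math.pow(floor, l);      // 24 * 6250000 = 150000000
var additive_term = Math.pow(D, l - 1);                 // 10000^3 = 1e12
console.log(product_term + additive_term);              // 1000150000000
console.log(Math.log10(product_term + additive_term));  // ~12.000065, matching guesses_log10

So D^3 alone contributes 10^12 and completely swamps the per-match product.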

It's hard to know what to do here. An attacker might well try combinations of user-specific inputs first, so maybe a sequence that's entirely (match=dictionary and dictionary_name=user-input) should never be charged the D penalty term. But what do we do for a sequence like <user-input>, <2-digit-number>, <user-input>? Is that varied enough to make charging D reasonable, and how do we decide?

We could run the scoring twice -- once with D applied and once without it -- and then take the lower score, but that also doubles the computational work required.
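For what it's worth, the "score twice" version would look roughly like this (a sketch only: it assumes the compiled lib/ modules are reachable from the npm package and that the private _exclude_additive flag keeps its current meaning, neither of which is a supported API):

var matching = require('zxcvbn/lib/matching');
var scoring = require('zxcvbn/lib/scoring');

function guesses_with_and_without_d(password, user_inputs) {
    // The real entry point lowercases and stringifies user_inputs first; skipped here.
    matching.set_user_input_dictionary(user_inputs);
    var matches = matching.omnimatch(password);
    // Run the sequence optimizer twice: once charging the additive D penalty,
    // once suppressing it via the private _exclude_additive flag.
    var with_d = scoring.most_guessable_match_sequence(password, matches, false);
    var without_d = scoring.most_guessable_match_sequence(password, matches, true);
    // Mirrors the "take the lower score" idea above, at the cost of scoring twice.
    return Math.min(with_d.guesses, without_d.guesses);
}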