dropbox / zxcvbn

Low-Budget Password Strength Estimation
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler
MIT License

Computed guesses for user-input matches is oddly high #208

Open akuchling opened 7 years ago

akuchling commented 7 years ago

When I score a password with a user_inputs dictionary:

user_inputs = ['andrew', 'kuchling', 'phase', 'group']
result = zxcvbn('andrewkuchlingphasegroup', user_inputs)

This password is just four words from the user-inputs dictionary, so I'd expect the score to be fairly poor. But result.score ends up being 4, because guesses is 1000150000000 and guesses_log10 is 12.00006513928696. That seems like a really high estimate: if I multiply together the guesses values of the individual matches in result.sequence, the product is only 13625000, or log10 ~ 7.
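Roughly, the check I'm doing looks like this (a sketch against the JS API, assuming each match in result.sequence carries its own guesses estimate, which is what I see in the result object):

var result = zxcvbn('andrewkuchlingphasegroup', ['andrew', 'kuchling', 'phase', 'group']);
var product = 1;
for (var i = 0; i < result.sequence.length; i++) {
    product *= result.sequence[i].guesses;  // per-match estimate attached by the scorer
}
console.log(product, Math.log10(product));          // ~13625000, log10 ~ 7
console.log(result.guesses, result.guesses_log10);  // 1000150000000, 12.00006513928696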

Is there a bug somewhere in most_guessable_match_sequence that miscalculates the value of guesses?

akuchling commented 7 years ago

I investigated this a bit more and have begun to dimly understand the scoring function. There are two things making the score weirdly high.

1) For a dictionary match, the rank is used as the # of guesses. For a small user-inputs list the resulting ranks are small, likely smaller than MIN_SUBMATCH_GUESSES_MULTI_CHAR (currently 50), so estimate_guesses() will return MIN_SUBMATCH_GUESSES_MULTI_CHAR instead of a value between 1 and 4, which is what the ranks would be in my example.

Playing around with this, it might make sense to special-case the user-inputs dict and just use the size of the user-inputs dictionary as min_guesses. With this change, my example's score goes from 4 to 3.
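A rough sketch of what I mean, written as a standalone helper rather than a patch to estimate_guesses() (the 10/50 literals are the current MIN_SUBMATCH_GUESSES_SINGLE_CHAR / MIN_SUBMATCH_GUESSES_MULTI_CHAR values, and 'user_inputs' is the dictionary name the matcher appears to use for the per-request dictionary):

// Hypothetical min-guesses floor with the proposed user-inputs special case;
// this sketches the idea, it is not current zxcvbn behavior.
function min_guesses_for_match(match, password, user_inputs_size) {
    if (match.token.length >= password.length) return 1;
    if (match.pattern === 'dictionary' && match.dictionary_name === 'user_inputs') {
        // Floor at the size of the user-inputs dictionary (4 in my example)
        // instead of MIN_SUBMATCH_GUESSES_MULTI_CHAR.
        return user_inputs_size;
    }
    return match.token.length === 1 ? 10 : 50;
}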

2) The dominating factor then becomes MIN_GUESSES_BEFORE_GROWING_SEQUENCE, called 'the additive D penalty term' in the paper. This factor tries to model the shorter sequences an attacker will already have tried (from other password lists, brute force, or whatever), and is currently set to 10,000. This penalty is what causes the number of guesses for a password made of user-input strings to balloon. There's an _exclude_additive parameter on most_guessable_match_sequence that suppresses this penalty, but it isn't part of the public API and isn't directly controllable.
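To make the scale concrete, here's the back-of-the-envelope version of the scoring formula from the paper, guesses ~= l! * product(guesses_i) + D^(l-1), for a four-match sequence with every match floored at 50 (an assumption on my part about which sequence the optimizer picked, but it reproduces the reported total exactly, and the additive term dominates regardless of the exact product):

// D = MIN_GUESSES_BEFORE_GROWING_SEQUENCE, l = number of matches in the sequence.
var D = 10000, l = 4, floor = 50;
var factorial = 4 * 3 * 2 * 1;                          // l! = 24
var product_term = factorial * Math.pow(floor, l);      // 24 * 6250000 = 150000000
var additive_term = Math.pow(D, l - 1);                 // 10000^3 = 1e12
console.log(product_term + additive_term);              // 1000150000000
console.log(Math.log10(product_term + additive_term));  // ~12.000065, matching guesses_log10

So D^3 alone contributes 10^12 and completely swamps the per-match product.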

It's hard to know what to do here. An attacker might well try combinations of user-specific inputs first, so maybe a sequence that's entirely (match=dictionary and dictionary_name=user-input) should never be charged the D penalty term. But what do we do for a sequence like <user-input>, <2-digit-number>, <user-input>? Is that varied enough to make charging D reasonable, and how do we decide?

We could run the scoring twice -- once with D applied and once without it -- and then take the lower score, but that also doubles the computational work required.
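For what it's worth, the "score twice" version would look roughly like this (a sketch only: it assumes the compiled lib/ modules are reachable from the npm package and that the private _exclude_additive flag keeps its current meaning, neither of which is a supported API):

var matching = require('zxcvbn/lib/matching');
var scoring = require('zxcvbn/lib/scoring');

function guesses_with_and_without_d(password, user_inputs) {
    // The real entry point lowercases and stringifies user_inputs first; skipped here.
    matching.set_user_input_dictionary(user_inputs);
    var matches = matching.omnimatch(password);
    // Run the sequence optimizer twice: once charging the additive D penalty,
    // once suppressing it via the private _exclude_additive flag.
    var with_d = scoring.most_guessable_match_sequence(password, matches, false);
    var without_d = scoring.most_guessable_match_sequence(password, matches, true);
    // Mirrors the "take the lower score" idea above, at the cost of scoring twice.
    return Math.min(with_d.guesses, without_d.guesses);
}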