generate_pattern_grid overcounts multiples of a correct letter

wfaulk commented 2 years ago

If a "guess" contains a letter more than once, and that letter appears in the "answer", then generate_pattern_grid marks every occurrence of that letter in the guess, regardless of how many times the letter appears in the answer.

Assume the guess is ayyyy and the answer is zazzz. We should expect to get 🟨⬛⬛⬛⬛. If we run that through generate_pattern_grid, we get:

>>> generate_pattern_grid(('ayyyy','zazzz'),('ayyyy','zazzz'))
array([[242,   1],
       [  3, 242]], dtype=uint8)

Our expected result must be one of those values.

242₁₀ == 22222₃, 1₁₀ == 00001₃, 3₁₀ == 00010₃

It appears that the digit-endianness here is the opposite of "normal" usage: the first letter of the guess corresponds to the least significant digit. If the assumed guess and answer were swapped, the result would be ⬛🟨⬛⬛⬛, which seems to match 00010₃, but let's quickly verify that we've got it the right way around:

>>> generate_pattern_grid(('ayyyy',),('zazzz',))
array([[1]], dtype=uint8)
>>> generate_pattern_grid(('ayyyy',),('zzazz',))
array([[1]], dtype=uint8)
>>> generate_pattern_grid(('ayyyy',),('zzzaz',))
array([[1]], dtype=uint8)
>>> generate_pattern_grid(('ayyyy',),('zzzza',))
array([[1]], dtype=uint8)
>>> generate_pattern_grid(('ayyyy',),('azzzz',))
array([[2]], dtype=uint8)

That looks right. The first list is the guesses, and the second list is the answers.
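To make the values below easier to read, here is a minimal decoding sketch. `decode_pattern` is a hypothetical helper, not part of wordle.py, and it assumes the convention inferred above: the first letter of the guess is the least significant base-3 digit, with 0 = ⬛, 1 = 🟨, 2 = 🟩.

```python
# Hypothetical helper, not part of wordle.py.
# Assumption: the first letter of the guess is the least significant ternary digit.
SQUARES = ("⬛", "🟨", "🟩")  # 0 = miss, 1 = misplaced, 2 = exact


def decode_pattern(value, length=5):
    """Return the emoji row for a pattern integer, first letter first."""
    squares = []
    for _ in range(length):
        # Peel off the lowest ternary digit; it belongs to the next letter.
        value, digit = divmod(value, 3)
        squares.append(SQUARES[digit])
    return "".join(squares)


print(decode_pattern(1))    # 🟨⬛⬛⬛⬛
print(decode_pattern(3))    # ⬛🟨⬛⬛⬛
print(decode_pattern(242))  # 🟩🟩🟩🟩🟩
```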

Okay, now let's guess aaaaa with the answer being zazzz. We should expect the result to be ⬛🟩⬛⬛⬛, or 00020₃, or 6₁₀. Let's see what the function says:

>>> generate_pattern_grid(('aaaaa',),('zazzz',))
array([[124]], dtype=uint8)

124₁₀, huh? That's 11121₃, or 🟨🟩🟨🟨🟨.

It appears that if a letter in the guess matches a letter in the answer, it gets to be yellow, no matter if it's been used before or not. Let's try something a little more subtle:

>>> generate_pattern_grid(('aayyy',),('zzzza',))
array([[4]], dtype=uint8)

4₁₀ == 00011₃ == 🟨🟨⬛⬛⬛, but it should be either 🟨⬛⬛⬛⬛ or ⬛🟨⬛⬛⬛.

Now let's try it with real words:

>>> generate_pattern_grid(('geese',),('camel',))
array([[93]], dtype=uint8)

93₁₀ == 10110₃ == ⬛🟨🟨⬛🟨, but it should have only one 🟨.
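The hypothesis can be written down as a tiny model. To be clear, this is not the code from wordle.py; `naive_pattern` is a hypothetical function that marks a letter green on an exact match and otherwise yellow whenever the letter occurs anywhere in the answer. It happens to reproduce the three outputs reported above.

```python
# A model of the suspected behavior, NOT the actual code from wordle.py.
def naive_pattern(guess, answer):
    """Green on an exact match, otherwise yellow if the letter is anywhere in the answer."""
    total = 0
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            digit = 2
        elif g in answer:
            digit = 1  # never checks whether that answer letter was already used
        else:
            digit = 0
        total += digit * 3**i  # first letter is the least significant digit
    return total


for guess, answer in [("aaaaa", "zazzz"), ("aayyy", "zzzza"), ("geese", "camel")]:
    print(guess, answer, naive_pattern(guess, answer))
# aaaaa zazzz 124
# aayyy zzzza 4
# geese camel 93
```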

The problem seems to be here:

https://github.com/3b1b/videos/blob/ad8427a1eab6e17eb469f42122b8463f5c1a803f/_2022/wordle.py#L183-L185

I'm not sure how to fix this efficiently.
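For reference, here is a sketch of the usual two-pass rule for a single guess/answer pair. It is not vectorized and not a proposed patch for generate_pattern_grid; `wordle_pattern` is a hypothetical name, and the left-to-right assignment of yellows is an assumption about the intended rule. It could serve as a slow oracle for spot-checking a faster fix.

```python
from collections import Counter


# Hypothetical reference scorer for one pair; not the wordle.py implementation.
def wordle_pattern(guess, answer):
    """Score one guess against one answer, first letter as the least significant digit."""
    digits = [0] * len(guess)
    remaining = Counter()

    # Pass 1: mark greens and count the answer letters that were not matched exactly.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            digits[i] = 2
        else:
            remaining[a] += 1

    # Pass 2: assign yellows left to right, each one consuming an unmatched answer letter.
    for i, g in enumerate(guess):
        if digits[i] == 0 and remaining[g] > 0:
            digits[i] = 1
            remaining[g] -= 1

    return sum(d * 3**i for i, d in enumerate(digits))


print(wordle_pattern("aaaaa", "zazzz"))  # 6
print(wordle_pattern("aayyy", "zzzza"))  # 1
print(wordle_pattern("geese", "camel"))  # 3
```

Under this rule, aayyy against zzzza comes out as 🟨⬛⬛⬛⬛, the first of the two acceptable results mentioned above.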

pak21 commented 2 years ago

I think I caught all the edge cases while developing my solver; there's a test suite here, if that helps: https://github.com/pak21/wordle-solver/blob/main/tests/test_signature.py

3b1b commented 2 years ago

Thanks! This was indeed a meaningful bug. I believe it's fixed now.

3b1b/videos, issue #23