AutoIDM / wordle

wordle analytics
MIT License
4 stars 1 forks source link

Find best two words #3

Open visch opened 2 years ago

visch commented 2 years ago
WITH guesses AS (

SELECT
  word,
  SUBSTRING(word, 1, 1) letter_one,
  SUBSTRING(word, 2, 1) letter_two,
  SUBSTRING(word, 3, 1) letter_three,
  SUBSTRING(word, 4, 1) letter_four,
  SUBSTRING(word, 5, 1) letter_five
FROM public.wordle

), answers AS (

SELECT
  word,
  SUBSTRING(word, 1, 1) letter_one,
  SUBSTRING(word, 2, 1) letter_two,
  SUBSTRING(word, 3, 1) letter_three,
  SUBSTRING(word, 4, 1) letter_four,
  SUBSTRING(word, 5, 1) letter_five
FROM answer

), crossjoin AS (

SELECT
  g1.word AS g1,
  g2.word AS g2,
  answers.word AS answer,
  CASE
    WHEN answers.letter_one IN (g1.letter_one, g1.letter_two, g1.letter_three, g1.letter_four, g1.letter_five, g2.letter_one, g2.letter_two, g2.letter_three, g2.letter_four, g2.letter_five) THEN 1
    ELSE 0
  END AS a1_match,
  CASE
    WHEN answers.letter_two IN (g1.letter_one, g1.letter_two, g1.letter_three, g1.letter_four, g1.letter_five, g2.letter_one, g2.letter_two, g2.letter_three, g2.letter_four, g2.letter_five) THEN 1
    ELSE 0
  END AS a2_match,
  CASE
    WHEN answers.letter_three IN (g1.letter_one, g1.letter_two, g1.letter_three, g1.letter_four, g1.letter_five, g2.letter_one, g2.letter_two, g2.letter_three, g2.letter_four, g2.letter_five) THEN 1
    ELSE 0
  END AS a3_match,
  CASE
    WHEN answers.letter_four IN (g1.letter_one, g1.letter_two, g1.letter_three, g1.letter_four, g1.letter_five, g2.letter_one, g2.letter_two, g2.letter_three, g2.letter_four, g2.letter_five) THEN 1
    ELSE 0
  END AS a4_match,
  CASE
    WHEN answers.letter_five IN (g1.letter_one, g1.letter_two, g1.letter_three, g1.letter_four, g1.letter_five, g2.letter_one, g2.letter_two, g2.letter_three, g2.letter_four, g2.letter_five) THEN 1
    ELSE 0
  END AS a5_match
FROM guesses g1
CROSS JOIN guesses g2
CROSS JOIN answers

), count_answers AS (

SELECT
  g1,
  g2,
  answer,
  a1_match + a2_match + a3_match + a4_match + a5_match AS total
FROM crossjoin

), maths_agg AS (

SELECT
  g1,
  g2,
  SUM(total) sum,
  AVG(total) avg,
  stddev(total) stddev,
  MAX(total) max,
  MIN(total) min
FROM count_answers
GROUP BY g1, g2

), final AS (

SELECT
  *
FROM maths_agg

)
SELECT
  *
FROM final
order by avg desc
limit 200
visch commented 2 years ago

Good use case for an article on OLTP vs OLAP I believe.

matsonj commented 2 years ago

alright lets see what can be done here...

matsonj commented 2 years ago

this query as written is too large to execute on my local postgres box. will try to break it into small pieces and optimize.

matsonj commented 2 years ago

very tactically there are a couple of optimizations that should make this query much faster:

  1. Reduce guesses set to ignore any words with duplicate letters. Although accuracy will reduce, we should see a speed increase and frankly a better result.
  2. only join guesses to each other that have entirely different letter sets. i.e. unique 10 letter combinations.

As it relates to hand optimizing, I used the top 15 letters from this query to identify 3 words that meet that criteria: CLUED - PYRAN - MOIST. So I think finding the 10 letter combination with the most frequent "hit rate" and then using that find two word combinations inside would be most practical for two word optimization.

visch commented 2 years ago

The duplicate letters idea is good I think, It'd at least be curious to run them both out and see the impact.

I like the practicality point as well, at a certain point it's like how accurate can we actually get here?

The most interesting point from an accuracy perspective is https://notfunatparties.substack.com/p/wordle-solver which also has some good ideas, a commenter named Sam also has some interesting ones. There's much room here to improve!

matsonj commented 2 years ago

I'm first trying to get a query that will...run on my machine. haha.

here is what I have - broken into two parts.

1 - combining the top 500 matches* from the first word with the remaining words:

SELECT 
    g1.word as g1,
    g2.word as g2,
    g1.letter_one as g1_letter_one,
    g1.letter_two as g1_letter_two,
    g1.letter_three as g1_letter_three,
    g1.letter_four as g1_letter_four,
    g1.letter_five as g1_letter_five,
    g2.letter_one as g2_letter_one,
    g2.letter_two as g2_letter_two,
    g2.letter_three as g2_letter_three,
    g2.letter_four as g2_letter_four,
    g2.letter_five as g2_letter_five
from "guesses" g2
join "final_wordlist" g1 ON 
    g2.word not like '%' || g1.letter_one || '%' and
    g2.word not like '%' || g1.letter_two || '%' and
    g2.word not like '%' || g1.letter_three || '%' and
    g2.word not like '%' || g1.letter_four || '%' and
    g2.word not like '%' || g1.letter_five || '%'

2 - using the reduced list from above (615k combinations) to match against answers

SELECT
  guess.g1 AS g1,
  guess.g2 AS g2,
  answers.word AS answer,
  CASE
    WHEN answers.letter_one IN (g1_letter_one, g1_letter_two, g1_letter_three, g1_letter_four, g1_letter_five, g2_letter_one, g2_letter_two, g2_letter_three, g2_letter_four, g2_letter_five ) THEN 1
    ELSE 0
  END AS a1_match,
  CASE
    WHEN answers.letter_two IN (g1_letter_one, g1_letter_two, g1_letter_three, g1_letter_four, g1_letter_five, g2_letter_one, g2_letter_two, g2_letter_three, g2_letter_four, g2_letter_five ) THEN 1
    ELSE 0
  END AS a2_match,
  CASE
    WHEN answers.letter_three IN (g1_letter_one, g1_letter_two, g1_letter_three, g1_letter_four, g1_letter_five, g2_letter_one, g2_letter_two, g2_letter_three, g2_letter_four, g2_letter_five ) THEN 1
    ELSE 0
  END AS a3_match,
  CASE
    WHEN answers.letter_four IN (g1_letter_one, g1_letter_two, g1_letter_three, g1_letter_four, g1_letter_five, g2_letter_one, g2_letter_two, g2_letter_three, g2_letter_four, g2_letter_five ) THEN 1
    ELSE 0
  END AS a4_match,
  CASE
    WHEN answers.letter_five IN (g1_letter_one, g1_letter_two, g1_letter_three, g1_letter_four, g1_letter_five, g2_letter_one, g2_letter_two, g2_letter_three, g2_letter_four, g2_letter_five ) THEN 1
    ELSE 0
  END AS a5_match
FROM "guesses_2" guess
CROSS JOIN "answers" answers

I still might run into an out-of-memory exception here - this second query is quite large ~ 1.8B rows and my linux VM is quite small. so may need to reduce thread count + partition out the answer.

As it relates to the links above, taking a look at those to see if I can integrate any code.

/* limit is based on visual inspection of matches which shows a breakpoint around the top 500 letters. the assumption is that it is unlikely that both words in the ideal two-word combination will be outside the 500-word limit

matsonj commented 2 years ago

Ok so it works, but it takes about 2 hours to run. Need to take a look at indexing, given that the created crossjoin table is 1.4B rows. indexes on the base 500 words should work nicely.