danmermel / cryptario

Cryptic crossword solver
0 stars 0 forks source link

Speed up anagram solutions #128

Closed glynnbird closed 3 years ago

glynnbird commented 3 years ago

I postulate that our anagram solver is slow at the moment because the Node.js code has to load a large dictionary. If there were many more smaller dictionaries, it would be quicker.

There are currently 590598 solutions in our "combined.txt" file which produce the following JSON dictionaries:

-rw-r--r-- 1 daniel daniel  235 Jul 27 10:50 dictionary1.json
-rw-r--r-- 1 daniel daniel 4.2K Jul 27 10:50 dictionary2.json
-rw-r--r-- 1 daniel daniel  23K Jul 27 10:50 dictionary3.json
-rw-r--r-- 1 daniel daniel  85K Jul 27 10:50 dictionary4.json
-rw-r--r-- 1 daniel daniel 203K Jul 27 10:50 dictionary5.json
-rw-r--r-- 1 daniel daniel 409K Jul 27 10:50 dictionary6.json
-rw-r--r-- 1 daniel daniel 616K Jul 27 10:50 dictionary7.json
-rw-r--r-- 1 daniel daniel 807K Jul 27 10:50 dictionary8.json
-rw-r--r-- 1 daniel daniel 887K Jul 27 10:50 dictionary9.json
-rw-r--r-- 1 daniel daniel 807K Jul 27 10:50 dictionary10.json
-rw-r--r-- 1 daniel daniel 697K Jul 27 10:50 dictionary11.json
-rw-r--r-- 1 daniel daniel 569K Jul 27 10:50 dictionary12.json
-rw-r--r-- 1 daniel daniel 430K Jul 27 10:50 dictionary13.json
-rw-r--r-- 1 daniel daniel 304K Jul 27 10:50 dictionary14.json
-rw-r--r-- 1 daniel daniel 469K Jul 27 10:50 dictionary15.json

Note that the dictionaries for 6-15 letter solutions are over 0.5 MB in size.

Currently the anagram text is processed to lowercase, remove non alphabetic characters and sorted so that:

Taylor Swift! --> afilorsttwy

It is then saved in a JavaScript object, keyed on afilorsttwy where each dictionary is "sharded" on the number of letters in the solution, so this solution goes in the anagramSolutions11.json with all the other 11 letter solutions.

Solution 1 - more smaller dictionaries i.e. shard on something else

To make smaller sub-dictionaries, we could shard by the first three letters of the sorted string e.g.

which would result in a maximum of 17576 dictionary files (26^3), with the data more evenly distributed.

Solution 2 - use a database

We could turn combined.txt into a SQLite database which gets bundled into the Lambda image. This would require us to have a table (id, clue, solution), an index on the clue field and a big old CSV/TSV file containing clue->solution pairs which we derive from combined.txt which we bulk import into SQLite at deployment-time. (A SQLite database is just a file, so is easily bundled into a Lambda deployment).

At query time we would do SELECT * from solutions WHERE clue='afilorsttwy' against the bundled SQLite database.


A first step would be to measure invocation times and via logging, see where the slowness is - it surely must be in the the require of the large dictionary files.

glynnbird commented 3 years ago

I did some benchmarking in a Lambda function timing how long to require each JSON directory in turn. It's not that bad:

  | 2020-07-29T15:13:57.929+01:00 | 2020-07-29T14:13:57.918Z undefined INFO 0.003000020980834961
  | 2020-07-29T15:13:57.956+01:00 | 2020-07-29T14:13:57.929Z undefined INFO 0.010999917984008789
  | 2020-07-29T15:13:58.010+01:00 | 2020-07-29T14:13:57.956Z undefined INFO 0.02700018882751465
  | 2020-07-29T15:13:58.105+01:00 | 2020-07-29T14:13:58.010Z undefined INFO 0.053999900817871094
  | 2020-07-29T15:13:58.229+01:00 | 2020-07-29T14:13:58.105Z undefined INFO 0.09500002861022949
  | 2020-07-29T15:13:58.516+01:00 | 2020-07-29T14:13:58.228Z undefined INFO 0.12299990653991699
  | 2020-07-29T15:13:58.641+01:00 | 2020-07-29T14:13:58.516Z undefined INFO 0.2869999408721924
  | 2020-07-29T15:13:58.734+01:00 | 2020-07-29T14:13:58.640Z undefined INFO 0.12400007247924805
  | 2020-07-29T15:13:58.872+01:00 | 2020-07-29T14:13:58.733Z undefined INFO 0.09299993515014648
  | 2020-07-29T15:13:59.104+01:00 | 2020-07-29T14:13:58.872Z undefined INFO 0.13899993896484375
  | 2020-07-29T15:13:59.155+01:00 | 2020-07-29T14:13:59.104Z undefined INFO 0.23200011253356934
  | 2020-07-29T15:13:59.312+01:00 | 2020-07-29T14:13:59.155Z undefined INFO 0.05099987983703613
  | 2020-07-29T15:13:59.312+01:00 | 2020-07-29T14:13:59.312Z undefined INFO 0.15700006484985352

The numbers at the end are the time to load each anagramSolutionsX.json file in seconds.

glynnbird commented 3 years ago

Benchmarking SQLite3 loading 1 row from an indexed database with 600k rows:

Between 1 & 2 milliseconds.