Closed: glynnbird closed this issue 3 years ago
I did some benchmarking in a Lambda function, timing how long it takes to require
each JSON dictionary in turn. It's not that bad:
| 2020-07-29T15:13:57.929+01:00 | 2020-07-29T14:13:57.918Z undefined INFO 0.003000020980834961
| 2020-07-29T15:13:57.956+01:00 | 2020-07-29T14:13:57.929Z undefined INFO 0.010999917984008789
| 2020-07-29T15:13:58.010+01:00 | 2020-07-29T14:13:57.956Z undefined INFO 0.02700018882751465
| 2020-07-29T15:13:58.105+01:00 | 2020-07-29T14:13:58.010Z undefined INFO 0.053999900817871094
| 2020-07-29T15:13:58.229+01:00 | 2020-07-29T14:13:58.105Z undefined INFO 0.09500002861022949
| 2020-07-29T15:13:58.516+01:00 | 2020-07-29T14:13:58.228Z undefined INFO 0.12299990653991699
| 2020-07-29T15:13:58.641+01:00 | 2020-07-29T14:13:58.516Z undefined INFO 0.2869999408721924
| 2020-07-29T15:13:58.734+01:00 | 2020-07-29T14:13:58.640Z undefined INFO 0.12400007247924805
| 2020-07-29T15:13:58.872+01:00 | 2020-07-29T14:13:58.733Z undefined INFO 0.09299993515014648
| 2020-07-29T15:13:59.104+01:00 | 2020-07-29T14:13:58.872Z undefined INFO 0.13899993896484375
| 2020-07-29T15:13:59.155+01:00 | 2020-07-29T14:13:59.104Z undefined INFO 0.23200011253356934
| 2020-07-29T15:13:59.312+01:00 | 2020-07-29T14:13:59.155Z undefined INFO 0.05099987983703613
| 2020-07-29T15:13:59.312+01:00 | 2020-07-29T14:13:59.312Z undefined INFO 0.15700006484985352
The numbers at the end are the time to load each anagramSolutionsX.json
file in seconds.
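The timing loop might look something like the following sketch (the `timeRequire` helper and the `anagramSolutionsN.json` paths are my assumptions, not the actual benchmark code):

```javascript
// Sketch: time how long a require() takes, in seconds.
// Note that require() caches modules, so each dictionary is only
// really loaded (and therefore measurable) once per process.
function timeRequire (modulePath) {
  const start = process.hrtime.bigint()
  require(modulePath)
  return Number(process.hrtime.bigint() - start) / 1e9 // ns -> seconds
}

// e.g. (paths are an assumption):
// for (let n = 3; n <= 15; n++) {
//   console.log(timeRequire(`./anagramSolutions${n}.json`))
// }
```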
Benchmarking SQLite3 loading 1 row from an indexed database with 600k rows: between 1 and 2 milliseconds.
I postulate that our anagram solver is slow at the moment because the Node.js code has to load a large dictionary. If there were many more smaller dictionaries, it would be quicker.
There are currently 590598 solutions in our "combined.txt" file which produce the following JSON dictionaries:
Currently the anagram text is lower-cased, stripped of non-alphabetic characters, and sorted, so that:
Taylor Swift! --> afilorsttwy
It is then saved in a JavaScript object, keyed on
afilorsttwy
where each dictionary is "sharded" on the number of letters in the solution, so this solution goes in anagramSolutions11.json
with all the other 11-letter solutions.
Solution 1 - more, smaller dictionaries, i.e. shard on something else
To make smaller sub-dictionaries, we could shard by the first three letters of the sorted string e.g.
which would result in a maximum of 17576 dictionary files (26^3), with the data more evenly distributed.
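The normalisation plus the proposed shard key could be sketched like this (`sortedClue` is a hypothetical name, not from the codebase):

```javascript
// Lower-case, strip non-alphabetic characters, sort the letters.
function sortedClue (text) {
  return text.toLowerCase().replace(/[^a-z]/g, '').split('').sort().join('')
}

const clue = sortedClue('Taylor Swift!') // 'afilorsttwy'
const shard = clue.slice(0, 3)           // 'afi' -> picks the sub-dictionary
```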
Solution 2 - use a database
We could turn
combined.txt
into a SQLite database which gets bundled into the Lambda image. This would require a table (id, clue, solution), an index on the clue
field, and a big old CSV/TSV file containing clue->solution pairs, derived from combined.txt, which we bulk-import into SQLite at deployment time. (A SQLite database is just a file, so it is easily bundled into a Lambda deployment.)
At query time we would run
SELECT * from solutions WHERE clue='afilorsttwy'
against the bundled SQLite database.
A first step would be to measure invocation times via logging, to see where the slowness is - it surely must be in the
require
of the large dictionary files.
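For Solution 2, the deployment-time schema and query-time lookup described above might look like this (the table and index names are assumptions):

```sql
-- deployment time: build the bundled database from the
-- clue->solution pairs derived from combined.txt
CREATE TABLE solutions (id INTEGER PRIMARY KEY, clue TEXT, solution TEXT);
CREATE INDEX solutions_clue_idx ON solutions (clue);

-- query time: a single indexed lookup, which benchmarked at
-- 1-2 ms against 600k rows
SELECT * FROM solutions WHERE clue = 'afilorsttwy';
```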