RyanEdwardHall / anagrambler

Finds all anagrams and sub-anagrams in a string
MIT License
4 stars 1 forks source link

New dictionary data broke the tests' expected results values #18

Open mcgid opened 6 years ago

mcgid commented 6 years ago

Commit f2be691e37b3061aa04331a674bd07129ecf5c31 changed the dictionary data, but this changed the expected results in anagrambler_test.go.

(As an aside, all we check is that the correct number of results are returned, not what those results actually are. That's probably something we should revisit at some time.)

As far as calculating what the correct numbers are, I vaguely recall writing a separate tool that does things in a more brute-forcey way, but that clearly wouldn't work for the 182-character test word. However we figure out the new expected results for the tests, we should document it somewhere, and if it's using a separate piece of code that uses a different algorithm, that should probably be a part of this source tree because of situations just like this.

mcgid commented 6 years ago

I think fixing this issue is basically the required next step for the project. Trivially, it's necessary to be able to run the tests and the benchmarks, which we (probably) need before making any other changes.

But more fundamentally, we need to make sure that the results it's producing are actually useful. Without comparing the results of this algorithm to a different one, we can't be sure that what we're making works, and thus whether any improvements are actually helping.

The question then becomes: what alternative algorithm should we use in this contemplated new tool. Simple is obviously desirable, but the simplest approach of letter-shuffling-then-grepping suffers from combinatorial explosion.

A variation might be more useful, and might actually provide a good point of comparison: take the dictionary file, load its words into a trie (no letter sorting, just a normal trie), and then let the naive letter shuffling be constrained by the trie.

In other words, if we're generating anagrams of fmsltez, for each shuffling of the letters, traverse the trie for each letter:

  1. (root) -> f exists. Continue.
  2. (root) -> f -> m doesn't exist. Therefore all potential shufflings starting with fm are invalid. Shuffle letters after f. (For argument, say we shuffle to festzlm.)
  3. (root) -> f -> e exists. Continue.
  4. (root) -> f -> e -> s exists. Continue
  5. (root) -> f -> e -> s -> t exists and has the word 'fest' attached. Add to results. Continue.
  6. (root) -> f -> e -> s -> t -> z doesn't exist. Shuffle letters after fest.

At any point where a word is attached to a node, record it as a result. Descend recursively.

If my thinking is correct, this method should allow the naive shuffling method to produce all valid results for a search string, given enough time, while vastly cutting down on its search space. It should also be much, much less efficient than the anagrambler algorithm.

My hope in sketching this out is that it's a simple enough algorithm that we can rely on its correctness (and possibly prove it correct in some meaningful way?) while not having to wait until the sun to bloom and shrivel for it to finish execution. (For small enough n; this still wouldn't be able to take on multi-hundred-character search strings, I think.)


Once this proposed tool exists, we can compare the results with anagrambler. If they agree, and we can be reasonably certain the simpler algorithm is correct, then we can just update the test data expected results.

A more correct fix would be to change the testing to compare the actual produced results to an expected list, and I think that's the way to go. Slower, but better.

We should also try to improve the tests by finding edge case data and other possible places where things could break down; I think the current data is too simplistic. But that's another issue.