RyanEdwardHall / anagrambler

Finds all anagrams and sub-anagrams in a string
MIT License
4 stars 1 forks source link

Replace strings with byte slices everywhere #16

Closed mcgid closed 8 years ago

mcgid commented 8 years ago

NOTE: Also, stop lowercasing dictionary data. See below.

Before, we were converting the dictionary from []byte to string and then sorting and splitting it. Many allocs, splits, joins.

Now we keep everything as []byte. Since we've moved from strings to []bytes, we can't just range over the strings to step through the UTF-8 code points. Now we step through the slices by decoding the runes one at a time using utf8.DecodeRune().

It still makes sense for the external API for Add() to accept just a string arg, so meat of that function has been moved to an internal add().

add() doesn't create the sorted word itself, to allow Open() the ability to take potentially more efficient steps to allocating, copying and sorting the dictionary data all at once.

The sorting method used is the built-in sort.Sort(), but this is not necessarily the most efficient. Testing elsewhere indicates that an in-place Shell sort with the right parameters could be much faster. This isn't worth exploring now, however, since benchmarking on my machine indicates that of the ~1.2s of Open() time, only about 50ms are spent sorting.

Lowercasing

This also stops lowercasing the dictionary data as we add it. Case folding is a bit of a complicated topic, and Scrabble contains very specific, limited character sets. We shouldn't make assumptions about changing the input data; instead, we should just use data appropriate for our puprpose.

The data currently in the repo in use for testing isn't appropriate. We need to fix that at some point.

mcgid commented 8 years ago

@RyanEdwardHall If you happen to get some time, I'd appreciate it if you could have a look at this before we merge it. The changes don't have a visible impact on anything to do with the core algorithm, but it's still some notable changes in how things work.

RyanEdwardHall commented 8 years ago

I'll take a look tomorrow morning! I'm a bit surprised that the native sort isn't the fastest. Is this noticeably quicker?

mcgid commented 8 years ago

No, definitely not. The sorting really isn't taking up much time at this point.

My earlier huge Balrog win was, sadly, not due to the reimplemented sort. That speed up was actually a result of the main changes in this PR: namely, removing the string conversion, split, sort, and join.

I also chopped a huge amount of time out by making some changes that were not Unicode-friendly. I knew they were artificial wins at the time, and that when I actually implemented UTF-clean it would be much slower, so I kind of set myself up for sadness.

I have however spent some time making a set of benchmarks of sorting algorithms, which you might want to look at. Because of limitations in Go's testing apparatus I ended up writing a code generator which, in my opinion, works in a rather nice way.

The results of that little project showed that the built-in sort, which is implemented as qucksort and falls back to heap and insertion where they would be faster, ends up being slower purely because it has to deal with the Interface abstraction. It does an alloc for each sort because it converts from the interface to a concrete allocated type, and this is a few orders of magnitude slower on small data sizes.

That makes sense in general, though, because quicksort, merge, heap, etc—all the nlogn sorts—are decimated by a simple insertion for small n. In theory the built-in should just fall back to insertion with the data we're giving it, but it takes some precautions and does a few things that are just unnecessary with our dataset, and thus hand-rolled can be faster. I'm toying with the idea of implementing the winner of my tests (Shell sort) in anagrambler, but I'll probably ask your opinion first.

Sorry for the length of this email; I didn't have time to write something shorter.

On Friday, 17 June 2016, Ryan Hall notifications@github.com wrote:

I'll take a look tomorrow morning! I'm a bit surprised that the native sort isn't the fastest. Is this noticeably quicker?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/RyanEdwardHall/anagrambler/pull/16#issuecomment-226890801, or mute the thread https://github.com/notifications/unsubscribe/AGA-7nZioRyNFvOnOIBERKB4VwnyUpaLks5qMxfJgaJpZM4I4tvt .