bzaar / DawgSharp

DAWG String Dictionary in C#
http://www.nuget.org/packages/DawgSharp/
GNU General Public License v3.0
118 stars 18 forks source link

StackOverflowException #2

Closed ghost closed 10 years ago

ghost commented 10 years ago

BuildDawg failed using the word list from https://raw.github.com/eneko/data-repository/master/data/words.txt due to StackOverflowException generated from LevelBuilder.GetLevel method.

bzaar commented 10 years ago

My project's first bug. How exciting!

Too bad I couldn't recreate it. I've added this test: 50908ca8f10419106a84b1259d12c956f771519c. @drdigit, could you run it up and see if you get the exception?

ghost commented 10 years ago

I got the exception again because the file I used was malformed. At the end of it there were a lot of extra bytes and one of the parsed lines was really long. Therefore the exception was more than expected.

I added the following method as a quick workaround to make the code a bit safer.

public bool InsertWithLimit(string key, TPayload value)
{
    if (key == null || key.Length > 2048)
        return false;
    var node = root;
    foreach (char c in key)
        node = node.GetOrAddEdge(c);
    node.Payload = value;
    return true;
}

InsertWithLimit will let me know that it "refused" to insert the chemical name of the titin protein (180K+ characters) without throwing an exception. IEnumerable characters for a key are replaced with a string because of the performance of Count method. The 2048 limit is random and safe; it could be 4096 but not 16384.

bzaar commented 10 years ago

Ah! So the problem is long strings, not long lists. It's fixed now. Pushed version 1.0.6 to NuGet.

bzaar commented 10 years ago

Mind you, it will not break now, but performance may suffer. 200K strings is not really what DAWG was designed for. It will handle them, but they will take 3x more disk than a UTF-16 text file. Unless there are many of such chemical names and they share long common prefixes or suffixes in which case you might benefit from using DAWG.

ghost commented 10 years ago

The fix worth a lot of stars. Kudos.