akade / Akade.IndexedSet

A convenient data structure supporting efficient in-memory indexing and querying, including range queries and fuzzy string matching.
MIT License

FullTextIndex is much slower than LINQ when infix length is < 3 #103

Closed mertvn closed 2 months ago

mertvn commented 6 months ago

Hi, I encountered this problem while trying to use this library for a real-world scenario, but it reproduces on your benchmarks as well, so I just used those. I tested it on commit 7da0898 using .NET 7, but I doubt using main and .NET 8 would make a big difference. I'm not sure whether this is a design limitation or a bug in the current implementation of the FullTextIndex, but I thought I'd let you know in any case, as this issue prevents me from using this otherwise wonderful library.

Add to FullTextIndexBenchmarks.cs:

[Benchmark(Baseline = true)]
[BenchmarkCategory("Contains1")]
public Document[] Contains_Linq_1()
{
    return _document.Where(x => x.Content.Contains("a")).ToArray();
}

[Benchmark]
[BenchmarkCategory("Contains1")]
public Document[] Contains_IndexedSet_1()
{
    return _indexedSet.Contains(x => x.Content, "a").ToArray();
}

[Benchmark(Baseline = true)]
[BenchmarkCategory("Contains2")]
public Document[] Contains_Linq_2()
{
    return _document.Where(x => x.Content.Contains("ar")).ToArray();
}

[Benchmark]
[BenchmarkCategory("Contains2")]
public Document[] Contains_IndexedSet_2()
{
    return _indexedSet.Contains(x => x.Content, "ar").ToArray();
}

[Benchmark(Baseline = true)]
[BenchmarkCategory("Contains3")]
public Document[] Contains_Linq_3()
{
    return _document.Where(x => x.Content.Contains("aru")).ToArray();
}

[Benchmark]
[BenchmarkCategory("Contains3")]
public Document[] Contains_IndexedSet_3()
{
    return _indexedSet.Contains(x => x.Content, "aru").ToArray();
}

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 10 (10.0.19045.2486/22H2/2022Update)
AMD Ryzen 5 4500U with Radeon Graphics, 1 CPU, 6 logical and 6 physical cores
.NET SDK 7.0.404
  [Host]   : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2

Job=.NET 7.0 Runtime=.NET 7.0

Method Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Contains_Linq 10,053.4 ns 196.41 ns 183.72 ns 9,983.5 ns 1.00 0.00 0.1831 408 B 1.00
Contains_IndexedSet 733.2 ns 4.34 ns 4.06 ns 732.5 ns 0.07 0.00 0.1907 400 B 0.98
Contains_Linq_1 17,816.9 ns 112.16 ns 104.92 ns 17,797.6 ns 1.00 0.00 7.5989 15928 B 1.00
Contains_IndexedSet_1 648,298.9 ns 1,514.33 ns 1,182.29 ns 648,134.2 ns 36.40 0.22 19.5313 41464 B 2.60
Contains_Linq_2 12,245.6 ns 159.01 ns 148.74 ns 12,211.4 ns 1.00 0.00 1.9836 4160 B 1.00
Contains_IndexedSet_2 68,340.0 ns 802.15 ns 750.33 ns 68,749.7 ns 5.58 0.09 2.0752 4360 B 1.05
Contains_Linq_3 11,070.7 ns 194.70 ns 182.12 ns 11,141.4 ns 1.000 0.00 0.0305 72 B 1.00
Contains_IndexedSet_3 109.4 ns 1.46 ns 1.37 ns 109.6 ns 0.010 0.00 0.0726 152 B 2.11
FuzzyContains_Linq 12,447,057.9 ns 247,356.10 ns 725,452.63 ns 12,699,478.1 ns 1.00 0.00 2296.8750 4831746 B 1.000
FuzzyContains_IndexedSet 261,437.2 ns 5,098.19 ns 7,937.26 ns 261,592.0 ns 0.02 0.00 - 608 B 0.000
akade commented 6 months ago

Hi, thanks for your detailed feedback. Highly appreciated!

I guess it's kind of by design. The full text index uses a suffix trie, and hence each element is added multiple times within the trie, i.e. when you add something with "foo" as the key, it is also added as "oo" and "o". If you now query for a short string, the matching node is very close to the root and has lots of children, often containing the same element more than once: "foo" will be in the result list twice when querying with 'o'. Hence, the simple linear scan can be faster, because the index is actually doing much more work in such cases.
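To make this concrete, here is a minimal sketch of the idea (not the library's actual trie implementation):

// Every key is stored once per suffix, so a short query term can match
// many entries, including the same element more than once.
static IEnumerable<string> Suffixes(string key)
{
    for (int i = 0; i < key.Length; i++)
    {
        yield return key[i..];
    }
}

// "foo" contributes "foo", "oo" and "o"; querying for "o" matches two of
// those entries ("oo" and "o"), both pointing back to the same element.
var entries = new[] { "foo", "bar" }.SelectMany(Suffixes).ToList();
int hitsForO = entries.Count(e => e.StartsWith('o')); // 2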

I quickly looked into the implementation and found a few knobs I can turn without too much work. However, I do not know in advance how big the improvements will be. There would also be the possibility of special casing small values, or algorithmic/data-structure changes such as merging long leaves or maybe switching to a DAG...

Another idea is limiting the number of results; would that be sufficient for your use case?

I will look into it and keep you updated.

mertvn commented 6 months ago

I quickly looked into the implementation and found a few knobs I can turn without too much work.

Excited to see the results of this.

There would also be the possibility of special casing small values

I'm not sure how much this would help in my use case, as I have a lot of Japanese text that needs to be searched, which means thousands of unique characters. However, it might be a good idea overall for people who only have ASCII text.

Another idea is limiting the number of results; would that be sufficient for your use case?

Unfortunately not. My use-case requires me to search the data multiple times in different ways, concatenating the results from every call, and then ordering them at the end to be able to get the correct result.

Thanks for looking into it. I might be able to provide a full reproduction of my use-case if these improvements aren't enough.

akade commented 6 months ago

Here are the results of looking into it:

While I got some improvements, they turned out to be limited to memory consumption (~10%). There is more potential through pooling and the like, but not much on the runtime side.

After that experiment, I came to the conclusion that the main problem really is the theoretical runtime complexity, and no amount of tweaking will yield big perf improvements, as LINQ is also heavily optimized. LINQ is O(n), while querying the suffix trie always includes an O(occ) term, where occ is the number of occurrences found in the trie. Each entry of length m produces m entries in the suffix trie, so occ can be much larger than n for small queries that hit a node with lots of children. As a rough illustration: with n = 1,000 documents averaging 100 characters each, the trie holds about 100,000 suffix entries, and a single-character query that occurs in most documents can touch far more of them than a linear scan over the 1,000 originals.

This means that there needs to be an algorithmic change one way or another. So here is what I imagine could be the solution:

I'm currently improving the library to support multiple keys for all kinds of indexes (maybe except the range index), which might in fact be a solution for you. Common techniques are to index a document by its words (or by n-grams, ...), and that feature will allow exactly that. This makes occ much smaller, as the suffix trie becomes much shallower.
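For illustration, indexing by words could look roughly like this (a sketch; it assumes the multi-key overload accepts an IEnumerable<string> key selector, as the diff further down suggests):

_indexedSet = _document.ToIndexedSet()
                       // each word becomes its own short key, so the suffix
                       // trie stays shallow and occ stays small
                       .WithFullTextIndex(x => x.Content.Split(' ', StringSplitOptions.RemoveEmptyEntries))
                       .Build();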

Results of the prototype:

Method Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
Contains_Linq 5,685.6 ns 11.50 ns 10.75 ns 1.00 0.00 0.0229 408 B 1.00
Contains_IndexedSet 399.7 ns 3.73 ns 3.49 ns 0.07 0.00 0.0124 208 B 0.51
Contains_Linq_1Letter 8,046.4 ns 67.72 ns 60.03 ns 1.00 0.00 0.9460 15928 B 1.00
Contains_IndexedSet_1Letter 344,300.6 ns 1,140.94 ns 1,067.24 ns 42.78 0.35 0.9766 18336 B 1.15
Contains_IndexedSet_1Letter_Words 51,951.9 ns 113.15 ns 105.84 ns 6.46 0.05 1.0376 18336 B 1.15
Contains_IndexedSet_1Letter_NGrams2 46,845.9 ns 131.73 ns 123.22 ns 5.82 0.04 1.0376 18336 B 1.15
Contains_IndexedSet_1Letter_NGrams4 123,799.1 ns 164.19 ns 145.55 ns 15.39 0.11 0.9766 18336 B 1.15
Contains_Linq_2Letter 6,636.4 ns 15.34 ns 14.34 ns 1.00 0.00 0.2441 4160 B 1.00
Contains_IndexedSet_2Letter 47,270.2 ns 119.04 ns 111.35 ns 7.12 0.02 0.1831 3912 B 0.94
Contains_IndexedSet_2Letter_Words 5,222.8 ns 24.31 ns 20.30 ns 0.79 0.00 0.2289 3912 B 0.94
Contains_IndexedSet_2Letter_NGrams2 3,219.8 ns 15.61 ns 13.84 ns 0.49 0.00 0.2327 3912 B 0.94
Contains_IndexedSet_2Letter_NGrams4 9,997.8 ns 10.01 ns 9.37 ns 1.51 0.00 0.2289 3912 B 0.94

As you can see, the results are improved dramatically but still heavily depend on what you index and the query length. Do you think that would be a solution for you? I'd be willing to test that on your data if you want to provide a full reproduction.

However, if you have enough memory to throw around, a workaround (that works right now) can also be special casing:

Method Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
Contains_Linq_1Letter 8,193.9 ns 29.38 ns 26.04 ns 1.00 0.00 0.9460 15928 B 1.00
Contains_IndexedSet_1Letter_SpecialCased 622.7 ns 2.15 ns 1.80 ns 0.08 0.00 0.1364 2288 B 0.14
Contains_Linq_2Letter 6,653.2 ns 20.51 ns 19.19 ns 1.00 0.00 0.2441 4160 B 1.00
Contains_IndexedSet_2Letter_SpecialCased 160.0 ns 0.42 ns 0.33 ns 0.02 0.00 0.0315 528 B 0.13

It works like this:

_indexedSet = _document.ToIndexedSet()
                       .WithFullTextIndex(x => x.Content)
                       // special case for a single char, O(1); z gives the index a different name (you might want to use a static function)
                       .WithIndex<char>(z => z.Content)
                       .WithIndex(x => NGrams(x.Content, 2))  // special case for 2-gram O(1) 
                       .Build();

IEnumerable<Document> Query(string query)
{
    return query switch
    {
       [char a] => _indexedSet.Where<char>(z => z.Content, a),
       string { Length: 2 } twoGram => _indexedSet.Where(x => NGrams(x.Content, 2), twoGram),
       string fullText => _indexedSet.Contains(x => x.Content, fullText)
    };
}

private static IEnumerable<string> NGrams(string text, int nGramSize)
{
    // <= so that the final n-gram at the end of the string is included
    for (int i = 0; i <= text.Length - nGramSize; i++)
    {
        yield return text.Substring(i, nGramSize);
    }
}
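
For example, NGrams("hello", 2) enumerates every window of length 2:

foreach (string gram in NGrams("hello", 2))
{
    Console.WriteLine(gram); // prints: he, el, ll, lo
}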

Integrating that into the FullTextIndex itself might also be worth a try, but that would be orthogonal to the former improvement anyway, and I'd still have to think about a good heuristic or API for it...

mertvn commented 6 months ago

I actually came across this library while looking for a C# equivalent of PostgreSQL's trigram indexes, so I think using ngrams is definitely the way to go for this. Unfortunately it looks like the improvements are not good enough for my use-case. I really need it to be faster than LINQ for 1 and 2 letter cases.

The special-casing approach looks promising, but I couldn't get your example to work because I need both Contains and StartsWith queries against the data, and it looks like _indexedSet.StartsWith<char>(z => z.Content, a) is not a thing?

I'll try to create a reproduction of my use-case, but it might take a while. Thanks again for looking into this.

akade commented 5 months ago

I actually came across this library while looking for a C# equivalent of PostgreSQL's trigram indexes, so I think using ngrams is definitely the way to go for this. Unfortunately it looks like the improvements are not good enough for my use-case. I really need it to be faster than LINQ for 1 and 2 letter cases.

Good to hear that ngrams would be useful to you. Motivates me to keep working on the feature. I may also include some full text helpers for NGrams and similar, though that would come in a second version...

The special-casing approach looks promising, but I couldn't get your example to work because I need both Contains and StartsWith queries against the data, and it looks like _indexedSet.StartsWith(z => z.Content, a) is not a thing?

Yes, that is not supported by that kind of index (it is dictionary-based). Chaining a normal Where would work though, as the result set returned from the index should be much smaller than the entire set:

_indexedSet.Where<char>(z => z.Content, a).Where(z => z.Content.StartsWith(a))
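
Wrapped in a helper, the same idea looks like this (a sketch under this thread's assumptions: the set holds Document elements and Content is the indexed text):

// O(1) lookup via the char index, then a cheap post-filter on the small
// result set to get the StartsWith semantics.
IEnumerable<Document> StartsWithSingleChar(char a)
{
    return _indexedSet.Where<char>(z => z.Content, a)
                      .Where(doc => doc.Content.StartsWith(a));
}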

I'll try to create a reproduction of my use-case, but it might take a while. Thanks again for looking into this.

Thanks, that might be useful for me, especially since I'm only used to Latin/Germanic languages...

mertvn commented 5 months ago

I'll try to create a reproduction of my use-case

Here: https://github.com/mertvn/Akade.IndexedSet.Issue103

You may have to configure the path to the mst.json file for it to work. Beware that the data also includes some NSFW strings.

akade commented 5 months ago

I just merged the "multi-key everything" PR, and with that (a locally built version) I get much better results on your data. It works by using a 3-gram approach for contains queries and a normal prefix index for prefix queries. With your data, the 3-gram index is not ideal for prefix queries; hence the separate prefix indices. It should be OK memory-wise (though I have not measured it), as the suffix tries for the FullTextIndex should be much smaller with 3-grams.

Can you work with that? I hope that I can release a new version soon.

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.22631.3155)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
  [Host]   : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2

Job=.NET 8.0  Runtime=.NET 8.0  
Method Mean Error StdDev Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
Linq_Hiragana_1Character 5,268.0 μs 41.86 μs 39.15 μs 1.00 0.00 - - 65.90 KB 1.00
IndexedSet_Hiragana_1Character 480.5 μs 1.97 μs 1.75 μs 0.09 0.00 18.5547 0.9766 306.03 KB 4.64
New_IndexedSet_Hiragana_1Character 51.9 μs 0.12 μs 0.11 μs 0.01 1.5869 - - 26.34 KB 0.40
Linq_Kanji_1Character 5,173.9 μs 13.25 μs 11.06 μs 1.00 0.00 - - 17.99 KB 1.00
IndexedSet_Kanji_1Character 147.7 μs 0.42 μs 0.39 μs 0.03 0.00 5.3711 - 91.50 KB 5.09
New_IndexedSet_Kanji_1Character 16.7 μs 0.04 μs 0.04 μs 0.00 0.5188 - - 8.75 KB 0.49
Linq_Latin_1Character 8,414.3 μs 69.18 μs 61.33 μs 1.00 0.00 31.2500 - 1130.97 KB 1.00
IndexedSet_Latin_1Character 30,501.2 μs 204.08 μs 190.90 μs 3.63 0.04 375.0000 62.5000 7806.23 KB 6.90
New_IndexedSet_Latin_1Character 393.6 μs 0.49 μs 0.43 μs 0.05 4.3945 - - 73.26 KB 0.06
Linq_Latin_2Character 5,620.1 μs 55.46 μs 51.88 μs 1.00 0.00 15.6250 - 271.75 KB 1.00
IndexedSet_2Character 3,094.5 μs 8.31 μs 7.77 μs 0.55 0.01 117.1875 19.5313 1942.75 KB 7.15
New_IndexedSet_2Character 28.1 μs 0.07 μs 0.06 μs 0.01 0.5493 - - 9.38 KB 0.03
+ #pragma warning disable AkadeIndexedSetEXP0001
public static class Autocomplete
{
[...]
+    public static IEnumerable<string> NGrams(string input, int n)
+    {
+        for (int i = 0; i < input.Length - n + 1; i++)
+        {
+            yield return input.Substring(i, n);
+        }
+    }

    public static void BuildIndexedSetMst(IEnumerable<AutocompleteMst> data)
    {
        SetMst = data.ToIndexedSet()
-            .WithFullTextIndex(x => x.MSTLatinTitle.NormalizeForAutocomplete())
-            .WithFullTextIndex(x => x.MSTNonLatinTitle.NormalizeForAutocomplete())
+            .WithPrefixIndex(x => x.MSTLatinTitle.NormalizeForAutocomplete())
+            .WithPrefixIndex(x => x.MSTNonLatinTitle.NormalizeForAutocomplete())
+            .WithFullTextIndex(x => NGrams(x.MSTLatinTitle.NormalizeForAutocomplete(), 3))
+            .WithFullTextIndex(x => NGrams(x.MSTNonLatinTitle.NormalizeForAutocomplete(), 3))
            .Build();
    }

[...]

    public static IEnumerable<string> SearchAutocompleteMstIndexed(string arg)
    {
        arg = arg.NormalizeForAutocomplete();

        var startsWith1 = SetMst.StartsWith(x => x.MSTLatinTitle.NormalizeForAutocomplete(), arg);
        var startsWith2 = SetMst.StartsWith(x => x.MSTNonLatinTitle.NormalizeForAutocomplete(), arg);
        var startsWith = startsWith1.Concat(startsWith2).OrderBy(x => x.MSTLatinTitle).ToArray();

-        var contains1 = SetMst.Contains(x => x.MSTLatinTitle.NormalizeForAutocomplete(), arg);
-        var contains2 = SetMst.Contains(x => x.MSTNonLatinTitle.NormalizeForAutocomplete(), arg);
-        var contains = contains1.Concat(contains2).OrderBy(x => x.MSTLatinTitle).ToArray();
+        IEnumerable<AutocompleteMst> containResult = [];
+
+        foreach(string ngram in NGrams(arg, 3))
+        {
+            containResult = containResult.Concat(SetMst.Contains(x => NGrams(x.MSTLatinTitle.NormalizeForAutocomplete(), 3), ngram));
+            containResult = containResult.Concat(SetMst.Contains(x => NGrams(x.MSTNonLatinTitle.NormalizeForAutocomplete(), 3), ngram));
+        }

+        var contains = containResult.Distinct()
+                                    .Where(d => d.MSTLatinTitleNormalized.Contains(arg) || d.MSTNonLatinTitleNormalized.Contains(arg))
+                                    .OrderBy(x => x.MSTLatinTitle)
+                                    .ToArray();

        var startsWithLT = startsWith.Select(x => x.MSTLatinTitle);
        var startsWithNLT = startsWith.Select(x => x.MSTNonLatinTitle);
[...]
mertvn commented 4 months ago

Looks really really good. Thanks for spending so much time on this.

By the way, is there any way you can either make the index building async, or make the built set serializable/deserializable? The reason I'm asking is that I will be using this in a Blazor WebAssembly project, and there building the index takes about 3 times longer than normal (over 5 seconds on my PC) while locking up the UI thread (i.e. freezing the user's browser tab completely), because multi-threading for Blazor WASM is still not implemented (it keeps being pushed back every release).

akade commented 4 months ago

Happy to help.

In general, there is nothing that prevents you from indexing on a separate thread. There is even a concurrent version for multi-threaded scenarios. However, JS & WASM do not really do multi-threading (async I/O yes, but your code always runs on the same thread). The .NET effort to make it possible using web workers is still ongoing: https://github.com/dotnet/runtime/issues/68162. So, unfortunately, you are out of luck here.
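Outside of WASM, indexing on a separate thread could look like this (a sketch, assuming the same _document collection and full text index from earlier in this thread):

// Build the set on a thread-pool thread so the UI stays responsive;
// on Blazor WASM this would still execute on the single available thread.
IndexedSet<Document> set = await Task.Run(() =>
    _document.ToIndexedSet()
             .WithFullTextIndex(x => x.Content)
             .Build());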

Serialization is something to explore, but it will probably take a significant amount of work to actually make it worthwhile, i.e. to ensure that the potential speed increase justifies the increased serialization payload size compared to serializing only the entities.

In general though, not much optimization work has been invested yet in making the creation faster... I'm sure there are improvements to be found there, and if there's demand for faster creation, I'm happy to look into it in the future.

However, I will finalize the "value-everything" version as a first step. Unfortunately, life happened in a very bad way and I'm not able to make any promises, only that I will get to it eventually. I will then close this issue and create a new one for exploring serialization.

mertvn commented 4 months ago

Understood. Hope everything is all right.

akade commented 2 months ago

@mertvn I'm in the process of publishing the changes :) thanks for being patient & friendly, highly appreciated!

The version is marked as preview, as I will evaluate whether Source Link works and check that the "multi-key everything" feature does not break the clients I'm aware of (which I don't expect it to). Happy to hear feedback from you on whether the feature is an improvement for you.