Closed mertvn closed 6 months ago
Hi, thanks for your detailed feedback. Highly appreciated!
I guess it's kind of by design. The full-text index uses a Suffix Trie, so each element is added multiple times within the trie: when you add something with "foo" as a key, it is also added as "oo" and "o". If you now query for a short string, the matching node is very close to the root and has lots of children, often containing the same node more than once: "foo" will be in the result list twice when querying with 'o'. Hence, the simple linear scan can be faster, because the index is actually doing much more work in such cases.
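To make the duplication concrete, here is a small Python sketch (illustrative only, not the library's implementation) of why short queries can return the same key repeatedly: each key of length m is stored once per suffix, i.e. m times.

```python
# Hypothetical flat suffix list standing in for a suffix trie.
def suffix_entries(keys):
    """Every key of length m contributes m suffix entries."""
    entries = []
    for key in keys:
        for i in range(len(key)):
            entries.append((key[i:], key))
    return entries

def lookup(entries, query):
    # A contains-query visits every suffix entry starting with `query`;
    # the same key can appear more than once: "foo" matches 'o' via "oo" and "o".
    return [key for suffix, key in entries if suffix.startswith(query)]

entries = suffix_entries(["foo"])
print(lookup(entries, "o"))    # ['foo', 'foo'] — reported twice
print(lookup(entries, "foo"))  # ['foo'] — long queries match once
```

The real structure shares prefixes in a trie, but the number of matched suffix entries is the same, which is where the extra work for short queries comes from.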
I quickly looked into the implementation and have found a few knobs I can turn without too much work. However, I do not know in advance how big the improvements will be. There would also be the possibility of special-casing small values, or algorithmic/data-structure changes such as merging long leaves or maybe switching to a DAG...
Another idea is limiting the number of results, would that be sufficient for your use case?
I will look into it and keep you updated.
I quickly looked into the implementation and have found a few knobs I can turn without too much work.
Excited to see the results of this.
There would also be the possibility of special casing small values
I'm not sure how much this would help in my use-case as I have a lot of Japanese text that needs to be searched, which means thousands of unique characters. However, it might be a good idea overall, if people only have ASCII text.
Another idea is limiting the number of results, would that be sufficient for your use case?
Unfortunately not. My use-case requires me to search the data multiple times in different ways, concatenating the results from every call, and then ordering them at the end to be able to get the correct result.
Thanks for looking into it. I might be able to provide a full reproduction of my use-case if these improvements aren't enough.
Here are my results from looking into it:
While I got some improvements, they turned out to be only somewhat effective on memory consumption (~10%). There is more potential through pooling and so on, but not really runtime-wise.
After that experiment, I came to the conclusion that the main problem really is the theoretical runtime complexity, and no amount of tweaking will offer big perf improvements, as LINQ is also heavily optimized. LINQ is O(n), while the suffix trie will always include an O(occ) term, where occ is the number of results. Each entry of length m has m entries in the suffix trie, so occ can be much larger than n for small queries which hit a node with lots of children.
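A quick back-of-the-envelope illustration (made-up data, not from the library) of how occ can exceed n for short queries:

```python
# With a 1-character query, the number of matching suffix entries (occ) is the
# total number of occurrences of that character across all keys, which can
# easily exceed the element count n that a linear scan iterates over.
keys = ["banana", "bandana", "cabana"]
query = "a"

n = len(keys)                            # linear scan: O(n) -> 3 elements
occ = sum(k.count(query) for k in keys)  # suffix entries starting with 'a'

print(n, occ)  # 3 9
```

Here the index has three times as many matched entries to walk as the scan has elements, and the gap grows with key length.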
This means that there needs to be an algorithmic change one way or another. So here is what I imagine could be the solution:
I'm currently improving the library to support multiple keys for all kinds of indexes (maybe except the range index), which might in fact be a solution for you. Common techniques are to index a document by its words (or by n-grams, ...), and that feature will allow this. It means that occ gets much smaller, as the suffix trie will be much more shallow.
Results of the prototype:
Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|
Contains_Linq | 5,685.6 ns | 11.50 ns | 10.75 ns | 1.00 | 0.00 | 0.0229 | 408 B | 1.00 |
Contains_IndexedSet | 399.7 ns | 3.73 ns | 3.49 ns | 0.07 | 0.00 | 0.0124 | 208 B | 0.51 |
Contains_Linq_1Letter | 8,046.4 ns | 67.72 ns | 60.03 ns | 1.00 | 0.00 | 0.9460 | 15928 B | 1.00 |
Contains_IndexedSet_1Letter | 344,300.6 ns | 1,140.94 ns | 1,067.24 ns | 42.78 | 0.35 | 0.9766 | 18336 B | 1.15 |
Contains_IndexedSet_1Letter_Words | 51,951.9 ns | 113.15 ns | 105.84 ns | 6.46 | 0.05 | 1.0376 | 18336 B | 1.15 |
Contains_IndexedSet_1Letter_NGrams2 | 46,845.9 ns | 131.73 ns | 123.22 ns | 5.82 | 0.04 | 1.0376 | 18336 B | 1.15 |
Contains_IndexedSet_1Letter_NGrams4 | 123,799.1 ns | 164.19 ns | 145.55 ns | 15.39 | 0.11 | 0.9766 | 18336 B | 1.15 |
Contains_Linq_2Letter | 6,636.4 ns | 15.34 ns | 14.34 ns | 1.00 | 0.00 | 0.2441 | 4160 B | 1.00 |
Contains_IndexedSet_2Letter | 47,270.2 ns | 119.04 ns | 111.35 ns | 7.12 | 0.02 | 0.1831 | 3912 B | 0.94 |
Contains_IndexedSet_2Letter_Words | 5,222.8 ns | 24.31 ns | 20.30 ns | 0.79 | 0.00 | 0.2289 | 3912 B | 0.94 |
Contains_IndexedSet_2Letter_NGrams2 | 3,219.8 ns | 15.61 ns | 13.84 ns | 0.49 | 0.00 | 0.2327 | 3912 B | 0.94 |
Contains_IndexedSet_2Letter_NGrams4 | 9,997.8 ns | 10.01 ns | 9.37 ns | 1.51 | 0.00 | 0.2289 | 3912 B | 0.94 |
As you can see, the results are improved dramatically but still heavily depend on what you index and the query length. Do you think that would be a solution for you? I'd be willing to test that on your data if you want to provide a full reproduction.
However, if you have enough memory to throw around, a workaround (that works right now) can also be special-casing:

Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|
Contains_Linq_1Letter | 8,193.9 ns | 29.38 ns | 26.04 ns | 1.00 | 0.00 | 0.9460 | 15928 B | 1.00 |
Contains_IndexedSet_1Letter_SpecialCased | 622.7 ns | 2.15 ns | 1.80 ns | 0.08 | 0.00 | 0.1364 | 2288 B | 0.14 |
Contains_Linq_2Letter | 6,653.2 ns | 20.51 ns | 19.19 ns | 1.00 | 0.00 | 0.2441 | 4160 B | 1.00 |
Contains_IndexedSet_2Letter_SpecialCased | 160.0 ns | 0.42 ns | 0.33 ns | 0.02 | 0.00 | 0.0315 | 528 B | 0.13 |
This works like this:

```csharp
_indexedSet = _document.ToIndexedSet()
    .WithFullTextIndex(x => x.Content)
    // special case for a single char, O(1); z is for a different index name,
    // you might want to use a static function
    .WithIndex<char>(z => z.Content)
    .WithIndex(x => NGrams(x.Content, 2)) // special case for 2-grams, O(1)
    .Build();

IEnumerable<Document> Query(string query)
{
    return query switch
    {
        [char a] => _indexedSet.Where<char>(z => z.Content, a),
        string { Length: 2 } twoGram => _indexedSet.Where(x => NGrams(x.Content, 2), twoGram),
        string fullText => _indexedSet.Contains(x => x.Content, fullText)
    };
}

private static IEnumerable<string> NGrams(string text, int nGramSize)
{
    // <= so the last n-gram is included: a string of length m yields m - n + 1 n-grams
    for (int i = 0; i <= text.Length - nGramSize; i++)
    {
        yield return text.Substring(i, nGramSize);
    }
}
```
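For reference, a Python equivalent of the NGrams windowing (same semantics: every contiguous window of length n, including the last one):

```python
def ngrams(text, n):
    # there are len(text) - n + 1 windows of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
```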
Integrating that into the FullTextIndex might also be worth a try, but that would be orthogonal to the former improvement anyway and I'd yet have to think about a good heuristic or API for this...
I actually came across this library while looking for a C# equivalent of PostgreSQL's trigram indexes, so I think using ngrams is definitely the way to go for this. Unfortunately it looks like the improvements are not good enough for my use-case. I really need it to be faster than LINQ for 1 and 2 letter cases.
The special-casing approach looks promising, but I couldn't get your example to work because I need both Contains and StartsWith queries against the data, and it looks like `_indexedSet.StartsWith<char>(z => z.Content, a)` is not a thing?
I'll try to create a reproduction of my use-case, but it might take a while. Thanks again for looking into this.
I actually came across this library while looking for a C# equivalent of PostgreSQL's trigram indexes, so I think using ngrams is definitely the way to go for this. Unfortunately it looks like the improvements are not good enough for my use-case. I really need it to be faster than LINQ for 1 and 2 letter cases.
Good to hear that ngrams would be useful to you. That motivates me to keep working on the feature. I may also include some full-text helpers for NGrams and similar, though that would come in a second version...
The special-casing approach looks promising, but I couldn't get your example to work because I need both Contains and StartsWith queries against the data, and it looks like `_indexedSet.StartsWith<char>(z => z.Content, a)` is not a thing?
Yes, that is not supported by that kind of index (dictionary-based). Chaining a normal `Where` would work though, as the returned result set from the index should be much smaller than the entire set:

```csharp
_indexedSet.Where<char>(z => z.Content, a).Where(z => z.Content.StartsWith(a))
```
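The same "narrow with the index, then filter exactly" pattern, sketched in Python with a hypothetical char-keyed dictionary index (all names here are illustrative, not the library's API):

```python
from collections import defaultdict

docs = ["alpha", "beta", "banana", "abc"]

# Dictionary index: character -> documents containing it (a contains index).
index = defaultdict(list)
for d in docs:
    for ch in set(d):
        index[ch].append(d)

a = "b"
# The index lookup narrows the candidate set cheaply; the trailing filter
# restores the stricter StartsWith semantics the index alone cannot provide.
result = [d for d in index[a] if d.startswith(a)]
print(result)  # ['beta', 'banana']
```

This stays fast because the exact filter only runs over the (small) indexed candidate set, not the whole collection.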
I'll try to create a reproduction of my use-case, but it might take a while. Thanks again for looking into this.
Thanks, that might be useful for me, especially since I'm only used to Latin/Germanic languages...
I'll try to create a reproduction of my use-case
Here: https://github.com/mertvn/Akade.IndexedSet.Issue103
You may have to configure the path to the `mst.json` file for it to work. Beware that the data also includes some NSFW strings.
I just merged the "multi-key everything" PR, and with that (locally built version) I can get much better results on your data. It works by using a 3-gram approach for contains-queries and a normal prefix index. With your data, the 3-gram index is not ideal for prefix-queries, hence the separate prefix indices. It should be OK (though I have not measured it) memory-wise, as the Suffix Tries for the FullTextIndex should be much smaller with 3-grams.
Can you work with that? I hope that I can release a new version soon.
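The 3-gram contains-query idea can be sketched in Python as follows (assumed structure for illustration, not the actual FullTextIndex internals):

```python
from collections import defaultdict

def ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

docs = ["tokyo tower", "kyoto station", "osaka castle"]

# Index each document under every trigram it contains.
index = defaultdict(set)
for doc in docs:
    for g in ngrams(doc):
        index[g].add(doc)

def contains(query):
    # Gather candidates from each query trigram, then verify exactly:
    # trigram hits are only a superset of true substring matches.
    candidates = set()
    for g in ngrams(query):
        candidates |= index[g]
    return sorted(d for d in candidates if query in d)

print(contains("kyo"))  # ['kyoto station', 'tokyo tower']
```

Because the trie only stores short trigram keys instead of full suffixes, the occ term for short queries shrinks dramatically, which matches the benchmark improvements above.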
BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.22631.3155)
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
[Host] : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
Job=.NET 8.0 Runtime=.NET 8.0
Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|
Linq_Hiragana_1Character | 5,268.0 μs | 41.86 μs | 39.15 μs | 1.00 | 0.00 | - | - | 65.90 KB | 1.00 |
IndexedSet_Hiragana_1Character | 480.5 μs | 1.97 μs | 1.75 μs | 0.09 | 0.00 | 18.5547 | 0.9766 | 306.03 KB | 4.64 |
New_IndexedSet_Hiragana_1Character | 51.9 μs | 0.12 μs | 0.11 μs | 0.01 | 1.5869 | - | - | 26.34 KB | 0.40 |
Linq_Kanji_1Character | 5,173.9 μs | 13.25 μs | 11.06 μs | 1.00 | 0.00 | - | - | 17.99 KB | 1.00 |
IndexedSet_Kanji_1Character | 147.7 μs | 0.42 μs | 0.39 μs | 0.03 | 0.00 | 5.3711 | - | 91.50 KB | 5.09 |
New_IndexedSet_Kanji_1Character | 16.7 μs | 0.04 μs | 0.04 μs | 0.00 | 0.5188 | - | - | 8.75 KB | 0.49 |
Linq_Latin_1Character | 8,414.3 μs | 69.18 μs | 61.33 μs | 1.00 | 0.00 | 31.2500 | - | 1130.97 KB | 1.00 |
IndexedSet_Latin_1Character | 30,501.2 μs | 204.08 μs | 190.90 μs | 3.63 | 0.04 | 375.0000 | 62.5000 | 7806.23 KB | 6.90 |
New_IndexedSet_Latin_1Character | 393.6 μs | 0.49 μs | 0.43 μs | 0.05 | 4.3945 | - | - | 73.26 KB | 0.06 |
Linq_Latin_2Character | 5,620.1 μs | 55.46 μs | 51.88 μs | 1.00 | 0.00 | 15.6250 | - | 271.75 KB | 1.00 |
IndexedSet_2Character | 3,094.5 μs | 8.31 μs | 7.77 μs | 0.55 | 0.01 | 117.1875 | 19.5313 | 1942.75 KB | 7.15 |
New_IndexedSet_2Character | 28.1 μs | 0.07 μs | 0.06 μs | 0.01 | 0.5493 | - | - | 9.38 KB | 0.03 |
```diff
+ #pragma warning disable AkadeIndexedSetEXP0001
  public static class Autocomplete
  {
      [...]
+     public static IEnumerable<string> NGrams(string input, int n)
+     {
+         for (int i = 0; i < input.Length - n + 1; i++)
+         {
+             yield return input.Substring(i, n);
+         }
+     }

      public static void BuildIndexedSetMst(IEnumerable<AutocompleteMst> data)
      {
          SetMst = data.ToIndexedSet()
-             .WithFullTextIndex(x => x.MSTLatinTitle.NormalizeForAutocomplete())
-             .WithFullTextIndex(x => x.MSTNonLatinTitle.NormalizeForAutocomplete())
+             .WithPrefixIndex(x => x.MSTLatinTitle.NormalizeForAutocomplete())
+             .WithPrefixIndex(x => x.MSTNonLatinTitle.NormalizeForAutocomplete())
+             .WithFullTextIndex(x => NGrams(x.MSTLatinTitle.NormalizeForAutocomplete(), 3))
+             .WithFullTextIndex(x => NGrams(x.MSTNonLatinTitle.NormalizeForAutocomplete(), 3))
              .Build();
      }

      [...]

      public static IEnumerable<string> SearchAutocompleteMstIndexed(string arg)
      {
          arg = arg.NormalizeForAutocomplete();

          var startsWith1 = SetMst.StartsWith(x => x.MSTLatinTitle.NormalizeForAutocomplete(), arg);
          var startsWith2 = SetMst.StartsWith(x => x.MSTNonLatinTitle.NormalizeForAutocomplete(), arg);
          var startsWith = startsWith1.Concat(startsWith2).OrderBy(x => x.MSTLatinTitle).ToArray();

-         var contains1 = SetMst.Contains(x => x.MSTLatinTitle.NormalizeForAutocomplete(), arg);
-         var contains2 = SetMst.Contains(x => x.MSTNonLatinTitle.NormalizeForAutocomplete(), arg);
-         var contains = contains1.Concat(contains2).OrderBy(x => x.MSTLatinTitle).ToArray();
+         IEnumerable<AutocompleteMst> containResult = [];
+
+         foreach (string ngram in NGrams(arg, 3))
+         {
+             containResult = containResult.Concat(SetMst.Contains(x => NGrams(x.MSTLatinTitle.NormalizeForAutocomplete(), 3), ngram));
+             containResult = containResult.Concat(SetMst.Contains(x => NGrams(x.MSTNonLatinTitle.NormalizeForAutocomplete(), 3), ngram));
+         }
+
+         var contains = containResult.Distinct()
+             .Where(d => d.MSTLatinTitleNormalized.Contains(arg) || d.MSTNonLatinTitleNormalized.Contains(arg))
+             .OrderBy(x => x.MSTLatinTitle)
+             .ToArray();

          var startsWithLT = startsWith.Select(x => x.MSTLatinTitle);
          var startsWithNLT = startsWith.Select(x => x.MSTNonLatinTitle);
          [...]
```
Looks really really good. Thanks for spending so much time on this.
By the way, is there any way you can either make the index building async, or make the built set serializable/deserializable? I'm asking because I will be using this in a Blazor WebAssembly project, and there it takes roughly three times longer than normal (over 5 seconds on my PC) to build the index, while locking up the UI thread (i.e. freezing the user's browser tab completely), because multi-threading for Blazor WASM is still not implemented (it keeps being pushed back every release).
Happy to help.
In general, there is nothing that prevents you from indexing on a separate thread. There is even a concurrent version for multi-threaded scenarios. However, JS & WASM do not really do multi-threading (async I/O yes, but your code always runs on the same thread). The .NET effort to make it possible using web workers is still ongoing: https://github.com/dotnet/runtime/issues/68162. So, unfortunately, you are out of luck here.
Serialization is something to explore, but it will probably take a significant amount of work to make it worthwhile, i.e. for the potential speed increase to actually be worth the increased serialization payload size compared to serializing only the entities.
In general, though, not much of the optimization work so far has gone into making index creation faster... I'm sure there are improvements to be found there, and if there's demand for faster creation, I'm happy to look into that in the future.
However, I will finalize the "value-everything" version as a first step. Unfortunately, life happened in a very bad way and I'm not able to make any promises, only that I will get to it eventually. I will then close this issue and create a new issue for exploring serialization.
Understood. Hope everything is all right.
@mertvn I'm in the process of publishing the changes :) thanks for being patient & friendly, highly appreciated!
The version is marked as preview as I will evaluate if source link works and check if the "multi-key everything" feature does not break the clients I'm aware of (which I expect). Happy to hear feedback from you if the feature is an improvement for you.
Hi, I encountered this problem while trying to use this library for a real-world scenario, but it reproduces on your benchmarks as well, so I just used those. I tested it on commit 7da0898 and using .NET 7, but I doubt using main and .NET 8 would make a big difference. I'm not sure if this is a design limitation or a bug with the current implementation of the FullTextIndex, but I thought that I'd let you know in any case, as this issue prevents me from using this otherwise wonderful library.
Add to FullTextIndexBenchmarks.cs:
BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 10 (10.0.19045.2486/22H2/2022Update)
AMD Ryzen 5 4500U with Radeon Graphics, 1 CPU, 6 logical and 6 physical cores
.NET SDK 7.0.404
[Host] : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
.NET 7.0 : .NET 7.0.14 (7.0.1423.51910), X64 RyuJIT AVX2
Job=.NET 7.0 Runtime=.NET 7.0