apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0
2.24k stars 638 forks source link

FieldDoc.Fields boxing issue #429

Closed NightOwl888 closed 2 years ago

NightOwl888 commented 3 years ago

The Lucene.Net.Search.FieldDoc.Fields property returns object[]. However, according to the documentation, it is designed to be used with string, int, or float.

In Java, this was not an issue because there are numeric reference types that primitive numbers are wrapped in when passed around. But, since it would not be a good user experience to require wrapper classes for numeric types in .NET we should solve the boxing issue another way.

NightOwl888 commented 3 years ago

Possible Solution

I took a look at this and how changing Lucene.Net.Search.FieldDoc to be generic would affect other types, and have come up with a possible way we can eliminate boxing/unboxing, but it does require some breaking changes across public APIs.

FieldDoc // Make FieldDoc<T>
FieldDoc.Fields // Change from object[] to T[]
IndexSearcher.SearchAfter(ScoreDoc after, Query query, int n, Sort sort) // Change ScoreDoc to FieldDoc<T>
IndexSearcher.SearchAfter(ScoreDoc after, Query query, Filter filter, int n, Sort sort) // Change ScoreDoc to FieldDoc<T>
IndexSearcher.SearchAfter(ScoreDoc after, Query query, Filter filter, int n, Sort sort, bool doDocScores, bool doMaxScore) // Change ScoreDoc to FieldDoc<T>
IndexSearcher.Search(Weight weight, FieldDoc after, int nDocs, Sort sort, bool fillFields, bool doDocScores, bool doMaxScore) // Change FieldDoc to FieldDoc<T>
TopFieldCollector.Create(Sort sort, int numHits, FieldDoc after, bool fillFields, bool trackDocScores, bool trackMaxScore, bool docsScoredInOrder) // Change FieldDoc to FieldDoc<T>
TopFieldCollector.PagingFieldCollector // Change to PagingFieldCollerctor<T> (this class is private)
TopFieldCollector.PagingFieldCollector(FieldValueHitQueue<Entry> queue, FieldDoc after, int numHits, bool fillFields, bool trackDocScores, bool trackMaxScore) // ctor: Change FieldDoc to FieldDoc<T>
FieldComparer.SetTopValue(object value) // Change to SetTopValue<TValue>(TValue value)
FieldComparer<T>.SetTopValue(object value) // Change to SetTopValue(T value) and overload SetTopValue<T>(T value). 
ToParentBlockJoinFieldComparer : FieldComparer<object> // Rely on SetTopValue<T>(T value) rather than SetTopValue(object value)? The class cannot be made generic easily due to its usage in ToParentBlockJoinSortField.GetComparer(int numHits, int sortPos).

NOTE: There are also several subclasses of FieldComparer<T> that can remove the casting, especially in cases where it is currently object to numeric type.

FieldComparer<T> was only generic in Lucene, but had to be widened in C# because there is no common type between different closing types in C# (FieldComparer<long> is a different type than FieldComparer<float> and cannot be put into a strongly typed array together). So, another abstraction was added, FieldComparer to be able to pass the type without knowing the closing type.

The FieldComparer<T> class would define a default implementation of SetTopValue<TValue>(TValue value) by casting from TValue to T using an intermediate cast to object, but can be overridden to eliminate boxing in subclasses where boxing applies.

public abstract SetTopValue(T value);

public override SetTopValue<TValue>(TValue value) // Added for .NET to eliminate boxing
{
    SetTopValue((T)(object)value);
}

This is a little weird because the base type uses a generic method and the subclass is a generic closing type, but given the leaky nature of generics and the fact that some types don't have a reasonable way to pass the generic without also becoming generic this seems like a reasonable compromise. I am open to suggestions if someone can come up with something that seems more natural than this.

NOTE: FieldComparer.CompareValues(object first, object second) also casts the generic type in the generic overload (the object overload was added in .NET because there is no common type between generic closing types in C#). Perhaps it would be better to make it FieldComparer.CompareValues<TValue>(TValue first, TValue second) on the base class as well. This leaks into Lucene.Net.Grouping types, ISearchGroup.SortValues and MergedGroup.TopValues are also declared as type object[] and may contain numeric types.

I am mentioning this here because it is another method using the same generic type on FieldComparer.

NightOwl888 commented 3 years ago

A potential problem with the above approach is that it is difficult to cast TValue to a value type, such as float. There are some potential solutions we could use here: https://stackoverflow.com/questions/3343551/how-to-cast-a-value-of-generic-type-t-to-double-without-boxing. Since the whole point of these changes is to eliminate boxing, it is probably worth sticking one of those solutions (whichever benchmarks fastest) in the Support/Util folder so we can use it throughout the solution for all of the subclasses of FieldComparer<T>.

NightOwl888 commented 3 years ago

So, @rclabo and I have been analyzing this in more detail and it is clear that FieldDoc.Fields cannot be made generic because it contains a mixed bag of field types.

As an alternative, some design proposals to FieldComparer<T> have been benchmarked in BenchmarkFieldComparer to see what the impact of such changes might be.

Benchmarks

Cast to Number

Click to expand! ``` ini BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041 Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET Core SDK=5.0.104 [Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT Job-TNEQUD : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT InvocationCount=1 UnrollFactor=1 ``` | Method | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated | |---------------------------------------------- |---------:|---------:|---------:|---------:|------------:|------:|------:|--------------:| | CastComparer_PureInt32 | 207.1 ms | 4.13 ms | 9.65 ms | 211.2 ms | - | - | - | 3.93 KB | | CastComparer_PureFloat | 210.7 ms | 4.18 ms | 11.02 ms | 213.2 ms | - | - | - | 3.93 KB | | ValueType_CastToNumber_Int32 | 717.3 ms | 14.15 ms | 31.35 ms | 715.8 ms | 510000.0000 | - | - | 2343753.93 KB | | ValueType_CastToNumber_Float | 770.8 ms | 15.25 ms | 22.82 ms | 772.6 ms | 510000.0000 | - | - | 2343753.93 KB | | ReferenceType_CastToNumber_Int32_CastComparer | 309.5 ms | 6.18 ms | 14.93 ms | 310.2 ms | - | - | - | 3.93 KB | | ReferenceType_CastToNumber_Float_CastComparer | 360.4 ms | 7.13 ms | 19.65 ms | 364.4 ms | - | - | - | 3.93 KB | | ReferenceType_CastToNumber_Int32_IConvertible | 631.2 ms | 12.03 ms | 23.46 ms | 627.7 ms | - | - | - | 3.93 KB | | ReferenceType_CastToNumber_Float_IConvertible | 764.5 ms | 15.23 ms | 38.20 ms | 768.6 ms | - | - | - | 3.93 KB |

Copy

Click to expand! ``` ini BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041 Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET Core SDK=5.0.104 [Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT Job-CBNWUR : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT InvocationCount=1 UnrollFactor=1 ``` | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |------------------- |---------:|--------:|--------:|------:|------:|------:|----------:| | ValueType_Copy | 104.3 ms | 2.51 ms | 7.37 ms | - | - | - | - | | ReferenceType_Copy | 136.1 ms | 2.70 ms | 6.68 ms | - | - | - | - |

Fill Fields

Click to expand! ``` ini BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041 Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET Core SDK=5.0.104 [Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT Job-TNEQUD : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT InvocationCount=1 UnrollFactor=1 ``` | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |------------------------- |---------:|--------:|---------:|------:|------:|------:|----------:| | ValueType_FillFields | 253.1 us | 9.99 us | 27.34 us | - | - | - | 480480 B | | ReferenceType_FillFields | 175.8 us | 3.31 us | 3.68 us | - | - | - | 480 B |

Set Top Value

Click to expand! ``` ini BenchmarkDotNet=v0.12.0, OS=Windows 10.0.19041 Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET Core SDK=5.0.104 [Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT Job-ZELLKO : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT InvocationCount=1 UnrollFactor=1 ``` | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |-------------------------- |---------:|----------:|----------:|------:|------:|------:|----------:| | ValueType_SetTopValue | 1.391 ms | 0.0969 ms | 0.2652 ms | - | - | - | - | | ReferenceType_SetTopValue | 2.474 ms | 0.0602 ms | 0.1765 ms | - | - | - | - |

Clearly, allocations used in the FieldValueHitQueue<T>.FillFields() method and when moving the numeric data out of the fields can be dramatically reduced.

The leading contender, BoxingWrappedReferenceScenario takes us a step back toward the Lucene design and although the details are not worked out, here are some of the basics:

  1. FieldComparer<T> gets a generic constraint to disallow value types.
  2. NumericComparer<T> is changed to NumericComparer<TValue, TWrapper>. TWrapper is the type that is used in its base class FieldComparer<T> and TValue is the value type that is exposed on the public API.
  3. A series of reference type wrappers are made for each numeric type similar to the Numbers Classes in Java. However, unlike our previous one-off approach to building wrappers whenever reference types are required these will be fully implemented with full tests, and put into J2N.
    • These classes will have all of the same interfaces that numeric types have in .NET including IConverible and IComparable<T>.
    • To reduce the number of breaking changes introduced to users, they will also use the implicit operator to convert the wrapped value to the numeric type without a cast.
  4. The number class references are used as the TWrapper type in NumericComparer<TValue, TWrapper>.
  5. Other related systems, such as the field cache may also be refactored to both enforce the use of reference types and to allow the system to pass the numbers class instances all the way through from their creation in Field to the FillFields() method without having to unwrap and rewrap the numeric value in a reference type.

Of course, none of this is set in stone, but it is clear that fixing this problem will likely involve breaking some public APIs at least a little so I am moving this to the 4.8.0-beta00015 milestone.