apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.66k stars 3.56k forks source link

[C#] Inconsistent String Comparisons in Schema Field Retrieval Methods #44650

Open vthemelis opened 3 weeks ago

vthemelis commented 3 weeks ago

In the Schema class, there are three methods used to retrieve individual fields:

public class Schema
{
    // ...

    public Field GetFieldByName(string name) => FieldsLookup[name].FirstOrDefault();

    public int GetFieldIndex(string name, StringComparer comparer)
    {
        IEqualityComparer<string> equalityComparer = (IEqualityComparer<string>)comparer;
        return GetFieldIndex(name, equalityComparer);
    }

    public int GetFieldIndex(string name, IEqualityComparer<string> comparer = default)
    {
        // Implementation using the specified comparer
    }
}

The first method (GetFieldByName) use the default string comparer, while the latter two (GetFieldIndex overloads) are using a culture aware comparator by default (StringComparer.CurrentCulture). This inconsistency can be confusing, as it is unclear when custom comparers are utilized.

Component(s)

C#

CurtHagenlocher commented 1 week ago

I think the best thing to do here might be something like

public static IEqualityComparer<string> DefaultFieldNameComparer = StringComparer.Ordinal;

and then change the instances of StringComparer.CurrentCulture to be DefaultFieldNameComparer. This will give a better default while still letting users revert to the previous behavior globally simply by assigning a different value to the static field.

vthemelis commented 1 week ago

I think the best thing to do here might be something like


public static IEqualityComparer<string> DefaultFieldNameComparer = StringComparer.Ordinal;

and then change the instances of StringComparer.CurrentCulture to be DefaultFieldNameComparer. This will give a better default while still letting users revert to the previous behavior globally simply by assigning a different value to the static field.

Sounds like a good idea but it could invalidate existing lookups or necessitate reindexing lookups on reassignment of its value. What are your thoughts on this?

adamreeve commented 1 week ago

It seems reasonable to expect that users should only set this once on their application startup, before any indexes are created. Maybe a comment warning about the problems that misuse could cause in a public documentation comment would be worthwhile.