dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.63k forks source link

String comparer for sorting numeric strings logically #13979

Open Peter-Juhasz opened 9 years ago

Peter-Juhasz commented 9 years ago

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        NumericOrdering = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

Open Questions

Updates

terrajobst commented 9 years ago

That's a good suggestion. Would you be willing to make an API proposal?

MgSam commented 9 years ago

I think conceptually this is a good idea, but there are a lot of details that need to be worked out. It seems like such a comparer might want to take options which control how it behaves. For example, are commas in numbers ok? Spaces? Decimal points? How are different cultures handled?

sharwell commented 9 years ago

While it's easy to define the behavior on Windows systems as following StrCmpLogicalW, this could be problematic in the long run.

  1. The precise behavior of StrCmpLogicalW is not only undocumented, but it explicitly states that it can change in future releases of Windows.
  2. The behavior of this method is not case-sensitive, and there is no case-sensitive method which is otherwise equivalent. It would be strange, for example, to provide StringComparer.LogicalIgnoreCase without providing StringComparer.Logical.

I believe you could define a reasonable API proposal in terms of a lexicographical ordering of substrings divided at numeric boundaries. Sequences of non-digits would be compared using StringComparer.CurrentCulture or StringComparer.CurrentCultureIgnoreCase, and sequences of digits would be compared according to their numeric values.

Peter-Juhasz commented 9 years ago

The comparer should require an IFormatProvider or NumberFormatInfo; which can be passed in the constructor or it should use the current one from the owning thread.

@terrajobst This is a problem I am facing in almost every business application: proper ordering even if the strings contain numbers. I don't have time to make a PR or document it well, but I would be very happy to see it in the new releases of the core framework. I think the use case is clear enough.

eatdrinksleepcode commented 9 years ago

This is a need that I have often faced. If @Peter-Juhasz does not have the time to create the API proposal or implement it, I would be interested in taking it on.

sharwell commented 9 years ago

@Peter-Juhasz There is probably no need to consider the NumberFormatInfo. As stated in the documentation for NumberFormatInfo.NativeDigits:

The character set that is specified by the NativeDigits property has no effect on parsing or formatting operations. Only the Basic Latin digits 0 (U+0030) through 9 (U+0039) are used when formatting or parsing numeric values or date and time values.

If StringComparer.Logical and StringComparer.LogicalIgnoreCase were defined to use StringComparer.CurrentCulture and StringComparer.CurrentCultureIgnoreCase for comparing non-digits, the culture used for the comparison is clear.

On a side note, even if used there is no need to explicitly provide the NumberFormatInfo to the comparer since the current culture information already provides it.

tarekgh commented 8 years ago

Looks in Windows, CompareStringEx can be used with the flag SORT_DIGITSASNUMBERS. I didn't look at ICU if it support such functionality too.

krwq commented 7 years ago

I believe there is few things which need to be worked on here:

I think this would be a nice addition but it needs some work.

tarekgh commented 7 years ago

should we provide all permutations for all comparers now or perhaps generate comparer based on some settings (perhaps we already have something like that which we might be able to inject this log in and I might not be aware of it)

I don't think we need to do that. it would be just option (or flag) passed with the CompareOptions and let the underlying OS handle it.

I think this would be a nice addition but it needs some work.

I agree with that

jnm2 commented 7 years ago

I'd see value in both an OS-dependent sort for cases where you want to match the OS (file listing, for example), and an OS-independent .NET sort that natively understands both numbers and dates. A real life example is needing month names to be correctly sorted, even when that's the only date-related part of the string.

The OS-dependent sort should be super simple to implement. After that's done I think it would still be worth putting design time in towards an independent managed sort as well, to figure out if and how people's needs differ.

TylerBrinkley commented 7 years ago

Rationale

For sorting purposes it's common to need portions of strings containing numbers to be treated like numbers. Consider the list of strings "Windows 7", "Windows 10".

Using the Ordinal StringComparer to sort the list one would get

Windows 10
Windows 7

but the desired ascending logical sort would be

Windows 7
Windows 10

Proposed API

 namespace System {
     public class StringComparer {
+        public static StringComparer Create(CultureInfo culture, CompareOptions options);
+        public static StringComparer Logical { get; }
+        public static StringComparer LogicalIgnoreCase { get; }
     }
 }
 namespace System.Globalization {
     public enum CompareOptions {
+        Logical = 0x00000020
     }
 }

Usage

var list = new List<string> { "Windows 10", "Windows 7" };
list.Sort(StringComparer.Logical); // List is now "Windows 7", "Windows 10"

This would also be good for sorting strings containing IP addresses.

Details

Open Questions

Updates

tarekgh commented 7 years ago

we are limited to what the underlying OS can give us as functionality support, this need to be investigated if both Windows and Linux can provide this functionality before we proceed here.

TylerBrinkley commented 7 years ago

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

tarekgh commented 7 years ago

I don't think this should necessarily follow the Windows functionality but be entirely implemented in .NET.

This is not easy to do especially we don't carry the needed collation tables needed to do this.

if it is just ASCII characters that is possible. if you have more complicated scenario that will be challenging except if you go and parse the string split them. this will not be nice performance wise.

in short, Linguistic comparisons will be challenging.

TylerBrinkley commented 7 years ago

You're probably right about the performance. I'm currently using an implementation that splits the string which works for me but I understand it may not be quick enough for the framework due to the string allocations.

tarekgh commented 7 years ago

it is not regarding the string allocations. we can pin the strings and avoid the allocations. it is about parsing the string and then compare each part and then handle the digits comparisons. note that, such functionality has to support all digits for all languages too and not just 0~9.

TylerBrinkley commented 7 years ago

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages.

After looking at my implementation I think char.IsDigit would work just fine.

jnm2 commented 7 years ago

For some scenarios I'd prefer an OS-dependent logical sort, for others an invariant logical sort. I'd really love something like this:

var comparer = StringComparer.TryGetOSLogical()
    ?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I want to match the feel of the native OS if possible.

tarekgh commented 7 years ago

Ugh, supporting digits for all languages would be a pain as you'd have to go by the NativeDigits property on NumberFormatInfo. I suppose a fast path could be added for 0-9 and a slow path for other languages. After looking at my implementation I think char.IsDigit would work just fine.

We don't have to use NumberFormatInfo.NativeDigits because this will make it work with only specific culture while digits in all languages should just work regardless of the chosen culture because digits doesn't have linguistic context (in most cases).

char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.

I am not trying to push back here but I am just explaining the complexity we'll have if we need to do it without OS help.

var comparer = StringComparer.TryGetOSLogical() ?? StringComparer.CreateLogical(StringComparer.CurrentCulture);

I hope StringComparer.TryGetOSLogical() exist and work all the time but the logic to fallback will create some other issues regarding the difference in the behavior if OS support it and if not. This may be OK but we have to understand that.

TylerBrinkley commented 7 years ago

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

tarekgh commented 7 years ago

I understand and thank you for going through the complexities with me. In my implementation I TryParse the substring number passing in the NumberFormatInfo, why would that not work? Performance?

TylerBrinkley commented 7 years ago

Int32.TryParse with NumberFormatInfo will not parse native digits.

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

tarekgh commented 7 years ago

If Int32.TryParse doesn't support native digits, why should this? As far as I can tell even the native StrCmpLogicalW doesn't support native digits.

Because the functionality we are talking about is regarding the collation which support all native digits. Also there is some requests we need to support native digits in the formatting and parsing APIs but we didn't get into it yet.

In short, if we support only 0~9 digits, sooner or later we'll get request to support the native digits. so it will be good to have a good plan in our support before we jump doing it.

joperezr commented 7 years ago

@tarekgh do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?

tarekgh commented 7 years ago

do you think it's valuable to mark this as Api ready for review and then if it is approved then try to sort out implementation details?

No, we don't have a good way to implement this at least on Windows today. without figuring out the implementation story we cannot proceed with the API.

sharwell commented 7 years ago

@tarekgh

...we don't have a good way to implement this at least on Windows today...

I'm not sure I agree with this completely. It should be possible to implement a comparer which uses StringComparer.CurrentCulture for substrings that do not contain the code points [0-9], and a numeric comparison based on int.Parse for sequences containing only those code points.

The API could be defined to simply be two new properties of the StringComparer class.

Since neither of these properties is an invariant string comparer, it is allowable that future updates make changes to accommodate various cultures, including but not limited to changes in the way native digits are handled.

tarekgh commented 7 years ago

for substrings that do not contain the code points [0-9], and a numeric comparison based on int.Parse for sequences containing only those code points.

This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue. other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive. Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?

TylerBrinkley commented 7 years ago

char.IsDigit is not helpful either because you need to know the value of the digit and not just check if it is digit to be able to support the needed functionality.

@tarekgh I've updated my proposal to specify the use of the Char.GetNumericValue method to retrieve its value. I've also added @sharwell 's Logical and LogicalIgnoreCase properties.

tarekgh commented 7 years ago

I think for the API interface we can do the following:

The benefit here is this will enable string comparison APIs too and not just the string comparer.

sharwell commented 7 years ago

This will not work with non Latin digits. if you use other numerals from other languages that will not work. please read the discussion in this issue.

Yes, this behavior is intentional.

other concern from the proposed implementation is the performance. splitting the string and then comparing parts will be expensive.

There is no need to split the string. You can identify the numeric portions easily, with will also reveal the sequences of non-digits. The sequences of non-digits can be compared using CultureInfo.CurrentCulture.CompareInfo.Compare, which can operate on explicit substrings without [external] allocation.

Also should we consider cases like (1,0000, 1.1, ...etc.)? did you think about that?

Yes, the , and . elements seen here do not fall in the character range [0-9]. The comparison would treat 1,0000 as three sequences: 1 (numeric), , (non-numeric), 0000 (numeric). The string 1.1 would likewise be treated as three sequences. In my proposal, the current culture has no impact on the treatment of any individual code point as a digit or non-digit. Note that the treatment of "fractional digits" in this manner (ignoring them) is consistent with the way Windows already performs logical sorting in Explorer (for example).

My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.

TylerBrinkley commented 7 years ago

If we only use the CompareOptions enum then we'll be limited to only supporting the base integer case without sign, decimal, or thousands support. Also, it would be far less visible and hardly anyone would use it that would otherwise.

With my proposal the user can specify whether to AllowLeadingSign, AllowTrailingSign, AllowDecimalPoint, and AllowThousands using the NumberStyles flag enum parameter and also what strings to use for the NegativeSign, PositiveSign, NumberDecimalSeparator, and NumberGroupSeparator using the NumberFormatInfo parameter. And being listed in the StringComparer class makes it very easy to discover.

tarekgh commented 7 years ago

@sharwell

Yes, this behavior is intentional.

why you think this is good?

There is no need to split the string.

I didn't mean split the string. I meant you need to search the string for the numbers first and then compare the part before the number then compare the number and then repeat the process. that what I meant will be expensive.

My goal is to provide a clean and immediately usable API without restricting its ability to meet needs in the future. I do not believe it needs to be complicated in order to achieve this outcome.

I understand but in same time we need to get the right design to serve us in the future if we need to expand the functionality.

tarekgh commented 7 years ago

@TylerBrinkley please move the proposal to the top of the issue.

@joperezr we can mark this as ready for review according to the latest discussion.

tarekgh commented 7 years ago

One last note, for the

        public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);

I think this will be a little bit confusing as we can can create StringComparer for specific culture that match the desired number format. I know that could have have a comparer for specific culture and in same time you want to treat the numbers from other culture but that is unlikely and it would be good just not include it till we find a real need for it. other than that the proposal looks reasonable to me

TylerBrinkley commented 7 years ago

How do you recommend we get the NumberFormatInfo though? For a CultureAwareComparer we would need to add another field initialized from the constructor and for non-CultureAwareComparer's we would need to use a default such as CurrentInfo or else we would have to add a NumberFormatInfo or IFormatProvider property to StringComparer which seems wrong.

Also since there is already a Create(CultureInfo culture, bool ignoreCase) method, should we also add overloads for those parameters, e.g. CreateLogical(CultureInfo culture, bool ignoreCase) and CreateLogical(CultureInfo culture, bool ignoreCase, NumberStyles numberStyle)?

TylerBrinkley commented 7 years ago

I've updated the proposal to retrieve the NumberFormatInfo if not explicitly provided from CultureAwareComparer's and for other comparers a default of NumberFormatInfo.CurrentInfo will be used.

I've also added CreateLogical overloads for the parameters included in the Create(CultureInfo culture, bool ignoreCase) method.

@TylerBrinkley please move the proposal to the top of the issue.

I don't have access to do that.

tarekgh commented 7 years ago

How do you recommend we get the NumberFormatInfo though

While creating the string comparer we'll store the corresponding NFI object. for example, if you are creating CurrentCulture comparer we'll get NFI from current culture and use it. this will apply on culture aware comparer. so we should be find here without passing NFI to the constructors of String comparers.

CreateLogical(CultureInfo culture, bool ignoreCase)

please don't add any overloads to this one as we have introduced a new signature that is taking CompareOptions instead of ignoreCase bool. we should advise people to use this one instead.

@TylerBrinkley please update the proposal accordingly and I'll move it to the top after you do that. thanks.

TylerBrinkley commented 7 years ago

So do you want me to remove those overloads and add ones with CompareOptions instead of the ignoreCase bool or just remove those overloads?

tarekgh commented 7 years ago

I suggest to have only

        public static StringComparer CreateLogical(StringComparer stringComparer);
        public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);
        public static StringComparer Logical { get; }
        public static StringComparer LogicalIgnoreCase { get; }

The signature taking StringComparer (the first and second one) is covering the scenario of culture aware comparer which can be created using ignoreCase or CompareOptions. Also I removed the one which is taking NFI.

We still have the freedom in the future to add more signatures if proven we really need them.

What you think?

tarekgh commented 7 years ago

Also, do you think naming Logical is good here? I understand Windows API StrCmpLogicalW called logical but I am not sure if this is really easy to understand what it means. anyway, we may get more feedback from the design reviewer as needed if we don't come up with better naming now.

TylerBrinkley commented 7 years ago

While that's acceptable I don't necessarily like that a user can't use an Ordinal StringComparer with an NFI that's not the current cultures.

Also, do you think naming Logical is good here?

I can't think of anything better but naming it something different is definitely worth considering.

tarekgh commented 7 years ago

While that's acceptable I don't necessarily like that a user can't use an Ordinal StringComparer with a culture's NFI besides the current.

This still can be achieved by create culture aware comparer with CompareOptions.Ordinal and use the first or second signature.

TylerBrinkley commented 7 years ago

What's the API to create a culture aware comparer with CompareOptions.Ordinal? Is that not implemented yet?

From a brief search Natural is the only other reasonable name I could find.

tarekgh commented 7 years ago

What's the API to create a culture aware comparer with CompareOptions.Ordinal? Is that not implemented yet?

We have it here: https://github.com/dotnet/corefx/blob/master/src/System.Globalization.Extensions/src/System/Globalization/Extensions.cs#L13

But I can see it will complicate our implementation to extract the NFI. So I am OK to return back the signatures that taking NFI.

Also we need to define the behavior when someone using public static StringComparer CreateLogical(StringComparer stringComparer); public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);

We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.

        public static StringComparer CreateLogical(StringComparer stringComparer);
        public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle);
        public static StringComparer CreateLogical(StringComparer stringComparer, NumberFormatInfo nfi);
        public static StringComparer CreateLogical(StringComparer stringComparer, , NumberFormatInfo nfi, NumberStyles numberStyle);
        public static StringComparer Logical { get; }
        public static StringComparer LogicalIgnoreCase { get; }
jnm2 commented 7 years ago

I think StringComparer.Logical has the correct connotations. You could also try for StringComparer.Intelligent or StringComparer.Smart.

tarekgh commented 7 years ago

And we'll use Invariant.NumberFormatInfo when the caller not passing NFI to us

tarekgh commented 7 years ago

I think Logical has the correct connotations. You could also try for Intelligent.

let's stick with Logical for now

jnm2 commented 7 years ago

What, you don't like StringComparer.BestestBetterness? 😆

TylerBrinkley commented 7 years ago

We may be able to construct the NFI from the comparer if it is one of framework comparer but we'll not be able to construct it if the caller used custom comparer. so we may need to add overloads taking NFI for the methods taking StringComparer.

NumberFormatInfo is only needed when using NumberStyles because the default value is specified as None, meaning we don't need to know what the signs, group separators, or decimal separators are. Thus we only need the already proposed method.

public static StringComparer CreateLogical(StringComparer stringComparer, NumberStyles numberStyle, NumberFormatInfo numberFormatInfo);

I've removed the CreateLogical overloads that match the Create method's parameters.

tarekgh commented 7 years ago

NumberFormatInfo is only needed when using NumberStyles because the default value is specified as None,

NFI is needed to parse the numbers in the strings. right? how you know the decimal point, thousand separator...etc. when parsing the number?