dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Allow Leading Unicode Whitespace in NumberStyles #86834

Open Megasware128 opened 1 year ago

Megasware128 commented 1 year ago

Background and motivation

Currently, the AllowLeadingWhite and AllowTrailingWhite flags of the NumberStyles enum accept only a subset of Unicode white-space characters (U+0009 through U+000D, and U+0020). Numbers prefixed by other Unicode white-space characters, such as thin space (U+2009), therefore fail to parse. This issue was first discussed in #86641. This proposal adds a NumberStyles flag that accepts any leading Unicode white-space character, making it easier to parse numbers from varied text formats.
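The gap can be reproduced with a short check. This is a minimal sketch that assumes the documented behavior, in which AllowLeadingWhite accepts only U+0009 through U+000D and U+0020; the variable names are illustrative.

```csharp
using System;
using System.Globalization;

// Thin space (U+2009) counts as white space for char.IsWhiteSpace ...
string input = "\u2009" + "42";
bool isWhiteSpace = char.IsWhiteSpace(input[0]);

// ... but NumberStyles.Integer (which includes AllowLeadingWhite) is expected to reject it.
bool parsedThin = int.TryParse(input, NumberStyles.Integer, CultureInfo.InvariantCulture, out _);

// An ASCII space in the same position is accepted.
bool parsedAscii = int.TryParse(" 42", NumberStyles.Integer, CultureInfo.InvariantCulture, out int value);

Console.WriteLine($"IsWhiteSpace: {isWhiteSpace}, thin space parsed: {parsedThin}, ASCII space parsed: {parsedAscii}");
```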

API Proposal

namespace System.Globalization
{  
    [Flags]
    public enum NumberStyles
    {  
        AllowLeadingWhite = 1,
        AllowTrailingWhite = 2,
        // Proposed addition:
        AllowLeadingUnicodeWhitespace = 2048, // The value can be decided during implementation
        // Rest of the enum values...
    }  
}

API Usage

string numberString = "\u2009" + "42"; // Thin space (U+2009) followed by the number "42"
int result;
if (int.TryParse(numberString, NumberStyles.Integer | NumberStyles.AllowLeadingUnicodeWhitespace, CultureInfo.InvariantCulture, out result))
{
    Console.WriteLine(result);  // Outputs: 42
}
else
{
    Console.WriteLine("Failed to parse numberString.");
}

Alternative Designs

An alternative approach could be to update the AllowLeadingWhite and AllowTrailingWhite flags to include all Unicode white-space characters, but this would be a significant behavioral change and may not be backward compatible.

Risks

The main risk involves potential breaking changes for existing applications relying on the current AllowLeadingWhite and AllowTrailingWhite behaviors. It's also important to consider the performance implications of expanding the range of characters that need to be checked during parsing operations.
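To get a feel for how many extra characters a wider check would cover, the sketch below compares char.IsWhiteSpace against a narrow predicate. IsNarrowWhite is a hypothetical helper written for this illustration; it mirrors the documented AllowLeadingWhite set and is not the runtime's actual code.

```csharp
using System;

// Hypothetical narrow check: U+0009..U+000D and U+0020 (the documented AllowLeadingWhite set).
static bool IsNarrowWhite(char c) => c == '\u0020' || (c >= '\u0009' && c <= '\u000D');

// List every UTF-16 code unit that char.IsWhiteSpace accepts but the narrow check rejects.
int extra = 0;
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    char c = (char)i;
    if (char.IsWhiteSpace(c) && !IsNarrowWhite(c))
    {
        Console.WriteLine($"U+{i:X4}");
        extra++;
    }
}
Console.WriteLine($"{extra} additional characters");
```

The narrow check is a single comparison plus a range test, so any wider check would need to stay comparably cheap on the common ASCII path.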

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.

Issue Details
Author: Megasware128
Assignees: -
Labels: `api-suggestion`, `area-System.Numerics`, `untriaged`
Milestone: -

huoyaoyuan commented 1 year ago

A workaround is numberString.AsSpan().Trim() (or TrimStart/TrimEnd), which also strips Unicode white-space characters.
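A minimal sketch of that workaround, reusing the thin-space input from the API usage example above. It trims the span before parsing, so no new flag is needed; it only helps when the white space sits at the very start or end of the input.

```csharp
using System;
using System.Globalization;

string numberString = "\u2009" + "42"; // Thin space (U+2009) followed by "42"

// Trim() on a ReadOnlySpan<char> removes leading and trailing white-space characters without allocating.
ReadOnlySpan<char> trimmed = numberString.AsSpan().Trim();

bool ok = int.TryParse(trimmed, NumberStyles.Integer, CultureInfo.InvariantCulture, out int result);
Console.WriteLine(ok ? result.ToString() : "Failed to parse numberString."); // Outputs: 42
```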

Megasware128 commented 1 year ago

@huoyaoyuan I noticed the issue while parsing a currency. Trimming wouldn't have worked because it had a currency symbol in front of the whitespace. My workaround was this:

using System;
using System.Globalization;

private static double ParsePrice(string price)
{
    // Stack buffer; assumes price is short (a very long input could overflow the stack)
    Span<char> chars = stackalloc char[price.Length];

    // Replace every Unicode white-space character with an ASCII space (U+0020)
    for (var i = 0; i < price.Length; i++)
    {
        var c = price[i];

        chars[i] = char.IsWhiteSpace(c) ? ' ' : c;
    }

    return double.Parse(chars, NumberStyles.Currency, new CultureInfo("nl-NL"));
}

danmoseley commented 1 year ago

currency symbol in front of the whitespace

Then the whitespace isn't leading - would your proposed API have helped?

Megasware128 commented 1 year ago

Yes @danmoseley, the proposed AllowLeadingUnicodeWhitespace would still have helped. The flag is intended to behave like the existing AllowLeadingWhite, but with a wider set of recognized Unicode white-space characters. Combined with AllowCurrencySymbol, parsing would therefore still succeed when the currency symbol comes first and the Unicode whitespace follows it. The key point is to widen the set of recognized white-space characters wherever AllowLeadingWhite applies, not only when they are the very first characters of the input.

However, as discussed in the API proposal, there are alternative designs. One is to update the AllowLeadingWhite and AllowTrailingWhite flags to accept all Unicode white-space characters, but this could be a significant behavioral change and may not be backward compatible. The AllowLeadingUnicodeWhitespace proposal aims for an opt-in solution that can be combined with the existing flags without changing their behavior.