Add more string manipulation functions to the String module [ RFC FS-1033 ]

baronfel commented 7 years ago

Submitted by TheInnerLight on 5/30/2016 12:00:00 AM
53 votes on UserVoice prior to migration

The Core.String module does not provide nearly enough features at present. Too often we have to revert to using the standard .NET string class, which both hinders tidy piping and stops us taking advantage of curried args / partial application. I suggest that at least the following functions be added to the String module:

empty : string
isEmpty : string -> bool
isWhitespace : string -> bool
replace : string -> string -> string -> string
startsWith/endsWith : string -> bool
split : seq<char> -> string -> seq<string>
toUpper/toLower(Invariant) : string -> string
trim : string -> string
trimStart/trimEnd : string -> string

Obviously all of this can easily be achieved by writing simple wrappers around the methods in the .NET string class, but if F# is going to have a String module, it ought to be a fully featured one.
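A minimal sketch of the kind of wrappers described above, to show how they enable piping and partial application. The module name `StringExt` is hypothetical (chosen so it does not clash with FSharp.Core's own String module), and the argument order, with the input string last, is an assumption rather than anything settled in this thread. Treating null or empty input as whitespace in `isWhitespace` is also an assumption.

```fsharp
open System

// Hypothetical module name; not part of FSharp.Core.
[<RequireQualifiedAccess>]
module StringExt =
    // True for null or "".
    let isEmpty (s: string) = String.IsNullOrEmpty s

    // True for null, "" or whitespace-only input (assumed semantics).
    let isWhitespace (s: string) = String.IsNullOrWhiteSpace s

    // Input string goes last so the function can be partially applied in a pipeline.
    let replace (oldValue: string) (newValue: string) (s: string) =
        s.Replace(oldValue, newValue)

    let trim (s: string) = s.Trim()

    let toUpperInvariant (s: string) = s.ToUpperInvariant()

// Piping through the wrappers instead of chaining instance methods.
let shout =
    "  hello world  "
    |> StringExt.trim
    |> StringExt.replace "world" "F#"
    |> StringExt.toUpperInvariant

printfn "%s" shout // prints "HELLO F#"
```

Because each wrapper is curried, `StringExt.replace "world" "F#"` is itself a `string -> string` function that can be passed to `List.map` and friends.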


tomcl commented 7 years ago

This is a glaring gap in the language, and workarounds are not nice. These functions are used all the time and should be standard.

gusty commented 7 years ago

A common problem when translating these functions to an F#-friendly version is that they have many overloads with different options, typically whether the comparison is case-sensitive, which culture applies, and so on. The next question is then what the default should be: if we use the current culture, the functions will not be referentially transparent.
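A small sketch of why the default matters. An ordinal comparison looks only at UTF-16 code units, so its result is the same on every machine; a current-culture comparison consults whatever culture and globalization data the process is running with, so the same call can legitimately give different answers in different environments. The sample strings are arbitrary.

```fsharp
open System

// Ordinal: compares code units, so 'a' (97) always sorts after 'B' (66).
let ordinal = String.Compare("a", "B", StringComparison.Ordinal)

// Current culture: the result depends on the culture and globalization
// settings of the running process, so no fixed output is claimed here.
let cultural = String.Compare("a", "B", StringComparison.CurrentCulture)

printfn "ordinal sign: %d, current-culture sign: %d" (sign ordinal) (sign cultural)
```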

cloudRoutine commented 7 years ago

Nested modules could be used to organize the functions to get around some of the overload issues, although iirc the F# style guide recommends against them.

[<RequireQualifiedAccess>]
module String =
    open System
    // Default overload: culture-sensitive comparison using the current culture.
    let compare str1 str2 = String.Compare(str1,str2)

    // One nested module per StringComparison mode stands in for the overloads.
    module Ordinal =
        let compare str1 str2 = String.Compare(str1,str2,StringComparison.Ordinal)

    module OrdinalIgnoreCase =
        let compare str1 str2 = String.Compare(str1,str2,StringComparison.OrdinalIgnoreCase)

    module InvariantCulture =
        let compare str1 str2 = String.Compare(str1,str2,StringComparison.InvariantCulture)

I think the sensible approach is to use static extension methods for all of String's instance methods with overloads, and a curried module function for the instance and static methods without overloads.

tomcl commented 7 years ago

You could add the simplest/best method overload as a curried function. I can't see harm in that, though perhaps many would argue that less transparent naming is a very bad price to pay for convenience. While String.ordinal would have obscure semantics, String.cultureInvariant would not, so I don't think useful defaults are a very harmful loss of transparency.

Or you could uniformly translate methods that take a selector into curried functions whose first parameter is a D.U. selector. You could combine the two:

String.compare: string -> string -> bool
String.compareWithOpt: String.Comparison -> string -> string -> bool

Extra parameters could be put into the D.U. or wrapped in Option.
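A sketch of the selector idea, assuming a hypothetical discriminated union `ComparisonKind` and a hypothetical module `StringCmp`; neither name comes from FSharp.Core or the RFC. The functions here return the `int` that `String.Compare` returns, and using Ordinal as the default in `compare` is an assumption made only for illustration.

```fsharp
open System

// Hypothetical selector type standing in for the StringComparison overloads.
type ComparisonKind =
    | Ordinal
    | OrdinalIgnoreCase
    | InvariantCulture

[<RequireQualifiedAccess>]
module StringCmp =
    // Map the F# selector onto the .NET enum.
    let private toNet kind =
        match kind with
        | Ordinal -> StringComparison.Ordinal
        | OrdinalIgnoreCase -> StringComparison.OrdinalIgnoreCase
        | InvariantCulture -> StringComparison.InvariantCulture

    // Selector first, so a comparison mode can be fixed by partial application.
    let compareWith (kind: ComparisonKind) (a: string) (b: string) : int =
        String.Compare(a, b, toNet kind)

    // The "simple" function just fixes a default selector (assumed: Ordinal).
    let compare (a: string) (b: string) : int = compareWith Ordinal a b

// Partial application gives a reusable case-insensitive equality check.
let equalIgnoringCase a b = StringCmp.compareWith OrdinalIgnoreCase a b = 0

printfn "%b" (equalIgnoringCase "Hello" "hELLO") // prints "true"
```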

Merits:

All these things would encourage transparency by not propagating nulls.
Autocomplete inspection of String would give you everything uniformly.
Novice F# users not familiar with .NET would see a more usable language. Maybe this is irrelevant, since novice users not familiar with .NET would never use F#?
People used to .NET could go on using .NET methods.

I guess the array/list thing has no nice resolution except to keep .NET practices. Personally I'd love to have a String.split function that used lists throughout for uniformity.

My biggest reason for wanting everything, in some form, as curried functions is that then you can learn the language without friction. Those who are already very familiar with .NET methods don't have this issue.

dsyme commented 7 years ago

> My biggest reason for wanting everything, in some form, as curried functions is that then you can learn the language without friction. Those who are already very familiar with .NET methods don't have this issue.

There are hundreds of .NET APIs. Would the same apply to DateTime, for example? TimeSpan? Guid? Why just String? And why would this be addressed in FSharp.Core instead of some additional library (FSharpx-like)?

cartermp commented 7 years ago

I think there's a good case for more string functions in the core library. Next to each proposed function I've added the percentage of applications that use the corresponding System.String method/property, according to ApiPort telemetry:

empty : string                                   - 47.3%
isEmpty : string -> bool                         - 50.1%
isWhitespace : string -> bool                    - 21.4%
replace : string -> string -> string -> string   - 32.1%
startsWith/endsWith : string -> bool             - 25.4%
split : seq<char> -> string -> seq<string>       - 31.1%
toUpper/toLower(Invariant) : string -> string    - 12.3%/22.7%
trim : string -> string                          - 29.4%
trimStart/trimEnd : string -> string             - 12.5%/15.5%

That's fairly significant usage which currently has no "fluent F#" options outside of FSharpx.

I won't go into individual member usage for DateTime, Guid, and TimeSpan, but the overall usage of those types is 47.3%, 35.2%, and 36.9%, respectively. Compare this with 83.5% for String.

tomcl commented 7 years ago

> There are hundreds of .NET APIs. Would the same apply to DateTime, for example? TimeSpan? Guid? Why just String? And why would this be addressed in FSharp.Core instead of some additional library (FSharpx-like)?

I'd like more stuff added to core, but there are tradeoffs here and String is the most glaring lack. String is more of a core datatype than DateTime, and deficiencies in String are particularly obvious to those learning the language from the start.

The fact that join is available but not split is a particular anomaly.

dsyme commented 7 years ago

@cartermp Good stats, thanks.

I'm OK with the specific list above, especially since it is determined by data rather than ad hoc design.

However it's a very slippery slope. At some early point F# developers just need to learn how to call .NET APIs. The more you delay it, the more you build up the expectation that you can do more and more without doing that, and the more APIs you end up re-creating.

Strings also have the whole huge issue with language culture. Historically we've only ever put culture-invariant operations in FSharp.Core, and nothing that relies on CurrentCulture. All of the above look invariant, correct?

cartermp commented 7 years ago

I think split might be the most awkward one, since typical usage is with char and not seq<char>, but I suppose the "overloads" here are another discussion point - it would be awkward to have an F# function for one commonly-used overload, but no F# function for another commonly-used overload. I think that if we consider culture and the various overloads, it increases in size by quite a lot, but I find that to be acceptable.

This point:

> At some early point F# developers just need to learn how to call .NET APIs. The more you delay it, the more you build up the expectation that you can do more and more without doing that, and the more APIs you end up re-creating.

is a salient one. I think strings are so ubiquitous that they can warrant a bit of an exception, but generally speaking I agree that we don't want FSharp.Core to be a wrapper for .NET.

dsyme commented 7 years ago

@cartermp Just to confirm that I like your list and would be happy to see a set of additions along those lines, based on that methodology, subject to RFC etc.

gusty commented 7 years ago

@cartermp Another thing to consider with split is what the correct name and functionality should be. There are many ways to split a string; there is actually a whole Haskell library dedicated to handling all the different ways of splitting, and here's a port I did. I think the naming of the .NET function is a bit unfortunate: it splits on any separator, and that's not obvious from the very generic name. That functionality corresponds to splitOneOf, but I would call it something like splitOnAny. If you ask me what a function named split should do, I would say that it splits when it finds the sequence of elements specified in the first parameter; this is called splitOn in that library. Should we stick to the poor naming decision made in the .NET framework many years ago? We can think of a better name and signature for this function, and also consider that a similar function working on lists, arrays or seqs might be added later, and we would like it to be coherent with the existing one for strings.
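To make the distinction concrete, here is a sketch of the two behaviours as separate functions. The names `splitOnAny` and `splitOn` follow the suggestion above, but the module name `StringSplit`, the signatures and the choice of `string list` as the result type are assumptions made for illustration only.

```fsharp
open System

// Hypothetical module; names and signatures are illustrative only.
[<RequireQualifiedAccess>]
module StringSplit =
    // Splits wherever ANY ONE of the separator characters occurs,
    // which is what the .NET Split(char[]) overload does.
    let splitOnAny (separators: seq<char>) (s: string) : string list =
        s.Split(Seq.toArray separators) |> List.ofArray

    // Splits only where the WHOLE separator string occurs.
    let splitOn (separator: string) (s: string) : string list =
        s.Split([| separator |], StringSplitOptions.None) |> List.ofArray

let anyOf = StringSplit.splitOnAny [ ','; ';' ] "a,b;c" // ["a"; "b"; "c"]
let whole = StringSplit.splitOn ", " "a, b,c"           // ["a"; "b,c"]

printfn "%A %A" anyOf whole
```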

cartermp commented 7 years ago

@gmpl I don't think we should be in the business of changing the functionality, at least not for this particular issue. The intention here is to just have nice F# wrappers over common .NET utilities which are awkward due to the lack of partial application.

gusty commented 7 years ago

@cartermp I'm fine with the functionality, but not with the name. It's too generic for that very specific functionality.

cartermp commented 7 years ago

split splitting on any parameter isn't as precise in the naming as it could be, but I don't find it surprising that if I pass in multiple characters, it will split on any of those. I passed in multiple characters, after all.

I think the larger question is this: Should we be in the business of being more precise than .NET, or should we simply offer a functional approach that's in the spirit of .NET? I believe that the latter is the best approach, particularly given the corpus of material out there on .NET APIs and their behaviors. The dynamic in this case - offering a nice wrapper around .NET APIs rather than an alternative to .NET APIs - is why I have that opinion. I'm curious about what others think, though.

tomcl commented 7 years ago

There are two different motivations here. One is for functional .NET wrappers so that those familiar with .NET can transition to a smoother functional experience. The other is for "small but adequate core functionality" within a wholly functional world.

The two would lead to different functions, and care would be needed to differentiate the two if they coexist. But at the moment we have neither.

My own view in this case is that with Split, mimicking .NET nomenclature and use is quite awkward, and a simple functional split (with different names to differentiate from .NET, perhaps) would be a better choice. This would be a different wrapper using the same implementation. But I don't feel this strongly, and maybe it would go against practice elsewhere.

There are very many different ways in which one might reasonably define a simple functional split. That need not prevent one from being chosen, since any of them would be better than none.

gusty commented 7 years ago

@cartermp I don't feel like we are going in the direction of respecting every single name from old .NET APIs. F# has its own names; for instance we have (luckily) List.map and even String.map instead of Select. You take the split name because for strings and chars it feels natural, but then on lists someone adds a split function that does something else. Later there are user voices asking to unify the names, but it's too late to do that without breaking changes. Regarding the larger question, being more precise than .NET is not going against the spirit of .NET, since even in the framework the API style and names change as it evolves.

gusty commented 7 years ago

Here's a different idea, which unifies all cases.

I'm thinking that actually the .NET Split method is generic. Because by using the overload that takes an array of strings we can get both functionalities at the same time.

The string is a sequence of chars, so that's the splitOn functionality, but we can specify many of them, so that's the splitOneOf functionality. Additionally there is the StringSplitOptions parameter.

But then, moving to F# we don't want to use overloads. Then as I see it now there are two reasonable alternatives:

Create two individual functions like splitOn and splitOneOf or whichever names we decide but not the generic name split, each one with the specific functionality.
Create a single split function which takes an array (or a list which is more F#-ish) of strings, and has both functionalities at the same time. In its signature the result type is the same as the separator parameter type.

The disadvantage of the latter, as @cartermp already noted, is that you end up specifying a singleton array most of the time, but considering the generic functionality it may be a good trade-off.
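A sketch of the second alternative: a single `split` that takes a list of separator strings and so covers both behaviours. It is built on the existing `String.Split(string[], StringSplitOptions)` overload; the function name, the use of lists and the fixed `StringSplitOptions.None` are assumptions for illustration.

```fsharp
open System

// Hypothetical unified split: every separator is a string, and the input
// is split wherever any of them occurs in full.
let split (separators: string list) (s: string) : string list =
    s.Split(List.toArray separators, StringSplitOptions.None) |> List.ofArray

// A singleton list gives the "split on this exact sequence" behaviour.
let onSequence = split [ ", " ] "a, b,c"  // ["a"; "b,c"]

// Several one-character strings give the "split on any of these" behaviour.
let onAny = split [ ","; ";" ] "a,b;c"    // ["a"; "b"; "c"]

printfn "%A %A" onSequence onAny
```

The singleton-list case is the cost mentioned above: the common call `split [ "," ]` carries brackets that a dedicated single-separator function would not need.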

saul commented 7 years ago

Let's get this moving along with an RFC :) https://github.com/fsharp/fslang-design/pull/186

rmunn commented 7 years ago

The RFC is being discussed at https://github.com/fsharp/fslang-design/issues/187, where I am currently arguing for defaulting to StringComparison.Ordinal in all functions where that is relevant (equals, startsWith/endsWith, indexOf, and all other string functions that need to compare strings or substrings). If anyone disagrees with that choice, please pop over to that issue and argue against my reasoning. I believe that there are good reasons for making StringComparison.Ordinal the default, but I'd hate to see a bad default chosen because I missed a better reason against Ordinal. So if anyone cares about this choice and hasn't already looked at the discussion in https://github.com/fsharp/fslang-design/issues/187, then come over there and have your say.
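For readers weighing in on that choice, here is a sketch of what an Ordinal default could look like in practice. The module name `StringOrd` and the exact signatures are hypothetical and are not taken from the RFC text; only the underlying `StartsWith`, `EndsWith` and `IndexOf` overloads taking a `StringComparison` are existing .NET APIs.

```fsharp
open System

// Hypothetical module illustrating an Ordinal default; not the RFC's API.
[<RequireQualifiedAccess>]
module StringOrd =
    let startsWith (prefix: string) (s: string) =
        s.StartsWith(prefix, StringComparison.Ordinal)

    let endsWith (suffix: string) (s: string) =
        s.EndsWith(suffix, StringComparison.Ordinal)

    let indexOf (value: string) (s: string) =
        s.IndexOf(value, StringComparison.Ordinal)

// Ordinal matching is case-sensitive and independent of culture,
// so "Lib.FS" does not match the ".fs" suffix.
let fsFiles =
    [ "Program.fs"; "README.md"; "Lib.FS" ]
    |> List.filter (StringOrd.endsWith ".fs")

printfn "%A" fsFiles // prints ["Program.fs"]
```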

dsyme commented 2 years ago

Just to note that the current status of discussion is captured here: https://github.com/fsharp/fslang-design/discussions/187#discussioncomment-1225149

I do think this small number of functions should be added.

fsharp / fslang-suggestions

Add more string manipulation functions to the String module [ RFC FS-1033 ] #112