Kingwl / proposal-string-trim-characters

Proposal to add argument for .trim(), .trimStart() and .trimEnd() to allow strip the specified characters from strings.
MIT License
27 stars 1 forks source link

What should multicharacter parameters do? #3

Open theScottyJam opened 3 years ago

theScottyJam commented 3 years ago

A lot of discussion on this issue has already happened on this. other, unrelated issue. Refer to it for more context and information.

The question is how we should handle it when .trim/.trimStart/.trimEnd receives a string with multiple characters. Here's some options:

  1. Trim characters from the target string, as long as each character is found in the passed-in string. e.g. "abaabc".trimStart("ab") === c (most languages seem to follow this approach)
  2. Trim the entire parameter from the target string, with its order preserved. e.g. "ababac".trimStart("ab") === 'ac'
  3. Throw an error

Here are some of my thoughts on the matter:

@jamiebuilds mentioned previously of an informal survey he did, where he asked Twitter followers which behavior they would expect. The overwhelming majority said they would expect option number 2. I honestly would too - I wouldn't be surprised if I had made the wrong assumption in python a couple of times, and wrote incorrect code like this 'https://example.com'.lstrip('https://') which appears to work - until the first character of the domain starts with "h", "t", "p", or "s". it feels weird to me to treat a string as an unordered set of characters. If we want that kind of behavior, then we should have this parameter accept an optional array of characters, not a string (this means we could support both behaviors if wanted - if it's a string, do option 2, if it's an array, do option 1). It also means we can support an array of multi-character strings if wanted, thus doing both options 1 and 2 at the same time. I'm not necessarily advocating for this hybrid approach, just mentioning that it's an option.

What's more, I feel like option 2 fits much better with unicode. If we're using option 1, and this function isn't unicode aware, then the following footgun may happen:

> 'Hi! πŸ‘‹πŸΎ'.trimEnd('πŸ‘ŒπŸΎ')
'Hi! πŸ‘‹'

The "πŸ‘‹πŸΎ" emoji is composed of four characters, two for the hand, and two for the brown skin modifier. "πŸ‘ŒπŸΎ" is also composed of four characters, including the exact same brown-skin modifier. Thus, half of the "πŸ‘ŒπŸΎ" emoji can be used to remove the brown skin from the "πŸ‘‹πŸΎ" emoji.

I presume there's ways to make this unicode aware, but option 2 (or the hybrid approach) does not exhibit this issue. Plus, there's benefits to making this function not unicode aware, for example, maybe your string contains binary data, and you want .trimEnd() to operate on the bytes.

For those who's devices don't render the above unicode characters correctly, here's a screenshot of the presented code snippet: image

Kingwl commented 3 years ago

Thanks!

It's yet another interesting point.

Let's check what does prior art do:

ljharb commented 3 years ago

It seems to me like there's a simple answer here: instead of taking trim characters (which then begs the question of, what about code units? grapheme clusters? etc), why not follow string padding and take a trim string?

See https://tc39.es/ecma262/#sec-stringpad - iow, whatever string is passed in is just repeated to pad the string. Similarly, the trim methods could take a trim string that just removes all copies of that complete string.

Kingwl commented 3 years ago

why not follow string padding and take a trim string

Personally, I'm ok with that. It is more intuitive. And honestly, I would think so when I saw this.

But It would cause more difference with other languages and existed libs.

When you have some experience about other languages or you have used existed libs like lodash. It will surprise you again.

This means that when I am a user, I want it to work like trimPrefix/Postfix. But as an API designer, I will prefer to follow the common behavior (trimCharsFromStart/End).

theScottyJam commented 3 years ago

But It would cause more difference with other languages and existed libs.

This is a fair point. My rebuttal would be that either option would cause confusion for different groups of people. If you wrongly assume the behavior is trimPrefix/trimPostfix when it's not, then your code will still appear to work correctly most of the time (see my http trimming example). On the other hand, you're more likely to notice if you wrongly assumed it was trimCharFromStart/End, as it's more likely to give the wrong output.

Kingwl commented 3 years ago

Another point about Unicode that I suppose is:

code point is not equals as combining character.

Which means we have to correctly support code point (eg: single emoji). But about combining character (eg: with skin modifier), I found this:

The components of a combining character sequence are treated as individual Unicode code points even though a user might think of the whole sequence as a single character.

ljharb commented 3 years ago

That's the code unit/code point/grapheme problem, and the language doesn't really handle graphemes holistically.

Kingwl commented 3 years ago

Okay I agreed.

I realized that seems there's nothing handled about code unit/code point/grapheme in string's built-in methods before. (Unfortunately) So, we should not care about them here too.

So, the Unicode should not be a part of major point of this proposal anymore. I'll update the README. Done. And conversely, Unicode can be a reason to block the character sets solution.

We have two options here thought: