crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.22k stars 1.61k forks source link

Discussion about Grapheme API #11610

Open straight-shoota opened 2 years ago

straight-shoota commented 2 years ago

The Grapheme API for strings was introduced in #11472.

However, there is still a debate about the actual data format for graphemes. They are a sequence of codepoints, which is typically represented as String. But we chose to have a dedicated type, String::Grapheme for this. It's a wrapper around String and provides an optimization for grapheme clusters that consist of only a single code point. They are stored as Char to avoid lots of tiny string allocations for the very common single-codepoints graphemes.

This was already discussed in the introducing PR (#11472) and its precursor (#10720).

To summarize, I'm copying my original comment about representation options from https://github.com/crystal-lang/crystal/pull/11472#issuecomment-977178131 with some additions:


  1. The Char | String union type is a huge improvement for grapheme clusters consisting of a single character. That should be most common in typical use cases, so it saves a big deal of allocations compared to String (or a collection of Char as in the initial propsal). However, in some contexts, multi-character grapheme clusters are quite ubiquitous. Whether you're dealing with lots of composed emojis or scripts such as Thai, Hebrew or Arabic, or even just many decomposed diacritics, there might be a lot of string allocations for all these graphemes.
  2. We can expose the method String#each_grapheme_boundary as part of the public API. That's trivial and basically cost-free. However, I'd consider it more a low-level API. It can be useful for custom grapheme-based implementations.
  3. Employing a StringPool to avoid repeated allocations of identical grapheme clusters. There is no way to release values from a pool, so it would be a bad idea to use a global pool. But individual pools could be useful for single iterations. This could optionally be configurable and the API could allow specifying a custom pool to be used. This would be pretty easy to implement later. We just need to add StringPool parameters to Grapheme.new and the iteration methods. A pool means we still need to allocate strings for every grapheme, but we can safe on duplicate allocations of the same graphemes. It is likely that the same graphemes show up repeatedly in longer texts of the same context, so that should be expected to be benefitial.
  4. We could avoid allocating a string entirely by keeping a reference of the original string and making Grapheme essentially a pointer to a substring. String allocation would only happen in Grapheme#to_s. The big downside of this approach is that it keeps the original string in memory. That's a serious implication when working with large amounts of texts. If the Grapheme instaces are only used for iteration, this is not a problem. But if you start storing Graphemes somewhere, they keep pointers to the original strings alive. An example would be collecting frequencies of graphemes over a collection of texts. Assuming every evaluated texts contains a multi-character grapheme that was not encountered previously, all these strings will be kept in memory. The user would need to be aware of this behaviour and take action such as converting grapheme instances to self-owned cluster strings.
  5. (https://github.com/crystal-lang/crystal/pull/11472#issuecomment-977799788) Clusters consisting of multiple codepoints, but still in a small number could also be stored on the stack, with something like Char | {Char, Char} | String or {size: Int32, data: Char[4]} | String. This would cover more cases, but there is technically no limit to the length of a single cluster, so String is always needed for those that exceed the length.
  6. (https://github.com/crystal-lang/crystal/pull/11605#issuecomment-996649623) The simplest API would just use String instead of a dedicated Grapheme type. Elixir's string implementation takes this approach. The downside is that every grapheme cluster is heap allocated. This could be combined with 1. and/or 2. for some optimization. Another improvement would be a global pool for common grapheme clusters.
asterite commented 2 years ago

I think what's needed here is why one uses graphemes. What's the use case? Maybe it's for showing them in a text editor. I guess it must be something related to visual stuff. Maybe counting how many graphemes there are?

If those are the use cases, then adding more methods to Grapheme, like each_char and each_byte might not be needed. If one wants that, they can turn a Grapheme into a String. That allocates memory if the Grapheme stores a Char, but if these uses cases are rare then it should be fine.

So maybe keeping Grapheme like it's right now is fine.

straight-shoota commented 2 years ago

Yes, graphemes are about presentation. But that's really in a very similar manner as characters. You only need them for visual stuff. Everything else really works well with the raw bytes (assuming an agreed encoding). The thing is just that many (most?) things we do with strings are related to visuals. Even if they don't directly deal with rendering stuff. In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes. Using characters is a simplification when grapheme clusters are excluded in the domain or they just don't matter to the implementation.

I actually have a use case for each_char/each_byte: Fixing up https://github.com/crystal-lang/crystal/pull/11520 (see https://github.com/crystal-lang/crystal/issues/11478#issuecomment-996562879). They are very useful and if we don't add these methods, we might just use String instead of a dedicated Grapheme type because it defeats the purpose if I have to allocate a string again for every single-codepoint grapheme.

asterite commented 2 years ago

In fact, I'm pretty sure that when you do string operations based on characters, it very often should really be about graphemes

Yes! I think this would be ideal.

Ideally:

But I think we are in a point where we can't make these changes without breaking backwards compatibility. Or maybe there's a way to do it?

asterite commented 2 years ago

(by the way, this is how Elixir and Swift work, I think, except that Swift has a Character type that's essentially a grapheme)

beta-ziliani commented 2 years ago

I think it's doable—and desirable—for 2.0, but not for 1.X.

straight-shoota commented 2 years ago

I think we might be stepping away from the main topic into a different discussion. Of course, it's related and the future of grapheme vs. code point in the String API should influence the architecture of Grapheme API.

And honestly I'm not entirely sure if it's really doable - or even desirable - to essentially replace Char by Grapheme as the main constituent of strings.

Graphemes are technically more correct and probably never wrong. But there's a lot of additional complexity involved. One is memory representation - as described in this issue - and resulting performance considerations. The other is that the rules for grapheme clusters are not even static. They depend on the Unicode version. Programs using different Unicode versions may have different opinions whether a sequence of code points describes a single grapheme cluster or multiple graphemes. That's different from encoding codepoints in UTF-8, for example. The format is independent of Unicode versions and even implementations based on older versions can work with that. They may not understand the meaning of the code points, but there's no ambiguity about the byte representation of code points.

hutou commented 1 year ago

Some Unicode characters displayed in a terminal take up more than one column, and therefore can break the alignment of tables. So, it would be useful to have a method giving the "visual" length of a Unicode string. For example, ruby has the following gem : ruby-unicode-display-width Would it be appropriate to include this use case in this discussion ?

Blacksmoke16 commented 1 year ago

@hutou https://github.com/crystal-lang/crystal/issues/11038

hutou commented 1 year ago

@Blacksmoke16, Thanks

straight-shoota commented 1 year ago

With https://github.com/crystal-lang/crystal/pull/13335 it's easy to add additional fields to String (ref. https://github.com/crystal-lang/crystal/issues/12020#issuecomment-1516090822). This could open an opportunity to cache grapheme_size in the string header for repeated use.