alphagov / govuk-frontend

GOV.UK Frontend contains the code you need to start building a user interface for government platforms and services.
https://frontend.design-system.service.gov.uk/
MIT License

Character count component counts code points, not characters #1104

Open 36degrees opened 5 years ago

36degrees commented 5 years ago

The character count currently uses string.length to establish the length of the user input. string.length counts code units, not characters, and this can lead to some confusing results when using certain strings.

A single emoji (👩🏻‍🚀) counted as 7 characters within the character count component

You can see this by entering the following strings into the character count component:

| String | Result |
| --- | --- |
| cat😹 | The emoji is counted as 2 code units, and the length is reported as 5 characters. |
| cafȩ́ | Each combining mark is counted separately, and the reported length is 6 characters. |
| 👩🏻‍🚀 | Because this emoji includes both gender and skin modifiers and a zero-width joiner, this single character is counted as 7 characters. |

We should probably find a less naive way to count characters in strings, but we also need to work out how this will work with any backend validation or data storage on a service, which may already be using a different definition of a 'character' (for example, where the backend or storage treats one character as one byte).

Further reading:

selfthinker commented 5 years ago

Another good article on the subject: https://blog.jonnew.com/posts/poo-dot-length-equals-two

dashouse commented 5 years ago

@36degrees this seems like it could end up being quite a serious bug if used in a service with multiple language support.

If we can't fix it, should it be documented?

NickColley commented 5 years ago

Seems like a robust solution is very code-heavy, which would not be suitable for client-side use.

I think Dave's suggestion of leaving it as is but documenting how it works would be the best way forwards...

simonneb commented 4 years ago

We noticed a similar issue in the character counts on the GOV.UK Notify service when sending non-English characters to the service. It turns out that Notify was counting bytes and not characters. This was fixed by the team.

lfdebrux commented 2 years ago

MDN suggests that you can use the string iterator to count characters

function getCharacterLength (str) {
  // The string iterator that is used here iterates over Unicode code points,
  // not UTF-16 code units
  return [...str].length;
}

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length

Might be worth spiking....
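
To show what the iterator approach does and does not fix, here is a minimal sketch comparing `length` with the spread-based count for two of the strings from the issue description. The emoji are written as escape sequences so the code points are unambiguous. `countCodeUnits` and `countCodePoints` are names made up for this example, not part of GOV.UK Frontend.

```javascript
// Counts UTF-16 code units, which is what string.length reports
function countCodeUnits (str) {
  return str.length;
}

// Counts Unicode code points by spreading the string iterator
function countCodePoints (str) {
  return [...str].length;
}

// "cat" followed by a single emoji (U+1F639)
const catEmoji = 'cat\u{1F639}';
// woman + skin tone modifier + zero-width joiner + rocket
const astronaut = '\u{1F469}\u{1F3FB}\u200D\u{1F680}';

console.log(countCodeUnits(catEmoji), countCodePoints(catEmoji)); // 5 4
console.log(countCodeUnits(astronaut), countCodePoints(astronaut)); // 7 4
```

The iterator gives the expected answer for `cat😹`, but an emoji built from several code points is still counted as more than one character.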

querkmachine commented 1 year ago

Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.

It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.

A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is: they have a blank glyph and no width.
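
A rough sketch of grapheme counting with `Intl.Segmenter` might look like the following. The function name `countGraphemes` and the fallback to code point counting are assumptions made for this example; they are not taken from GOV.UK Frontend.

```javascript
// Counts grapheme clusters (user-perceived characters) with Intl.Segmenter.
// Falls back to counting code points where Intl.Segmenter is unavailable.
function countGraphemes (str, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return [...str].length;
  }
  // An undefined locale makes the segmenter use the runtime's default locale
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

// woman + skin tone modifier + zero-width joiner + rocket
const astronautEmoji = '\u{1F469}\u{1F3FB}\u200D\u{1F680}';

console.log(countGraphemes('cat\u{1F639}')); // 4
console.log(countGraphemes(astronautEmoji)); // 1
```

Under the Unicode grapheme cluster rules a lone `\n` should come back as its own segment and `\r\n` as a single one, so new line handling is worth checking in each target browser rather than assuming.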


Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

36degrees commented 1 year ago

> Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?

I'd suggest making it possible to pass a custom counting function – see #1364.

When we do change the counting implementation we should treat it as a breaking change, and we might want to do #1364 first, so service teams can 'override back' to the current code unit-based approach (the one string.length gives today).
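
As a sketch of what a configurable counting function could look like: the `countFunction` option and the `getRemaining` helper below are hypothetical and only illustrate the idea. They are not the API proposed in #1364.

```javascript
// Default behaviour: count UTF-16 code units, as string.length does today
function defaultCount (text) {
  return text.length;
}

// Hypothetical helper: works out how many characters remain, using
// whichever counting function the service team supplies (if any)
function getRemaining (text, options) {
  const count = typeof options.countFunction === 'function'
    ? options.countFunction
    : defaultCount;
  return options.maxlength - count(text);
}

const input = 'cat\u{1F639}'; // "cat" followed by one emoji

// Without an override the emoji costs 2 of the 10 allowed characters
console.log(getRemaining(input, { maxlength: 10 })); // 5

// A service team opting in to code point counting
console.log(getRemaining(input, {
  maxlength: 10,
  countFunction: (text) => [...text].length
})); // 6
```

Keeping the existing count as the default means a team whose backend limit is expressed in code units would see no change unless they opt in.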

colinrotherham commented 1 year ago

Thanks to @querkmachine for linking me to this issue

We both suspect that a recently spotted issue in Internet Explorer 8 is caused by new lines being counted as two characters, since a new line can be either \n (POSIX default) or \r\n (Windows default).

Grapheme counting code examples look huge, but it would be great to align client-side and server-side counts. Having the "custom counting function" return a Promise would allow a fetch() (or AJAX) response to return the count if really necessary.

Google Chrome

Screenshot of the Character Count in Google Chrome, showing "You have 23 characters too many"

Internet Explorer

Screenshot of the Character Count in Internet Explorer, showing "You have 26 characters too many"

colinrotherham commented 1 year ago

Think we can let IE8 off here

> We both suspect that a recently spotted issue in Internet Explorer 8 is caused by new lines being counted as two characters, since a new line can be either \n (POSIX default) or \r\n (Windows default).

In 2016 the HTML Standard switched minlength/maxlength new line normalisation from \r\n to \n

Consensus wasn't found on characters, code points and grapheme clusters:

Interesting that WebKit is sticking with grapheme clusters to avoid user confusion:

dav-idc commented 1 year ago

Here's a recent comment on the character count backlog issue: https://github.com/alphagov/govuk-design-system-backlog/issues/67#issuecomment-1377416517

Here the issue doesn't appear to be related to a specific browser, but rather that the frontend counts \n as 1 'character', while \n gets stored as 2 characters in the backend.
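
One way a service could line the two counts up is to count new lines on the client the same way the backend stores them. The sketch below assumes the backend ends up with every new line as \r\n; `countWithCrlfNewLines` is a made-up name for this example.

```javascript
// Counts text the way a backend would if it stores every new line as \r\n
function countWithCrlfNewLines (text) {
  // Collapse any existing \r\n (or lone \r) to \n first,
  // so that no new line ends up being expanded twice
  const normalised = text.replace(/\r\n?/g, '\n');
  // Then expand each \n to \r\n and count UTF-16 code units
  return normalised.replace(/\n/g, '\r\n').length;
}

console.log('a\nb'.length); // 3
console.log(countWithCrlfNewLines('a\nb')); // 4
console.log(countWithCrlfNewLines('a\r\nb')); // 4
```

The opposite approach, normalising \r\n to \n on the server before validating, would work equally well as long as both sides agree.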

mgladdish commented 1 year ago

We've had a user report of this issue in production today - the frontend character count not matching the backend validation rule. It's deeply confusing for the end user and isn't a great look for our service when it appears it can't even count words consistently.

36degrees commented 2 months ago

> Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.
>
> It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.
>
> A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is: they have a blank glyph and no width.

Intl.Segmenter landed in Firefox 125 back in April, so we're probably at the point where we could consider using it, perhaps as part of an opt-in alternative count function that users can configure.

We should make sure to benchmark its performance, especially on lower-powered devices and in some of the older browsers that include it.

We may also need to look at reducing the number of times the count function is called.
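
One simple way to cut down on repeated work, sketched here under the assumption that the count is often requested again for unchanged text, is to remember the last input and its result. `createCachedCounter` is a hypothetical helper, not existing GOV.UK Frontend code.

```javascript
// Wraps a counting function so it only runs when the text has changed
function createCachedCounter (countFunction) {
  let lastText = null;
  let lastCount = 0;

  return function (text) {
    if (text !== lastText) {
      lastText = text;
      lastCount = countFunction(text);
    }
    return lastCount;
  };
}

// Track how often the underlying (potentially expensive) count runs
let calls = 0;
const cachedCount = createCachedCounter(function (text) {
  calls++;
  return [...text].length;
});

cachedCount('hello');
cachedCount('hello'); // unchanged text, served from the cache
cachedCount('hello!');

console.log(calls); // 2
```

Debouncing the input handler would be another option, and the two approaches could be combined.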