Open 36degrees opened 5 years ago
Another good article on the subject: https://blog.jonnew.com/posts/poo-dot-length-equals-two
@36degrees this seems like it could end up being quite a serious bug if used in a service with multiple language support.
If we can't fix it should it be documented?
Seems like a robust solution is very code heavy which would not be suitable for clientside.
I think Dave's suggestion of leaving it as is but documenting how it works would be the best way forwards...
We noticed a similar issue in the character counts on the GOV.UK Notify service when sending non-English characters to the service. It turns out that Notify was counting bytes and not characters - this was fixed by the team.
MDN suggests that you can use the string iterator to count characters:

    function getCharacterLength (str) {
      // The string iterator that is used here iterates over characters,
      // not mere code units
      return [...str].length;
    }

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length
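For anyone spiking this, here's a rough sketch of what the iterator approach does and doesn't fix. The function name `countCodePoints` and the sample strings are my own, purely for illustration. It counts code points, so an emoji outside the BMP becomes 1, but sequences joined with zero-width joiners or combining marks still count as more than one.

```javascript
// Hypothetical helper: counts Unicode code points via the string iterator.
function countCodePoints (str) {
  return [...str].length;
}

// U+1F4A9 (pile of poo): 2 code units, but a single code point.
const pooCount = countCodePoints('\u{1F4A9}'); // 1

// Family emoji: man + ZWJ + woman + ZWJ + girl is 5 code points,
// even though it is drawn as a single glyph.
const familyCount = countCodePoints('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'); // 5

// 'e' followed by a combining acute accent: 2 code points, 1 visible letter.
const accentCount = countCodePoints('e\u0301'); // 2

console.log({ pooCount, familyCount, accentCount });
```

So this is an improvement over counting code units, but it still won't always match what a user would call a character.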
Might be worth spiking....
Intl.Segmenter is a more recent addition to the JS Internationalization API which can split strings into "graphemes" (user-perceived characters), rather than code points. Like other parts of Intl, it's locale-aware and can change how it counts depending on the configuration.
It's been available for a short time in Chromium browsers (Chrome and Edge 87, Opera 73, Samsung 14) and Safari (14.1), but is not yet supported in Firefox.
A potential issue with this is that it doesn't count new lines. New lines are registered as a code point, but are not considered graphemes as they are not "user-perceivable" in the same way something like a space character is: they have a blank glyph and no width.
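A minimal sketch of grapheme counting with Intl.Segmenter, assuming we fall back to code point counting where the API is missing. The `countGraphemes` name and the fallback choice are assumptions on my part, not something the component does today. It would also be worth running the last line in each target browser to check how new lines are actually counted.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Falls back to counting code points when Intl.Segmenter is unavailable.
function countGraphemes (str, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return [...str].length;
  }

  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  // segment() returns an iterable of segments; we only need how many there are.
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// Family emoji built from three emoji joined by zero-width joiners.
console.log(countGraphemes('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'));

// A string containing a new line, to check how line breaks are counted.
console.log(countGraphemes('a\nb'));
```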
Relatedly, do we need to be sure that service teams aren't using the character count to convey technical limitations? For example, if a database column can only support a maximum of 512 characters, then they do want to limit the input to 512 code points, not 512 graphemes. Would this need to be a configuration option?
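To make the storage concern concrete, here's a quick comparison of the same input measured three ways. The sample value is made up, and which measure a given database actually enforces is something each service team would need to check.

```javascript
// Made-up sample input: "café" plus a space and U+1F4A9.
const value = 'caf\u00E9 \u{1F4A9}';

// Code units: what string.length reports.
const codeUnits = value.length; // 7

// Code points: what the string iterator reports.
const codePoints = [...value].length; // 6

// UTF-8 bytes: what a byte-limited column or API would see.
const utf8Bytes = new TextEncoder().encode(value).length; // 10

console.log({ codeUnits, codePoints, utf8Bytes });
```

A limit of "512 characters" could mean any of these depending on the backend, which is part of why a single built-in definition is hard to pick.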
I'd suggest making it possible to pass a custom counting function – see #1364.
When we do change the counting implementation we should treat it as a breaking change – and we might want to do #1364 first, so service teams can 'override back' to the current code point-based approach.
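As a rough idea of what the override could look like: the `countFunction` option name and `createRemainingCounter` helper below are invented for this sketch and are not the API proposed in #1364. The default here is the string.length behaviour described in this issue.

```javascript
// Current behaviour as described in this issue: count with string.length.
function legacyCount (str) {
  return str.length;
}

// Hypothetical factory: returns a function that reports how many
// "characters" remain, using a custom count function if one is supplied.
function createRemainingCounter (options = {}) {
  const count = typeof options.countFunction === 'function'
    ? options.countFunction
    : legacyCount;

  return function remaining (value, maxlength) {
    return maxlength - count(value);
  };
}

// Default: U+1F4A9 uses 2 of the 10 available.
const defaultRemaining = createRemainingCounter();
console.log(defaultRemaining('\u{1F4A9}', 10)); // 8

// Override: count code points instead, so it uses 1 of the 10.
const codePointRemaining = createRemainingCounter({
  countFunction: (str) => [...str].length
});
console.log(codePointRemaining('\u{1F4A9}', 10)); // 9
```

If the default changed in a later release, a service could pass the old function back in to keep its existing behaviour.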
Thanks to @querkmachine for linking me to this issue
We were both thinking a recently spotted issue in Internet Explorer 8 is likely new lines being counted as two characters, with new lines either as \n (POSIX default) or \r\n (Windows default).
Grapheme counting code examples look huge, but it would be great to align client-/server-side counts. Having the "custom counting function" as a Promise would allow a fetch() (or AJAX) response to return the count if really necessary.
Shows "You have 23 characters too many"
Shows "You have 26 characters too many"
Think we can let IE8 off here
In 2016 the HTML Standard switched minlength/maxlength new line normalisation from \r\n to \n.
Consensus wasn't found on characters, code points and grapheme clusters:
Interesting that WebKit is sticking with grapheme clusters to avoid user confusion:
RESOLVED WONTFIX
Here's a recent comment on the character count backlog issue: https://github.com/alphagov/govuk-design-system-backlog/issues/67#issuecomment-1377416517
Here the issue doesn't appear to be related to a specific browser, but rather that the frontend counts \n as 1 'character', but \n gets stored as 2 characters in the backend.
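If a backend stores each line break as \r\n, one possible workaround is to count every line break as two on the frontend as well. This is only a sketch under that assumption, and `countWithCrlf` is a made-up name.

```javascript
// Hypothetical helper: counts code units as if every line break were \r\n,
// to mirror a backend that stores new lines as two characters.
function countWithCrlf (str) {
  // Match \r\n first so an existing pair isn't expanded twice.
  const normalised = str.replace(/\r\n|\r|\n/g, '\r\n');
  return normalised.length;
}

console.log('a\nb'.length);          // 3: what the frontend counts today
console.log(countWithCrlf('a\nb'));  // 4: what a \r\n backend would store
console.log(countWithCrlf('a\r\nb')); // 4: already normalised, unchanged
```

Whether this is right for a given service depends on how its backend normalises line breaks before validating.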
We've had a user report of this issue in production today - the frontend character count not matching the backend validation rule. It's deeply confusing for the end user and isn't a great look for our service when it appears it can't even count words consistently.
Intl.Segmenter landed in Firefox 125 back in April, so we're probably at the point where we could consider using it, perhaps as part of an opt-in alternative count function that users can configure.
We should make sure to benchmark its performance, especially on lower-powered devices and in some of the older browsers that include it.
We may also need to look at reducing the number of times the count function is called.
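One cheap way to cut down on calls, sketched here as an assumption rather than a plan: remember the last value and its count, and only recount when the value has actually changed. `cacheLastCount` is a hypothetical wrapper name.

```javascript
// Hypothetical wrapper: skips the (potentially expensive) count function
// when it is asked about the same value twice in a row.
function cacheLastCount (countFunction) {
  let lastValue = null;
  let lastCount = 0;

  return function cachedCount (value) {
    if (value !== lastValue) {
      lastValue = value;
      lastCount = countFunction(value);
    }
    return lastCount;
  };
}

// Usage: track how often the underlying function really runs.
let calls = 0;
const countOnce = cacheLastCount((str) => {
  calls++;
  return [...str].length;
});

countOnce('hello');
countOnce('hello'); // served from the cache
countOnce('hello!');
console.log(calls); // 2
```

This wouldn't help while someone is actively typing, since every keystroke changes the value, so debouncing the handler might still be needed on top.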
The character count currently uses string.length to establish the length of the user input.

string.length counts code units, not characters, and this can lead to some confusing results when using certain strings.

You can see this by trying the following strings in the character count component:
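For illustration, here's what string.length reports for a few strings. These samples are picked for this sketch and are not necessarily the ones originally listed.

```javascript
// Sample strings chosen for illustration only.
const samples = {
  ascii: 'abc',
  emoji: '\u{1F4A9}',          // pile of poo: one code point, stored as a surrogate pair
  flag: '\u{1F1EC}\u{1F1E7}',  // UK flag: two regional indicators, two code units each
  decomposed: 'e\u0301'        // 'e' followed by a combining acute accent
};

// Map each sample name to the length that string.length reports.
const lengths = Object.fromEntries(
  Object.entries(samples).map(([name, str]) => [name, str.length])
);

console.log(lengths); // { ascii: 3, emoji: 2, flag: 4, decomposed: 2 }
```

Each of the last three looks like a single character on screen, but uses 2 or 4 of the user's allowance.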
We should probably find a less naive way to count characters in strings, but we also need to work out how this will work with any backend validation or data storage on a service, which may already be using a different definition of a 'character' (for example, where the backend or storage treats one character as one byte).
Further reading: