Fixes #3345

Make the implementation of the character conversion consistent with its consumers in the web app.

Type of change

Objective

This PR fixes a mojibake issue in the FIDO2 WebAuthn flow.

You can see the detailed description of the issue and screenshots in issue #3345. In short, the current implementation displays wrong, unreadable text "WebAuthn ã�®èª�証" where it should display "WebAuthn の認証" when the user locale is Japanese. You can also find a minimal reproducible example of the issue at https://github.com/bitwarden/mobile/issues/3345#issuecomment-2212340286.

I have actually reproduced the issue only in the Japanese locale. Theoretically, however, it can happen in any locale whose localized text for the Fido2AuthenticateWebAuthn message contains multibyte characters. You should also be able to reproduce the issue in an arbitrary locale by inserting a Unicode emoji into the text.

Code changes

This PR removes the extra character conversion steps from Bit.App.Utilities.AppHelpers::EncodeDataParameter. This change makes the data encoding in the method consistent with the decoding at apps/web/src/connectors/captcha.ts#L50 and apps/web/src/connectors/webauthn.ts#L88.
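For orientation, this is roughly what the method looks like with the extra conversion steps removed; treat it as a sketch of the change rather than the exact diff:

public static string EncodeDataParameter(object obj)
{
    // JSON-serialize, take the UTF-8 bytes, Base64-encode, then URL-encode.
    // No intermediate percent-escaping or per-byte character conversion is needed.
    return WebUtility.UrlEncode(
        Convert.ToBase64String(
            Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(obj))));
}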
Analysis of the issue in the original implementation
The problem with the following code in the original implementation is that it applies an unnecessary extra encoding to multibyte characters before the essential encoding steps, and there is no corresponding decoding in the end consumer of the return value.
public static string EncodeDataParameter(object obj)
{
    string EncodeMultibyte(Match match)
    {
        return Convert.ToChar(Convert.ToUInt32($"0x{match.Groups[1].Value}", 16)).ToString();
    }

    // (1) JSON serialization, then (2) percent-encoding of the UTF-8 bytes
    var escaped = Uri.EscapeDataString(JsonConvert.SerializeObject(obj));
    // (3) every "%xx" escape is replaced by the character whose code point is 0xxx
    var multiByteEscaped = Regex.Replace(escaped, "%([0-9A-F]{2})", EncodeMultibyte);
    // (4) UTF-8 bytes, (5) Base64, and finally URL-encoding of the Base64 string
    return WebUtility.UrlEncode(Convert.ToBase64String(Encoding.UTF8.GetBytes(multiByteEscaped)));
}
Design of the current implementation

The intention of the original implementation is to convert a given set of query parameters for the web authenticators (web apps) into a url-safe string. The generated string is passed as a query parameter to the web authenticators (https://vault.bitwarden.com/webauthn-mobile-connector.html or https://vault.bitwarden.com/captcha-mobile-connector.html).

The callers of EncodeDataParameter, Bit.App.Pages.CaptchaProtectedViewModel.HandleCaptchaAsync and Bit.App.Pages.TwoFactorPageViewModel.Fido2AuthenticateAsync, place localized UI text into the parameter obj. Therefore, the implementation of EncodeDataParameter needs to safely convert multibyte characters in the message into a url-safe form, and the conversion must be consistent with the reverse conversion in the consumers.

For example, obj has a field btnText whose value is "WebAuthn の認証" when TwoFactorPageViewModel.Fido2AuthenticateAsync calls the method in the Japanese locale. The value of btnText corresponds to the Fido2AuthenticateWebAuthn entry in the resource file.
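To illustrate how the localized text ends up in obj, here is a hypothetical caller shape. Only btnText and the Fido2AuthenticateWebAuthn resource entry come from the actual code; the other field names, and the assumption that the resource is exposed as AppResources.Fido2AuthenticateWebAuthn, are made up for illustration:

// Hypothetical sketch of a caller feeding localized text into EncodeDataParameter.
var data = AppHelpers.EncodeDataParameter(new
{
    btnText = AppResources.Fido2AuthenticateWebAuthn, // "WebAuthn の認証" in the Japanese locale
    headerText = exampleHeaderText,                   // hypothetical additional field
    callbackUri = exampleCallbackUri                  // hypothetical additional field
});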
What is actually happening behind the scenes
Let's take a closer look at the first multibyte character in the btnText value, "の" (U+306E), as it passes through the method:

(1) Encoding into JSON preserves the character, because ECMA-404 does not require character escapes in string values except for " (double quotation mark) and \ (backslash).
(2) Uri.EscapeDataString internally calls System.UriHelper.EscapeStringToBuilder, which encodes the character into its UTF-8 byte sequence and then encodes each byte in the form "%xx".
    - The character U+306E is represented as the 3 bytes 0xE3, 0x81, 0xAE in UTF-8.
    - The byte sequence is therefore encoded into "%E3%81%AE".
(3) The invocation of Regex.Replace replaces each occurrence of the "%xx" form by calling EncodeMultibyte.
    - Convert.ToUInt32 in EncodeMultibyte converts the regex captures "E3", "81", and "AE" into 0xE3u, 0x81u, and 0xAEu, respectively.
    - Convert.ToChar converts the individual integer values into their corresponding Unicode code points: ã (U+00E3, LATIN SMALL LETTER A WITH TILDE), a non-printable control character (U+0081), and ® (U+00AE, REGISTERED SIGN), respectively.
(4) Encoding.UTF8.GetBytes returns the UTF-8 byte sequence representation of the resulting string. The characters ã, U+0081, and ® are converted into their byte sequences as follows:
    - ã (U+00E3) --> 0xC3, 0xA3
    - U+0081 --> 0xC2, 0x81
    - ® (U+00AE) --> 0xC2, 0xAE
(5) Convert.ToBase64String converts these 6 bytes into "w6PCgcKu".

The consumer-side implementation in apps/web/src/connectors/captcha.ts#L50 or apps/web/src/connectors/webauthn.ts#L88 reverses only steps (5), (4), and (1). Therefore, we get "ã®" (U+00E3, U+0081, U+00AE) instead of "の" (U+306E).
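For the curious, the mangling above can be reproduced with a few lines of stand-alone C#; this is an illustration of the walkthrough, not code from the app:

using System;
using System.Text;
using System.Text.RegularExpressions;

class MojibakeDemo
{
    static void Main()
    {
        // Step (2): percent-encode the UTF-8 bytes of "の".
        var escaped = Uri.EscapeDataString("の");            // "%E3%81%AE"

        // Step (3): turn each "%xx" escape into the character U+00xx.
        var mangled = Regex.Replace(escaped, "%([0-9A-F]{2})",
            m => Convert.ToChar(Convert.ToUInt32(m.Groups[1].Value, 16)).ToString());
        // mangled is now "\u00E3\u0081\u00AE" (ã, U+0081, ®) instead of "の".

        // Steps (4) and (5): UTF-8 bytes, then Base64.
        var b64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(mangled));
        Console.WriteLine(b64);                              // prints "w6PCgcKu"
    }
}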
Justification of the proposed fix

WebUtility.UrlEncode alone would be sufficient to keep the return value url-safe: it internally converts the given string into its UTF-8 byte sequence and then converts the individual bytes into url-safe characters. In practice, the Base64 encoding is also necessary for consistency with the decoding in the end consumer of the data.

In any case, the earlier steps (2) and (3) are not necessary for url-safety. Both WebUtility.UrlEncode and Convert.ToBase64String(Encoding.UTF8.GetBytes(str)) handle multibyte characters safely, and steps (1), (4), and (5) are exactly what the consumer-side implementation reverses. Steps (2) and (3) only make the encoded value incompatible with the decoding implementation. That is why this PR removes them.
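As a sanity check, the remaining steps round-trip multibyte text. The real consumers decode in TypeScript (URL decoding of the query parameter, Base64 decoding, then JSON.parse); the sketch below mirrors that decoding in C# and omits the JSON step for brevity:

using System;
using System.Net;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        var original = "WebAuthn の認証";

        // Encoding side with steps (2) and (3) removed: UTF-8 bytes, Base64, URL-encode.
        var encoded = WebUtility.UrlEncode(
            Convert.ToBase64String(Encoding.UTF8.GetBytes(original)));

        // Consumer side, mirrored in C#: URL-decode, Base64-decode, read the bytes as UTF-8.
        var decoded = Encoding.UTF8.GetString(
            Convert.FromBase64String(WebUtility.UrlDecode(encoded)));

        Console.WriteLine(decoded == original); // True
    }
}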
Comparison with ASCII characters

You might wonder why this issue did not show up for the earlier ASCII characters in the btnText string, or for messages in many other locales, e.g. US English. The reason is that for ASCII characters the UTF-8 byte sequence is identical to the Unicode code point value itself, so the combination of the extra steps (2) and (3) is effectively a no-op for them.

For example, the character 'W' (U+0057) is represented by the single byte 0x57 and is left untouched by step (2), since Uri.EscapeDataString does not escape unreserved ASCII characters. Even for an ASCII character that is escaped, such as the space (U+0020), which step (2) turns into "%20", EncodeMultibyte in step (3) converts the regex capture "20" into the integer 0x20u, and Convert.ToChar turns it right back into the original character.

The issue happens only when a string in the target JSON object contains characters whose UTF-8 byte sequence differs from the Unicode code point itself.
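A tiny stand-alone illustration of this difference (again, not code from the app):

using System;
using System.Text.RegularExpressions;

class AsciiVsMultibyteDemo
{
    // Applies only the extra steps (2) and (3) from the original implementation.
    static string Steps2And3(string s) =>
        Regex.Replace(Uri.EscapeDataString(s), "%([0-9A-F]{2})",
            m => Convert.ToChar(Convert.ToUInt32(m.Groups[1].Value, 16)).ToString());

    static void Main()
    {
        Console.WriteLine(Steps2And3("WebAuthn") == "WebAuthn"); // True: ASCII survives
        Console.WriteLine(Steps2And3("の") == "の");              // False: multibyte is mangled
    }
}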
Before you submit
Please check for formatting errors (dotnet format --verify-no-changes) (required)
Please add unit tests where it makes sense to do so (encouraged but not required)
If this change requires a documentation update - notify the documentation team
If this change has particular deployment requirements - notify the DevOps team