fmtlib / fmt

A modern formatting library
https://fmt.dev
Other
20.17k stars 2.45k forks source link

Formatting of strings that contain characters with code higher than 0x7f is broken after v6.1.2 #2858

Closed mmahnic closed 2 years ago

mmahnic commented 2 years ago

If a std::string contains characters with the high bit set, formatting to N places creates strings that are too large. This used to work correctly in version 6.1.2. At least with the Visual C++ 2022 compiler, the behavior of fmt::format is different than the behavior of std::format.

It looks like the formatter is trying to treat the input sequence as UTF-8. Although the standard explicitly states that the calculation of the width of strings without encoding is unspecified, and every implementation could use a different algorithm, I think the formatter should do as little as possible and not interpret the contents of a std::string, because it can contain any encoding. Interpretation should be done only for strings where the encoding is explicitly defined by the character type.

Here are some examples followed by the output on a Windows machine with the C locale.

#include <iostream>
#include <format>
#include <fmt/format.h>

int main()
{
   auto expect([](const std::string& expected, const std::string& actual) {
      auto hex([](const std::string& s) {
         std::string r;
         for (auto ch : s)
            r.append(std::format("{:02x}", (unsigned char)ch));
         return r;
         });
      if (expected != actual)
         std::cout << std::format("expected: {}, actual: {}\n",
            hex(expected), hex(actual));
      else std::cout << hex(actual) << " ok\n";
      });

   std::cout << "std/a: ";
   expect("**a", std::format("{:*>3}", "a"));

   std::cout << "std/c: ";
   expect("*\xC4\x8D", std::format("{:*>3}", "\xC4\x8D"));

   std::cout << "std/x: ";
   expect("*\x8D\x8D", std::format("{:*>3}", "\x8D\x8D"));

   std::cout << "fmt/a: ";
   expect("**a", fmt::format("{:*>3}", "a"));

   std::cout << "fmt/c: "; // valid utf-8 sequence, +1 byte (*)
   expect("*\xC4\x8D", fmt::format("{:*>3}", "\xC4\x8D"));

   std::cout << "fmt/x: "; // invalid utf-8 sequence, +2 bytes (**)
   expect("*\x8D\x8D", fmt::format("{:*>3}", "\x8D\x8D"));
}
std/a: 2a2a61 ok
std/c: 2ac48d ok
std/x: 2a8d8d ok
fmt/a: 2a2a61 ok
fmt/c: expected: 2ac48d, actual: 2a2ac48d
fmt/x: expected: 2a8d8d, actual: 2a2a2a8d8d
mmahnic commented 2 years ago

I have a suggestion for a future version that is related to this issue.

Formatting could be used in different scenarios, not only for displaying strings on a terminal. I think it would be useful if it would be possible to specify what the format width means. The standard specifies that for Unicode strings the width should be calculated from extended grapheme clusters, which is useful for displaying. But in other scenarios, it might be more useful to specify the width in bytes or code points.

For example:

// Format to 3 bytes
fmt::format("{:*>3b}", u8string("\xC4\x8D")) // 2a c4 8d

// Format to 3 code points (*, LATIN SMALL LETTER C WITH CARON, CARON)
fmt::format("{:*>3c}", u8string("\xC4\x8D\xCB\x87")) // 2a c4 8d cb 87

// Format to 3 extended grapheme clusters; this is the default.
fmt::format("{:*>3g}", u8string("\xC4\x8D\xCB\x87")) // 2a 2a c4 8d cb 87
vitaut commented 2 years ago

{fmt} assumes UTF-8 encoding by default. If you want to opt into a legacy encoding/binary wrap the argument in fmt::bytes, e.g.

auto s = fmt::format("{:*>3}", fmt::bytes("\xC4\x8D"));
mmahnic commented 2 years ago

This is unfortunate. We'd have to revise the occurrences of fmt::format in our whole codebase, around 500k lines. For now, we will keep using 6.1.2, then we'll move to C++20 and std::format where it looks like no encoding is assumed for raw strings and std::string.