boostorg / gil

Boost.GIL - Generic Image Library | Requires C++14 since Boost 1.80
https://boostorg.github.io/gil
Boost Software License 1.0

fix wrong buffer size in path string conversion functions #746

Closed · striezel closed this 3 months ago

striezel commented 4 months ago

Description

Fixes #733.

Issue: The length calculation for the destination buffers in conversions from std::wstring to std::string or from wchar_t* to char* is incorrect, causing a buffer overflow in some cases.

Explanation: The current code uses wcslen() to size the destination buffer for the converted string. However, wcslen() is just strlen() for wchar_t* strings: it returns the number of wide characters in a wchar_t* string, and that does not have to be the same as the number of narrow characters (that is, char) in the converted string (char* or std::string). The two only happen to match when the wide string uses nothing but characters of the English alphabet, where each wide character is converted to exactly one narrow character.
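A minimal sketch to illustrate the mismatch (the string literal and the locale name are just examples; it assumes a UTF-8 locale is installed on the system):

```cpp
#include <clocale>
#include <cstdio>
#include <cwchar>

int main()
{
    // Assumption: a UTF-8 locale with this name is available.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    wchar_t const* wide = L"Привет";  // 6 Cyrillic characters

    // wcslen() counts wide characters ...
    std::printf("wide characters: %zu\n", std::wcslen(wide));  // prints 6

    // ... but each of these characters occupies 2 bytes in UTF-8, so a
    // buffer sized via wcslen() would overflow. std::wcsrtombs() with a
    // null destination reports the real byte count instead.
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = wide;
    std::printf("UTF-8 bytes: %zu\n",
                std::wcsrtombs(nullptr, &src, 0, &state));  // prints 12
}
```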

So the length has to be calculated properly. There are (at least) two possibilities; the one chosen in this pull request, calculating the length with std::wcsrtombs() before the actual conversion, is explained in the comments below.


simmplecoder commented 3 months ago

I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

striezel commented 3 months ago

> I don't know much about multibyte string conversions, but I will review it to the best of my ability tomorrow.

Just to explain the basics (this may not be completely accurate, but it should be enough to get a basic understanding): in this code there are two different kinds of strings: wide strings using wchar_t for character representation (used in std::wstring or C-style wchar_t* strings), and narrow strings using char for character representation (used by std::string or C-style char* strings). On Windows systems wchar_t is 16 bits and contains UTF-16 code units. On other systems it may be 32 bits and contain UTF-32, but in practice it is always more than 8 bits. In contrast, char has only 8 bits and therefore usually contains UTF-8 code units. (A narrow string could also contain ASCII, ISO-8859-1, etc., but listing and explaining all the possibilities would go too far here, so let's assume it is UTF-8.)

So when a string is converted from a wide string (wchar_t / UTF-16) to a narrow string (char / UTF-8), the two do not necessarily have the same number of code units. If a string only uses letters from the English alphabet, the wide and narrow strings can have the same length, because each wchar_t is converted to exactly one char. However, when the wide string uses characters from a more "exotic" alphabet, say Cyrillic or Chinese letters, the situation is different. A Cyrillic or Chinese letter needs only one code unit in UTF-16, that is, it can be represented by a single wchar_t. But in UTF-8 that same letter needs multiple code units (two or three), meaning it needs more than one char to be represented properly. That is why one wchar_t may be converted to more than one char, and the lengths of the two strings can differ after conversion. Therefore, the length of the buffer for the converted string has to be calculated before the conversion; otherwise it may be too short to hold the entire converted string.

Here, the length calculation is done by wcsrtombs(), and then the actual conversion is done by wcstombs().
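In code, the two steps look roughly like this (a minimal sketch, not the exact code from the pull request; the function name is made up, error handling is simplified, and it assumes the process locale can represent the characters, e.g. a UTF-8 locale):

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper illustrating the two-step approach.
std::string convert_to_narrow(std::wstring const& ws)
{
    std::mbstate_t state = std::mbstate_t();
    wchar_t const* src = ws.c_str();

    // Step 1: with a null destination, std::wcsrtombs() only computes the
    // number of narrow bytes the conversion would produce (without '\0').
    std::size_t const length = std::wcsrtombs(nullptr, &src, 0, &state);
    if (length == static_cast<std::size_t>(-1))
        throw std::runtime_error("wide-to-narrow conversion failed");

    // Step 2: convert into a buffer of exactly the calculated size.
    std::string result(length, '\0');
    std::wcstombs(&result[0], ws.c_str(), length);
    return result;
}
```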

simmplecoder commented 3 months ago

Thanks for the explanation. I suppose the easiest solution is to throw Boost.Locale at it to nuke the problem completely, but I am not sure if it is worth it. I suppose people can open their own streams and pass those if they want to do anything exotic.
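For comparison, the Boost.Locale route would presumably be a one-liner along these lines (a hedged sketch; boost::locale::conv::utf_to_utf from <boost/locale/encoding_utf.hpp> is the assumed entry point, and the function name here is invented):

```cpp
#include <boost/locale/encoding_utf.hpp>
#include <string>

// Hypothetical helper: delegate the wide-to-narrow conversion entirely
// to Boost.Locale instead of the C standard library functions.
std::string narrow_via_boost_locale(std::wstring const& ws)
{
    return boost::locale::conv::utf_to_utf<char>(ws);
}
```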

striezel commented 3 months ago

> Thanks for the commits, I learned a lot from reading the description. From the include chain I couldn't find #include <cwchar>, could you please add that? std::wcsrtombs seems to be declared there.

You are right. <cwchar> is where the function is declared, so I added it. (It's a bit surprising that the CI passed on all those GCC, Clang and MSVC variants without it.)

> I looked at std::codecvt as a possible C++ refactoring, but unfortunately it was disappointing.

Yes. As far as I remember, parts of that (std::wstring_convert and the <codecvt> header) were deprecated in C++17.
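For the record, the std::codecvt route would look something like this (a sketch of the deprecated approach, with an invented function name), which illustrates why it is unattractive for new code:

```cpp
#include <codecvt>  // header deprecated since C++17
#include <locale>
#include <string>

// Hypothetical helper using the deprecated facilities. Note that
// std::codecvt_utf8<wchar_t> converts UCS-2/UCS-4, not UTF-16, so
// characters outside the BMP are not handled correctly on Windows.
std::string narrow_via_codecvt(std::wstring const& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(ws);
}
```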

striezel commented 3 months ago

> I suppose the easiest solution is to throw Boost.Locale at it to nuke the problem completely, but I am not sure if it is worth it. I suppose people can open their own streams and pass those if they want to do anything exotic.

As far as I understand it, some parts of Boost.Locale also need the ICU library (International Components for Unicode), and that library is roughly an order of magnitude larger than Boost.Locale itself (comparing the file sizes of the installed libraries). Furthermore, GIL does not depend on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when a smaller solution is possible with the help of the standard library.

mloskot commented 3 months ago

@striezel

> Furthermore, GIL does not depend on Boost.Locale as of now, so this would potentially add two new dependencies to GIL. That kind of overhead may not be acceptable to some users of Boost.GIL, especially when a smaller solution is possible with the help of the standard library.

I agree, we should try to keep the list of required dependencies as short as possible. But I don't see any issue with introducing optional dependencies for features that can be controlled via a CMake option and conditional compilation. The only issue I see is the potential maintenance overhead.