Closed bold84 closed 1 week ago
Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known. When inserting/updating though, it is not so straightforward, as we don't have programmatic knowledge of the column data type in advance.
I'm thinking of adding another argument to "soci::use()" that lets the developer override the data type that's used for the underlying ODBC call.
Another issue is the currently non-existing N'' enclosure for unicode strings for MSSQL in case of soci::use().
Another issue is the stream interface. Currently std::wstring isn't supported and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB.
Please note that I updated the FreeBSD Image for Cirrus from 13.2 to 13.3.
I'm adding better UTF conversion first.
@vadz I extended the description; see "Limitation".
Maybe this can be an optional feature, similar to boost.
Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?
Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?
I was referring to the need to link against icu or iconv to have combining character support right away. But it's not necessary, if we can take care of the normalization later.
Sorry for getting so long to get back to this. I'm looking at whether we can merge it for SOCI 4.1.0 release and I wonder about the commented out tests here — does this mean that something still remains to be done or should we remove these tests for now and merge it nevertheless?
If so, I'll merge it with master, resolving the conflicts, and will merge it in if the CI jobs pass.
Regarding the commented out code: it tests a feature that works on Linux/Mac but not on Windows. On Unices, if I use the stream operator with wstring or string, it works. On windows, the input type has to match the column type (the one that's used internally). It seems like Microsoft's ODBC driver implicitly converts the string on Linux but not on Windows. I think it's a nice-to-have feature but it's not necessary.
I will merge master into this branch, so you have less work with it.
Thanks, I'm doing some final checks and I realize I have some more questions:
x_wchar
? I realize that it's symmetric to x_char
, but I just don't see what would it ever be useful for. I've never had to work with individual Unicode characters, only Unicode strings and a wchar_t
is not even capable of representing individual characters (if they require surrogates in UTF-16 under Windows). So maybe it would be simpler to just avoid introducing it?accumulate()
overload in ref_counted_statement_base
for wide strings. I wonder if we could find some way to avoid doing it, e.g. maybe adding to_string()
to exchange_traits
?soci-unicode.h
. Was there some particular reason to define them there or should we move the definitions for (some of) them to src/core/unicode.cpp
?std::wstring
for all the other backends, as I don't see how would you use it currently if you ever plan to use anything but SQL Server. I.e. I believe we should provide the default support for wstring
, converting it to UTF-8, and then override it with native handling in the ODBC backend, rather than not supporting it in all the other backends at all.What do you think?
Thanks, I'm doing some final checks and I realize I have some more questions:
- Do we really need to have
x_wchar
? I realize that it's symmetric tox_char
, but I just don't see what would it ever be useful for. I've never had to work with individual Unicode characters, only Unicode strings and awchar_t
is not even capable of representing individual characters (if they require surrogates in UTF-16 under Windows). So maybe it would be simpler to just avoid introducing it?
Not sure if it's really needed. There was a char, so I added a wchar_t. 🤷🏻♂️ But I agree with you, the value appears to be little.
- It's unfortunate to have to define a special
accumulate()
overload inref_counted_statement_base
for wide strings. I wonder if we could find some way to avoid doing it, e.g. maybe addingto_string()
toexchange_traits
?
I would have to look into that. But I can't promise when I can get to that. I have a full plate at the moment. 😕
- I'm not sure if we really want to have all the functions inline in
soci-unicode.h
. Was there some particular reason to define them there or should we move the definitions for (some of) them tosrc/core/unicode.cpp
?
Good point. I will do that today, as it doesn't take long.
- Last but the most important: I think we need to provide a fallback implementation of
std::wstring
for all the other backends, as I don't see how would you use it currently if you ever plan to use anything but SQL Server. I.e. I believe we should provide the default support forwstring
, converting it to UTF-8, and then override it with native handling in the ODBC backend, rather than not supporting it in all the other backends at all.What do you think?
I fully agree. But at the time, I didn't want to make this a larger PR than it already is (and I expected it to be). It's quite a lot of fun to get it all right on all 3 platforms, as you know. When do you plan to release 4.1?
Thanks, I can try to do (2) and do (3) myself, I just wanted to check if there was any reason not to.
As for 4.1, I wanted to make it a.s.a.p. just because I feel like I've already delayed it way too much...
Thanks, I can try to do (2) and do (3) myself, I just wanted to check if there was any reason not to.
As for 4.1, I wanted to make it a.s.a.p. just because I feel like I've already delayed it way too much...
Okay, then I will try to see if I can get 4 done using an LLM. I fed the relevant sources and the diff to 2 LLMs and one of the responses look very good for mysql. I will look into this later tonight.
Concerning (2): after looking at it again, I think this accumulate()
overload should be just removed. AFAICS it's only used in order to allow using session << wide_string
, but it doesn't really make sense to implicitly convert wide strings to UTF-8 here, this is inconsistent with the standard streams and hides an important conversion by making it implicit.
We want using wstring
, i.e. passing them to soci::use()
, to work, but I don't think we want to be able to use them directly in the query at all. Why would we want this?
Concerning (2): after looking at it again, I think this
accumulate()
overload should be just removed. AFAICS it's only used in order to allow usingsession << wide_string
, but it doesn't really make sense to implicitly convert wide strings to UTF-8 here, this is inconsistent with the standard streams and hides an important conversion by making it implicit.We want using
wstring
, i.e. passing them tosoci::use()
, to work, but I don't think we want to be able to use them directly in the query at all. Why would we want this?
So we don't have to convert them before using them. Naturally, there are always pros and cons. Consider all the code as an offer, I tried to make it as complete (for one backend) as possible.
Sorry, I still don't understand: under which circumstances would you have wide strings as part of the query instead of using them as parameters?
Anyhow, I've realized there is a worse problem here:
std::string str_out_utf8;
sql << "select wide_text from soci_test", into(str_out_utf8);
doesn't work correctly, the existing "MS SQL wide string" test only passes because it uses ASCII strings. Changing them to this:
std::wstring const str_in = L"Привет, SOCI!";
std::string const str_in_utf8 = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, SOCI!";
makes it fail, at least when using unixODBC SQL Server driver under Linux.
But actually I don't see anything that would even attempt to make this work, so I think it's just the test which is wrong and needs to be removed. I.e. there is no provision for retrieving SQL_WVARCHAR
columns as UTF-8-encoded strings at all there currently, is there?
Also, while doing (3) I've realized that this code looks like something extracted from another project as it's not written in SOCI style. What is its copyright/licence and can we include it in SOCI at all?
I've pushed my changes to #1179, but I still have important questions about the origin of the Unicode-related functions code.
Also, while doing (3) I've realized that this code looks like something extracted from another project as it's not written in SOCI style. What is its copyright/licence and can we include it in SOCI at all?
That is likely because of the influence that OpenAI's, Anthropic's and Alibaba's LLMs had on the code. 😉 I have not extracted any code from another project / source.
I ran the code through copyleaks.
Sorry, I still don't understand: under which circumstances would you have wide strings as part of the query instead of using them as parameters?
Anyhow, I've realized there is a worse problem here:
std::string str_out_utf8; sql << "select wide_text from soci_test", into(str_out_utf8);
doesn't work correctly, the existing "MS SQL wide string" test only passes because it uses ASCII strings. Changing them to this:
std::wstring const str_in = L"Привет, SOCI!"; std::string const str_in_utf8 = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, SOCI!";
makes it fail, at least when using unixODBC SQL Server driver under Linux.
Good catch. I think this was the first test I added and at the time it did not contain unicode strings yet, as I focussed only on wstring at first. So, I think this is an issue that should be fixed. I will look into this tomorrow (it's 1:38 here now).
OK, thanks, I think we could merge #1179 then and enhance it further in other PRs because I really believe that the other problem is just due to support for interoperability between SQL_WVARCHAR
and std::string
being not implemented at all, not a bug per se.
FYI, I did change/fix a couple of things in the Unicode functions (calling a void function is_xxx
is really confusing, and so is claiming that it returns something in its doc comments) and I'd like to change a few more things there too, but I think that we're perhaps going to switch to using ICU anyhow in the future, so I didn't want to spend time on this now.
1179
Thank you for the fixes. I think using ICU is definitely the better solution! Okay, I will add support for the other backends in new PRs.
Sorry, it's worse than I thought: it's broken under Windows when using native ODBC driver, see e.g. this build job. If even this doesn't work, we really can't merge it.
Sorry, it's worse than I thought: it's broken under Windows when using native ODBC driver, see e.g. this build job. If even this doesn't work, we really can't merge it.
Hmm... the tests did pass before. I am not sure what the cause is. I will have a look.
They passed before because they used 7 bit ASCII strings. I've modified them to use non-ASCII ones and checked that this still worked under Linux, but not under Windows -- where they fail.
1179
I see. I'm going to fix it now.
It's actually a limitation of the Microsoft compiler. A better explanation follows within the next 30 minutes.
When using wide string literals (L"…") in C++ on Windows, particularly with non-ASCII characters like Cyrillic, it's important to understand how these are interpreted:
Wide String and wchar_t
: On Windows, wchar_t
typically represents UTF-16 encoded values. However, direct use of wide string literals with non-ASCII characters could lead to misinterpretation if the source file's encoding and the compiler's expected encoding don't match.
Source File Encoding: The source file containing the code must be saved with an encoding that matches the expected encoding when compiled. Commonly, UTF-8 is used, but Windows by default might read it differently if not specified with BOM (Byte Order Mark) or compiler options.
Compiler Behavior: The Microsoft compiler (MSVC), for instance, may interpret characters in wide string literals based on the system's locale and code page if the encoding isn't explicitly defined, leading to misinterpretation of characters.
Usage of Unicode Escape Sequences: To ensure accurate representation irrespective of encoding issues, you can use Unicode escape sequences (e.g., L"\u041F"
) which define characters by their Unicode code point. This guarantees that characters are correctly represented as UTF-16, the intended wide character format for databases like MS SQL.
Database Consideration: Databases expecting UTF-16 encoding, like MS SQL for NVARCHAR columns, require that the application code correctly interprets and transmits the string in this format.
std::wstring const str_in = L"Привет, SOCI!";
leads to:
std::wstring const str_in = L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!";
leads to:
This is also the reason I used this method in the unicode tests.
Here are the updated versions of the test methods:
TEST_CASE("MS SQL wide string", "[odbc][mssql][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
// std::wstring const str_in = L"Привет, SOCI!";
std::wstring const str_in = L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!";
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::wstring str_out;
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out == str_in);
}
TEST_CASE("MS SQL wide string vector", "[odbc][mssql][vector][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
std::vector<std::wstring> const str_in = {
L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!", // Привет, SOCI!
L"\u041F\u0440\u0438\u0432\u0435\u0442, World!", // Привет, World!
L"\u041F\u0440\u0438\u0432\u0435\u0442, Universe!", // Привет, Universe!
L"\u041F\u0440\u0438\u0432\u0435\u0442, Galaxy!"}; // Привет, Galaxy!
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::vector<std::wstring> str_out(4);
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out.size() == str_in.size());
for (std::size_t i = 0; i != str_in.size(); ++i)
{
CHECK(str_out[i] == str_in[i]);
}
}
Result:
Strange, I thought we already used /utf-8
because I saw that the tests using L"..."
in test-unicode.cpp
passed, how comes it works there but not in this file?
But we can use \u
, of course, even if it's not nice...
I remember that I have had lots of trouble before on Windows because of this issue. MSVC doesn't seems to be compliant. GCC and Clang are. But that's probably not a surprise.
We're using L"..."
and \u
.
Generally, the following should work as well:
std::u16string const utf16 = u"Привет, SOCI!";
std::wstring str_in(utf16.begin(), utf16.end());
but it doesn't with MSVC.
I created another test (scroll right for comments):
TEST_CASE("MS SQL utf8-to-utf16 string", "[odbc][mssql][wstring]")
{
soci::session sql(backEnd, connectString);
wide_text_table_creator create_wide_text_table(sql);
std::string const utf8 = "Привет, SOCI!"; // works
//std::string const utf8 = u8"Привет, SOCI!"; // doesn't work
// std::string const utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\x53\x4F\x43\x49\x21"; // works
std::u16string utf16 = soci::details::utf8_to_utf16(utf8);
std::wstring str_in(utf16.begin(), utf16.end());
sql << "insert into soci_test(wide_text) values(:str)", use(str_in);
std::wstring str_out;
sql << "select wide_text from soci_test", into(str_out);
CHECK(str_out == str_in);
}
OK, I figured it out: the tests in test-unicode.cpp
are broken too, it's just that they still pass because the LHS and RHS of comparisons are broken in the same way, so the tests are still true.
Also, the fact that scalar ODBC wstring test and the first 2 checks in the vector test pass is due to the fact that the column size is 40 and when interpreting the file as CP1252 and not UTF-8 (as MSVC did), the last 2 strings became longer than 40 characters, while the first 2 ones still fit.
Finally, while debugging this, I've reread the code more carefully and I see many redundant comments and other things which don't make sense (why have utf16_to_wide()
only to never use it and duplicate what it does in the code which needs it?). I realize now that it must have been generated by an LLM and while I'm not categorically against using LLMs, like some other people, I am strongly against not reviewing their results carefully. I'll fix them myself for this PR but I really hope you can do it if you keep using them.
It's really not great that the more I look at this code, the more problems I find in it :-(
Dividing the column size by sizeof(SQLWCHAR)
is wrong, the column size is in characters, not in bytes.
Great that you were able to figure this out. I wasn't able to back then. And due to posts on numerous websites (StackOverflow and others) suggesting an issue with MSVC while the builds with GCC and Clang worked, I ended up believing that it's really MSVC. I took a wrong turn when I should have dug deeper. Clearly a mistake by me. 🤷🏻♂️
The utf16_to_wide
was used before but the consuming code was removed.
Similar to how utf8_to_wide
usage was removed when you removed the accumulate overload.
I'm much more aware of (and experienced with) the issues that come with LLM generated code than I was when most of this code was created.
It's a large change/addition, created over a period of 1.5 years with several long interruptions. I would be surprised if you didn't find any issue in it. Without LLM usage, you wouldn't have seen a "division by sizeof(SQLWCHAR)
" though. 😅
I've removed most of unnecessary LLM-generated stuff and made conversions more efficient to avoid copying potentially huge strings multiple times.
There is still a lot of completely unused code remaining, notably conversions from UTF-8 and also one of UTF-16/32 conversions to it. I'm not sure if it should be removed or preserved to be used later for wstring
support in the other backends, for now I'm leaving it here just because I already spent vastly more time on this than I ever planned to.
I'm also closing this PR and will merge #1179 if nobody sees any other problems with it.
This pull request adds comprehensive support for wide strings (
wchar_t
,std::wstring
) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server'sNVARCHAR
andNTEXT
. This enhancement is crucial for applications that require robust handling of international and multi-language data.Key Changes:
Introduced
exchange_type_traits
andexchange_traits
Specializations:Updated ODBC Backend:
wchar_t
andstd::wstring
.Enhanced Buffer Management:
Improved Unicode Support:
Extended Test Coverage:
Notes:
This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.
Example usage
Here are a few examples showing how the new wide string features can be used with the ODBC backend.
Example 1: Handling
std::wstring
in SQL QueriesInserting and Selecting
std::wstring
DataExample 2: Working with
wchar_t
VectorsInserting and Selecting Wide Characters
Example 3: Using
std::wstring
with thesql
Stream OperatorInserting and Selecting
std::wstring
Data with Stream OperatorIn this example:
soci::session
object is created to connect to the database.NVARCHAR
column.std::wstring
is defined for insertion.sql
stream operator is used to insert thestd::wstring
into the database. Note the use ofN'
to indicate a Unicode string in SQL Server.std::wstring
is retrieved from the database using thesql
stream operator and thesoci::into
function.std::wcout
.These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (
wchar_t
,std::wstring
).Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated separately instead of being combined with their base characters. This limitation may result in incorrect conversions for strings containing combining characters. A potential solution would be to incorporate Unicode normalization before the conversion process to ensure that combining characters are properly combined with their base characters. Unicode defines several normalization forms (e.g., NFC, NFD, NFKC, NFKD), each with its own set of rules and behaviors. Choosing the appropriate normalization form is crucial, as different forms may produce different results.
To have full Unicode support, linking against a library like ICU or iconv is necessary. It can be made optional.
Disclaimer: This text is partially AI generated.