Support for Wide Strings in SOCI for Enhanced Unicode Handling

bold84 commented 8 months ago

This pull request adds comprehensive support for wide strings (wchar_t, std::wstring) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server's NVARCHAR and NTEXT. This enhancement is crucial for applications that require robust handling of international and multi-language data.

Key Changes:

Introduced exchange_type_traits and exchange_traits Specializations:
- These specializations facilitate the handling of wide strings during type exchange, ensuring proper conversion and management within the SOCI library.
Updated ODBC Backend:
- Added support for wide strings, specifically for wchar_t and std::wstring.
- Adjusted the parameter binding and data retrieval mechanisms to correctly process wide characters.
Enhanced Buffer Management:
- Modified buffer allocation and management to accommodate the larger size of wide characters, which are essential for proper Unicode support.
- Implemented logic to handle buffer size overflow, ensuring safety and stability when processing large text data.
Improved Unicode Support:
- Incorporated routines to convert between different Unicode encodings (UTF-16 and UTF-32 on Unix-like systems, native UTF-16 on Windows) to handle wide strings properly across various platforms.
Extended Test Coverage:
- Added comprehensive tests focusing on wide string handling, especially ensuring compatibility with SQL Server.
- Included edge cases for large strings to test buffer management and overflow handling.

Notes:

The focus of this pull request is on the ODBC backend.
While the original pull request mentioned a focus on C++17 standards, this has been removed to maintain compatibility with earlier C++ versions, ensuring broader usability of these enhancements.
This work sets a foundational framework to extend wide string support to other database backends in SOCI in future updates.

This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.

Example usage

Here are a few examples showing how the new wide string features can be used with the ODBC backend.

Example 1: Handling `std::wstring` in SQL Queries

Inserting and Selecting `std::wstring` Data

#include <soci.h>
#include <soci-odbc.h>
#include <iostream>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";

        // Insert the wstring
        sql << "INSERT INTO soci_test(wide_text) VALUES (:wide_text)", soci::use(wide_str_in);

        // Retrieve the wstring
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 2: Working with `wchar_t` Vectors

Inserting and Selecting Wide Characters

#include <soci.h>
#include <soci-odbc.h>
#include <iostream>
#include <vector>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_char NCHAR(2) NULL)";

        // Define a vector of wide characters to insert
        std::vector<wchar_t> wide_chars_in = {L'A', L'B', L'C', L'D'};

        // Insert the wide characters
        sql << "INSERT INTO soci_test(wide_char) VALUES (:wide_char)", soci::use(wide_chars_in);

        // Retrieve the wide characters
        std::vector<wchar_t> wide_chars_out(4);
        sql << "SELECT wide_char FROM soci_test WHERE id IN (1, 2, 3, 4)", soci::into(wide_chars_out);

        // Output the retrieved wide characters
        for (wchar_t ch : wide_chars_out)
        {
            std::wcout << L"Retrieved wide char: " << ch << std::endl;
        }
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

Example 3: Using `std::wstring` with the `sql` Stream Operator

Inserting and Selecting `std::wstring` Data with Stream Operator

#include <soci.h>
#include <soci-odbc.h>
#include <iostream>

int main()
{
    try
    {
        soci::session sql(soci::odbc, "DSN=my_datasource;UID=user;PWD=password");

        // Create table with NVARCHAR column
        sql << "CREATE TABLE soci_test (id INT IDENTITY PRIMARY KEY, wide_text NVARCHAR(40) NULL)";

        // Define a wstring to insert
        std::wstring wide_str_in = L"Hello, 世界!";

        // Use stream operator to insert the wstring
        sql << "INSERT INTO soci_test(wide_text) VALUES (N'" << wide_str_in << "')";

        // Retrieve the wstring using stream operator
        std::wstring wide_str_out;
        sql << "SELECT wide_text FROM soci_test WHERE id = 1", soci::into(wide_str_out);

        // Output the retrieved wstring
        std::wcout << L"Retrieved wide string: " << wide_str_out << std::endl;
    }
    catch (const soci::soci_error& e)
    {
        std::cerr << "Error: " << e.what() << std::endl;
    }

    return 0;
}

In this example:

A soci::session object is created to connect to the database.
A table is created with an NVARCHAR column.
A std::wstring is defined for insertion.
The sql stream operator is used to insert the std::wstring into the database. Note the use of N' to indicate a Unicode string in SQL Server.
The std::wstring is retrieved from the database using the sql stream operator and the soci::into function.
The retrieved wide string is printed to the console using std::wcout.

These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (wchar_t, std::wstring).

Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated separately instead of being combined with their base characters. This limitation may result in incorrect conversions for strings containing combining characters. A potential solution would be to incorporate Unicode normalization before the conversion process to ensure that combining characters are properly combined with their base characters. Unicode defines several normalization forms (e.g., NFC, NFD, NFKC, NFKD), each with its own set of rules and behaviors. Choosing the appropriate normalization form is crucial, as different forms may produce different results.

To have full Unicode support, linking against a library like ICU or iconv is necessary. It can be made optional.

Disclaimer: This text is partially AI generated.

bold84 commented 8 months ago

Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known. When inserting/updating though, it is not so straightforward, as we don't have programmatic knowledge of the column data type in advance.

I'm thinking of adding another argument to "soci::use()" that lets the developer override the data type that's used for the underlying ODBC call.

Another issue is the currently non-existing N'' enclosure for unicode strings for MSSQL in case of soci::use().

Another issue is the stream interface. Currently std::wstring isn't supported and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB.

bold84 commented 5 months ago

Please note that I updated the FreeBSD Image for Cirrus from 13.2 to 13.3.

https://github.com/cirruslabs/cirrus-ci-docs/pull/1277

bold84 commented 5 months ago

I'm adding better UTF conversion first.

bold84 commented 4 months ago

@vadz I extended the description; see "Limitation".

Maybe this can be an optional feature, similar to boost.

vadz commented 4 months ago

Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?

bold84 commented 4 months ago

Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason to not enable this unconditionally for people who need it, am I missing something?

I was referring to the need to link against icu or iconv to have combining character support right away. But it's not necessary, if we can take care of the normalization later.

vadz commented 1 week ago

Sorry for getting so long to get back to this. I'm looking at whether we can merge it for SOCI 4.1.0 release and I wonder about the commented out tests here — does this mean that something still remains to be done or should we remove these tests for now and merge it nevertheless?

If so, I'll merge it with master, resolving the conflicts, and will merge it in if the CI jobs pass.

bold84 commented 1 week ago

Regarding the commented out code: it tests a feature that works on Linux/Mac but not on Windows. On Unices, if I use the stream operator with wstring or string, it works. On windows, the input type has to match the column type (the one that's used internally). It seems like Microsoft's ODBC driver implicitly converts the string on Linux but not on Windows. I think it's a nice-to-have feature but it's not necessary.

I will merge master into this branch, so you have less work with it.

vadz commented 1 week ago

Thanks, I'm doing some final checks and I realize I have some more questions:

Do we really need to have x_wchar? I realize that it's symmetric to x_char, but I just don't see what would it ever be useful for. I've never had to work with individual Unicode characters, only Unicode strings and a wchar_t is not even capable of representing individual characters (if they require surrogates in UTF-16 under Windows). So maybe it would be simpler to just avoid introducing it?
It's unfortunate to have to define a special accumulate() overload in ref_counted_statement_base for wide strings. I wonder if we could find some way to avoid doing it, e.g. maybe adding to_string() to exchange_traits?
I'm not sure if we really want to have all the functions inline in soci-unicode.h. Was there some particular reason to define them there or should we move the definitions for (some of) them to src/core/unicode.cpp?
Last but the most important: I think we need to provide a fallback implementation of std::wstring for all the other backends, as I don't see how would you use it currently if you ever plan to use anything but SQL Server. I.e. I believe we should provide the default support for wstring, converting it to UTF-8, and then override it with native handling in the ODBC backend, rather than not supporting it in all the other backends at all.

What do you think?

bold84 commented 1 week ago

Thanks, I'm doing some final checks and I realize I have some more questions:

Do we really need to have x_wchar? I realize that it's symmetric to x_char, but I just don't see what would it ever be useful for. I've never had to work with individual Unicode characters, only Unicode strings and a wchar_t is not even capable of representing individual characters (if they require surrogates in UTF-16 under Windows). So maybe it would be simpler to just avoid introducing it?

Not sure if it's really needed. There was a char, so I added a wchar_t. 🤷🏻‍♂️ But I agree with you, the value appears to be little.

It's unfortunate to have to define a special accumulate() overload in ref_counted_statement_base for wide strings. I wonder if we could find some way to avoid doing it, e.g. maybe adding to_string() to exchange_traits?

I would have to look into that. But I can't promise when I can get to that. I have a full plate at the moment. 😕

I'm not sure if we really want to have all the functions inline in soci-unicode.h. Was there some particular reason to define them there or should we move the definitions for (some of) them to src/core/unicode.cpp?

Good point. I will do that today, as it doesn't take long.

Last but the most important: I think we need to provide a fallback implementation of std::wstring for all the other backends, as I don't see how would you use it currently if you ever plan to use anything but SQL Server. I.e. I believe we should provide the default support for wstring, converting it to UTF-8, and then override it with native handling in the ODBC backend, rather than not supporting it in all the other backends at all.

What do you think?

I fully agree. But at the time, I didn't want to make this a larger PR than it already is (and I expected it to be). It's quite a lot of fun to get it all right on all 3 platforms, as you know. When do you plan to release 4.1?

vadz commented 1 week ago

Thanks, I can try to do (2) and do (3) myself, I just wanted to check if there was any reason not to.

As for 4.1, I wanted to make it a.s.a.p. just because I feel like I've already delayed it way too much...

bold84 commented 1 week ago

Thanks, I can try to do (2) and do (3) myself, I just wanted to check if there was any reason not to.

As for 4.1, I wanted to make it a.s.a.p. just because I feel like I've already delayed it way too much...

Okay, then I will try to see if I can get 4 done using an LLM. I fed the relevant sources and the diff to 2 LLMs and one of the responses look very good for mysql. I will look into this later tonight.

vadz commented 1 week ago

Concerning (2): after looking at it again, I think this accumulate() overload should be just removed. AFAICS it's only used in order to allow using session << wide_string, but it doesn't really make sense to implicitly convert wide strings to UTF-8 here, this is inconsistent with the standard streams and hides an important conversion by making it implicit.

We want using wstring, i.e. passing them to soci::use(), to work, but I don't think we want to be able to use them directly in the query at all. Why would we want this?

bold84 commented 1 week ago

Concerning (2): after looking at it again, I think this accumulate() overload should be just removed. AFAICS it's only used in order to allow using session << wide_string, but it doesn't really make sense to implicitly convert wide strings to UTF-8 here, this is inconsistent with the standard streams and hides an important conversion by making it implicit.

We want using wstring, i.e. passing them to soci::use(), to work, but I don't think we want to be able to use them directly in the query at all. Why would we want this?

So we don't have to convert them before using them. Naturally, there are always pros and cons. Consider all the code as an offer, I tried to make it as complete (for one backend) as possible.

vadz commented 1 week ago

Sorry, I still don't understand: under which circumstances would you have wide strings as part of the query instead of using them as parameters?

Anyhow, I've realized there is a worse problem here:

  std::string str_out_utf8;
  sql << "select wide_text from soci_test", into(str_out_utf8);

doesn't work correctly, the existing "MS SQL wide string" test only passes because it uses ASCII strings. Changing them to this:

  std::wstring const str_in = L"Привет, SOCI!";
  std::string const str_in_utf8 = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, SOCI!";

makes it fail, at least when using unixODBC SQL Server driver under Linux.

vadz commented 1 week ago

But actually I don't see anything that would even attempt to make this work, so I think it's just the test which is wrong and needs to be removed. I.e. there is no provision for retrieving SQL_WVARCHAR columns as UTF-8-encoded strings at all there currently, is there?

vadz commented 1 week ago

Also, while doing (3) I've realized that this code looks like something extracted from another project as it's not written in SOCI style. What is its copyright/licence and can we include it in SOCI at all?

vadz commented 1 week ago

I've pushed my changes to #1179, but I still have important questions about the origin of the Unicode-related functions code.

bold84 commented 1 week ago

Also, while doing (3) I've realized that this code looks like something extracted from another project as it's not written in SOCI style. What is its copyright/licence and can we include it in SOCI at all?

That is likely because of the influence that OpenAI's, Anthropic's and Alibaba's LLMs had on the code. 😉 I have not extracted any code from another project / source.

bold84 commented 1 week ago

I ran the code through copyleaks.

CleanShot 2024-11-16 at 01 31 44@2x

soci-unicode.h - report.pdf

bold84 commented 1 week ago

Sorry, I still don't understand: under which circumstances would you have wide strings as part of the query instead of using them as parameters?

Anyhow, I've realized there is a worse problem here:
  std::string str_out_utf8;
  sql << "select wide_text from soci_test", into(str_out_utf8);
doesn't work correctly, the existing "MS SQL wide string" test only passes because it uses ASCII strings. Changing them to this:
  std::wstring const str_in = L"Привет, SOCI!";
  std::string const str_in_utf8 = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, SOCI!";
makes it fail, at least when using unixODBC SQL Server driver under Linux.

Good catch. I think this was the first test I added and at the time it did not contain unicode strings yet, as I focussed only on wstring at first. So, I think this is an issue that should be fixed. I will look into this tomorrow (it's 1:38 here now).

vadz commented 1 week ago

OK, thanks, I think we could merge #1179 then and enhance it further in other PRs because I really believe that the other problem is just due to support for interoperability between SQL_WVARCHAR and std::string being not implemented at all, not a bug per se.

FYI, I did change/fix a couple of things in the Unicode functions (calling a void function is_xxx is really confusing, and so is claiming that it returns something in its doc comments) and I'd like to change a few more things there too, but I think that we're perhaps going to switch to using ICU anyhow in the future, so I didn't want to spend time on this now.

bold84 commented 1 week ago

1179

Thank you for the fixes. I think using ICU is definitely the better solution! Okay, I will add support for the other backends in new PRs.

vadz commented 1 week ago

Sorry, it's worse than I thought: it's broken under Windows when using native ODBC driver, see e.g. this build job. If even this doesn't work, we really can't merge it.

bold84 commented 1 week ago

Sorry, it's worse than I thought: it's broken under Windows when using native ODBC driver, see e.g. this build job. If even this doesn't work, we really can't merge it.

Hmm... the tests did pass before. I am not sure what the cause is. I will have a look.

vadz commented 1 week ago

They passed before because they used 7 bit ASCII strings. I've modified them to use non-ASCII ones and checked that this still worked under Linux, but not under Windows -- where they fail.

bold84 commented 1 week ago

1179

I see. I'm going to fix it now.

bold84 commented 1 week ago

It's actually a limitation of the Microsoft compiler. A better explanation follows within the next 30 minutes.

bold84 commented 1 week ago

When using wide string literals (L"…") in C++ on Windows, particularly with non-ASCII characters like Cyrillic, it's important to understand how these are interpreted:

Wide String and wchar_t: On Windows, wchar_t typically represents UTF-16 encoded values. However, direct use of wide string literals with non-ASCII characters could lead to misinterpretation if the source file's encoding and the compiler's expected encoding don't match.
Source File Encoding: The source file containing the code must be saved with an encoding that matches the expected encoding when compiled. Commonly, UTF-8 is used, but Windows by default might read it differently if not specified with BOM (Byte Order Mark) or compiler options.
Compiler Behavior: The Microsoft compiler (MSVC), for instance, may interpret characters in wide string literals based on the system's locale and code page if the encoding isn't explicitly defined, leading to misinterpretation of characters.
Usage of Unicode Escape Sequences: To ensure accurate representation irrespective of encoding issues, you can use Unicode escape sequences (e.g., L"\u041F") which define characters by their Unicode code point. This guarantees that characters are correctly represented as UTF-16, the intended wide character format for databases like MS SQL.
Database Consideration: Databases expecting UTF-16 encoding, like MS SQL for NVARCHAR columns, require that the application code correctly interprets and transmits the string in this format.

std::wstring const str_in = L"Привет, SOCI!";

leads to:

CleanShot 2024-11-16 at 23 09 17@2x

std::wstring const str_in = L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!";

leads to:

CleanShot 2024-11-16 at 23 10 34@2x

This is also the reason I used this method in the unicode tests.

Here are the updated versions of the test methods:

TEST_CASE("MS SQL wide string", "[odbc][mssql][wstring]")
{
  soci::session sql(backEnd, connectString);

  wide_text_table_creator create_wide_text_table(sql);

  // std::wstring const str_in = L"Привет, SOCI!";
  std::wstring const str_in = L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!";

  sql << "insert into soci_test(wide_text) values(:str)", use(str_in);

  std::wstring str_out;
  sql << "select wide_text from soci_test", into(str_out);

  CHECK(str_out == str_in);
}

TEST_CASE("MS SQL wide string vector", "[odbc][mssql][vector][wstring]")
{
  soci::session sql(backEnd, connectString);

  wide_text_table_creator create_wide_text_table(sql);

  std::vector<std::wstring> const str_in = {
      L"\u041F\u0440\u0438\u0432\u0435\u0442, SOCI!", // Привет, SOCI!
      L"\u041F\u0440\u0438\u0432\u0435\u0442, World!", // Привет, World!
      L"\u041F\u0440\u0438\u0432\u0435\u0442, Universe!", // Привет, Universe!
      L"\u041F\u0440\u0438\u0432\u0435\u0442, Galaxy!"}; // Привет, Galaxy!

  sql << "insert into soci_test(wide_text) values(:str)", use(str_in);

  std::vector<std::wstring> str_out(4);

  sql << "select wide_text from soci_test", into(str_out);

  CHECK(str_out.size() == str_in.size());
  for (std::size_t i = 0; i != str_in.size(); ++i)
  {
    CHECK(str_out[i] == str_in[i]);
  }
}

Result:

CleanShot 2024-11-16 at 23 13 53@2x

bold84 commented 1 week ago

https://github.com/SOCI/soci/commit/c00baa41cf7d7ffbb301b6e99fbdcd98af74dbec

vadz commented 1 week ago

Strange, I thought we already used /utf-8 because I saw that the tests using L"..." in test-unicode.cpp passed, how comes it works there but not in this file?

But we can use \u, of course, even if it's not nice...

bold84 commented 1 week ago

I remember that I have had lots of trouble before on Windows because of this issue. MSVC doesn't seems to be compliant. GCC and Clang are. But that's probably not a surprise.

bold84 commented 1 week ago

We're using L"..." and \u.

Generally, the following should work as well:

std::u16string const utf16 = u"Привет, SOCI!";
std::wstring str_in(utf16.begin(), utf16.end());

but it doesn't with MSVC.

I created another test (scroll right for comments):

 TEST_CASE("MS SQL utf8-to-utf16 string", "[odbc][mssql][wstring]")
{
  soci::session sql(backEnd, connectString);

  wide_text_table_creator create_wide_text_table(sql);

  std::string const utf8 = "Привет, SOCI!";                                                                   // works
  //std::string const utf8 = u8"Привет, SOCI!";                                                               // doesn't work
  // std::string const utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\x2C\x20\x53\x4F\x43\x49\x21"; // works
  std::u16string utf16 = soci::details::utf8_to_utf16(utf8);
  std::wstring str_in(utf16.begin(), utf16.end());

  sql << "insert into soci_test(wide_text) values(:str)", use(str_in);

  std::wstring str_out;
  sql << "select wide_text from soci_test", into(str_out);

  CHECK(str_out == str_in);
}

vadz commented 1 week ago

OK, I figured it out: the tests in test-unicode.cpp are broken too, it's just that they still pass because the LHS and RHS of comparisons are broken in the same way, so the tests are still true.

Also, the fact that scalar ODBC wstring test and the first 2 checks in the vector test pass is due to the fact that the column size is 40 and when interpreting the file as CP1252 and not UTF-8 (as MSVC did), the last 2 strings became longer than 40 characters, while the first 2 ones still fit.

Finally, while debugging this, I've reread the code more carefully and I see many redundant comments and other things which don't make sense (why have utf16_to_wide() only to never use it and duplicate what it does in the code which needs it?). I realize now that it must have been generated by an LLM and while I'm not categorically against using LLMs, like some other people, I am strongly against not reviewing their results carefully. I'll fix them myself for this PR but I really hope you can do it if you keep using them.

vadz commented 1 week ago

It's really not great that the more I look at this code, the more problems I find in it :-(

Dividing the column size by sizeof(SQLWCHAR) is wrong, the column size is in characters, not in bytes.

bold84 commented 1 week ago

Great that you were able to figure this out. I wasn't able to back then. And due to posts on numerous websites (StackOverflow and others) suggesting an issue with MSVC while the builds with GCC and Clang worked, I ended up believing that it's really MSVC. I took a wrong turn when I should have dug deeper. Clearly a mistake by me. 🤷🏻‍♂️

The utf16_to_wide was used before but the consuming code was removed. Similar to how utf8_to_wide usage was removed when you removed the accumulate overload.

I'm much more aware of (and experienced with) the issues that come with LLM generated code than I was when most of this code was created. It's a large change/addition, created over a period of 1.5 years with several long interruptions. I would be surprised if you didn't find any issue in it. Without LLM usage, you wouldn't have seen a "division by sizeof(SQLWCHAR)" though. 😅

vadz commented 1 week ago

I've removed most of unnecessary LLM-generated stuff and made conversions more efficient to avoid copying potentially huge strings multiple times.

There is still a lot of completely unused code remaining, notably conversions from UTF-8 and also one of UTF-16/32 conversions to it. I'm not sure if it should be removed or preserved to be used later for wstring support in the other backends, for now I'm leaving it here just because I already spent vastly more time on this than I ever planned to.

I'm also closing this PR and will merge #1179 if nobody sees any other problems with it.

SOCI / soci