UTF-16 support for ODBC (for MSSQL)

bold84 commented 1 year ago

This pull request adds support for UTF-16 encoding in the ODBC module. The module previously only supported UTF-8 encoding, which made it difficult to exchange data with non-UTF-8 MSSQL databases properly.

With this new feature, the ODBC module can now properly exchange data with MSSQL databases that use UTF-16 encoding. This is especially important for users who work with databases that use different encodings or require internationalization support.

The implementation uses std::wstring_convert to convert between UTF-8 and UTF-16 encodings. The toUtf16() and toUtf8() functions have been added to handle the conversion between the two encodings.

There seem to have been issues in the past:

https://github.com/SOCI/soci/issues/164 https://github.com/SOCI/soci/issues/179 https://github.com/SOCI/soci/issues/1111

bold84 commented 6 months ago

Thanks for working on this, but I don't know how do I feel about introducing a separate build variant for Unicode support. IMO it would be really better to just use UTF-8 in all builds, instead of requiring a special build mode to handle Unicode.

And it would be really nice to have some description of this option in the docs, if only to explain what does enabling it change.

Finally, even if this is relatively trivial, it looks like the use of SQLTCHAR could avoid some preprocessor checks in the code.

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

The changes in this PR were the only way I could figure out to prevent reading jibberish:

Perhaps you have a better suggestion on how to solve this problem. I'd be glad to implement a more elegant solution.

bold84 commented 6 months ago

maybe connection_parameters can be abused to decide at runtime what type of char / string is stored in the database.

I think it might also be possible to SQLDescribeCol and determine at runtime then whether to convert back and forth. But the performance penalty is unacceptable.

Or an additional type could be introduced:

ODBC Data Type	SOCI Data Type (`db_type`)	`row::get<T>` specializations
SQLWCHAR, SQLWVARCHAR	db_wstring	std::string

The user facing interface would then still only support UTF-8 (note: no std::wstring).

vadz commented 6 months ago

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

I see, thanks.

Perhaps you have a better suggestion on how to solve this problem.

We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.

Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.

The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of how we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this and I'd rather avoid it.

bold84 commented 6 months ago

The problem I have is that the data has been written already with another application and I have no influence on that. The database(s) are not UTF-8.

I see, thanks.

Perhaps you have a better suggestion on how to solve this problem.

We do need to add SQLWCHAR support to be able to work with the existing database using UTF-16, but I think it should be available in addition to UTF-8 support with plain SQLCHAR, with the decision about which one to use being performed at run-time rather than compile-time.

Ideally, this should be automatic, i.e. when exchanging data with SQLWCHAR columns, UTF-16 should be read from/written to them, while the same code should use UTF-8 when working with SQLCHAR columns, but I don't know if it's possible to implement this easily, so perhaps we need a new db_wstring type instead. If we could make this work automatically, it would be really great, however.

The main point is that I'd really, really love to avoid different incompatible builds. The conditional compilation directives in the tests are a good example of how we do not want the code using SOCI to look like. And there are other problems, e.g. we'd need to add wide char builds to the CI too if we do it like this and I'd rather avoid it.

Okay, I think we're on the same page. Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

Krzmbrzl commented 6 months ago

Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

bold84 commented 6 months ago

Also note that wstring_convert has been deprecated and scheduled for removal from the standard (without replacement). It's a shame, but either way everyone here should be aware of this 👀

That could temporarily (until there's a replacement) be solved with icu, libiconv or alike. But I wasn't sure if that's acceptable at this point. On the other hand, it could be optional.

vadz commented 6 months ago

Shall there be an implicit conversion to std::string or shall std::wstring be supported in case of SQLWCHAR based columns?

I think it would make sense to support wstring as people using SQLWCHAR in ODBC are most likely to use it too. But ideally I'd like to be able to support string to/from SQLWCHAR mapping too using UTF-8.

That could temporarily (until there's a replacement) be solved with icu, libiconv or alike.

I'd rather not pull in ICU just for this and libiconv is Unix-only. If necessary, I can contribute my own code, written many years ago, converting between UTF-8 and wchar_t (i.e. either UTF-16 or UTF-32). It has some lower level functions as well as

std::string ToUTF8(const std::wstring& wstr);
std::wstring FromUTF8(const std::string& uft8);

bold84 commented 6 months ago

That sounds good!

Okay, I'll add std::wstring support in a separate PR first.

Afterwards we can add the conversion to std::string.

bold84 commented 6 months ago

We might want to move the discussion to here:

https://github.com/SOCI/soci/pull/1133

SOCI / soci

UTF-16 support for ODBC (for MSSQL) #1041