Document strings are always UTF-8

mloskot commented 11 years ago

As per the recent discussion "Support of MSSQL Server", we need to:

~~Handle only UTF-8 for multibyte encodings right now and, perhaps, throw an error if we can detect that the database uses anything else.~~ (this part is controversial)

Formalize this by documenting that std::string used by SOCI is supposed to always be in UTF-8.

pfedor commented 11 years ago

Is this just for MSSQL or in general? I see no reason to disallow arbitrary strings. If a user wants to use a differently encoded string, or indeed arbitrary binary data, and a given engine allows it, why would SOCI go out of its way to prevent that use case?

Thanks,

Aleksander

mloskot commented 11 years ago

@pfedor I agree with you and I think my comment on soci-users is aligned with this:

Regarding multi byte encoding, we just take what we get and hand over to client or database, using std::string as array of bytes.

but then I assumed that if the user or a backend needs to encode strings and the encoding is not known, UTF-8 is assumed. I might have confused myself.

vadz commented 11 years ago

I think binary data are a different story because they come from differently typed columns.

But for actual text data, I think using std::string in an unspecified encoding is pretty dangerous: you simply don't know what to do with it, so it can easily result in silent data loss. Of course, if SOCI refuses to return it in the first place if it's anything but UTF-8, it could still be seen as data loss but at least it wouldn't be silent (SOCI would throw an exception).
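To make the "fail loudly instead of silently" idea concrete, here is a minimal sketch of a UTF-8 well-formedness check followed by a throwing wrapper. The names `is_valid_utf8` and `require_utf8` are hypothetical and are not part of SOCI's API; this only illustrates what such a check could look like.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not SOCI API): true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false; // stray continuation byte or invalid lead byte

        if (i + len > n) return false; // sequence truncated at end of input

        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && cp < 0x10000) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;

        i += len;
    }
    return true;
}

// Hypothetical wrapper: turns silent corruption into a visible error.
void require_utf8(const std::string& s)
{
    if (!is_valid_utf8(s))
        throw std::runtime_error("text value is not valid UTF-8");
}
```

Pure ASCII passes this check unchanged, so only values that actually contain non-ASCII bytes in another encoding would be rejected.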

hrabe commented 11 years ago

@vadz: "Of course, if SOCI refuses to return it in the first place if it's anything but UTF-8, it could still be seen as data loss but at least it wouldn't be silent (SOCI would throw an exception)"

This would explicitly force any use of std::string to fail when using MSSQL Server with a non-Unicode (ISO-8859-1) database, because the driver always returns ISO-8859-1 bytes instead of UTF-8 (as said, MS doesn't handle UTF-8 at all, except within the internal conversion layer of IIS ASPX). Enforcing this with an exception would rule out MSSQL Server and make SOCI useless for non-Unicode MS databases.

vadz commented 11 years ago

Correction: it would make it useless for the fields actually containing non-ASCII data, as ASCII is the same whether it's encoded in CP1252 (used by MS SQL Server) or UTF-8 (used by everybody else). And while this could still be seen as very bad, I'm not sure how receiving data with an unknown encoding is really better, so the current situation is only better if you work with a single RDBMS.

Of course, the ideal solution would be to recode the text data to/from UTF-8 as needed. But this is more difficult...

mloskot commented 11 years ago

I've just replied to @hrabe's post "unsupported BLOB vector support @ ODBC" with links to previous discussions I had with @pfedor on that (and it was not the first time we discussed it; Maciej Sobczak also used to advocate a similar solution). There were several prototypes of binary support, first by Artur Bać in bytea, varbinary, then followed by my prototype soci-type-binary.

I think we really need to reach consensus about std::string. What are the drawbacks of using it as a universal carrier for bytes?

The way binary vs textual data is handled is based on usage: the user knows what data she binds to.

The way textual data with an encoding is handled depends on the backend: some are configurable at run-time, some may have a fixed encoding for server/client communication. In this case, treatment of textual data carried in std::string will depend on the run-time configuration, including conversions between std::string and std::wstring. In other words, for textual data, there are two levels of treatment: first, binary vs textual; then, if textual, what the encoding is.

At the first level, use of std::string as a container does not require any additional handling. Only the encoding does, for reliable interpretation.
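As a small illustration of the "carrier for bytes" point, the sketch below shows that std::string keeps arbitrary bytes, including embedded '\0', as long as the length is passed explicitly. The function names are hypothetical and only serve the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies raw bytes into a std::string.
// The explicit size keeps embedded '\0' bytes intact.
std::string bytes_to_string(const unsigned char* data, std::size_t size)
{
    return std::string(reinterpret_cast<const char*>(data), size);
}

// The pitfall: constructing from a C string stops at the first '\0'.
std::string c_string_to_string(const char* data)
{
    return std::string(data);
}
```

The container itself is therefore safe for binary data; the risk lies in any code path that treats the content as a NUL-terminated C string or assumes an encoding.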

What am I missing?

I admit that I'm frequently confused myself (see this ticket :)).

vadz commented 11 years ago

I think you're right but I don't think it helps :-)

We can indeed use std::string for both binary and textual data. But in the latter case we must either always use UTF-8 for it or provide its encoding in some out-of-band way. Because, once again, a char* string without encoding information is just a data loss/corruption waiting to happen.

And I don't think we're going to replace all occurrences of std::string in the API with a std::pair<std::string, soci::encoding> or anything like that. Which is why I still think that UTF-8 should be used.
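For reference, the out-of-band alternative mentioned above might look roughly like the sketch below. Everything here is hypothetical: SOCI has no `encoding` type, so the sketch uses its own `sketch` namespace and an invented list of encodings.

```cpp
#include <string>
#include <utility>

namespace sketch
{
// Hypothetical encoding tag; the set of values is an assumption.
enum class encoding { utf8, latin1, cp1252, unknown };

// Text bytes travelling together with their encoding.
using tagged_text = std::pair<std::string, encoding>;

// Two values can only be compared byte-wise if both encodings are
// known and identical; otherwise a recoding step would be needed.
inline bool same_known_encoding(const tagged_text& a, const tagged_text& b)
{
    return a.second != encoding::unknown && a.second == b.second;
}
} // namespace sketch
```

Every function taking or returning text would have to carry this pair, which is the API cost being weighed against simply fixing the encoding to UTF-8.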

pfedor commented 11 years ago

Is there any reason soci even needs to know whether a particular string sent to a database is arbitrary binary data or text, let alone in which encoding?

Thanks,

Aleksander

hrabe commented 11 years ago

I doubt we can find a single approach that everyone accepts. Let me start with a real-world example: in your kitchen you have an empty milk bottle. You decide to store acid in it, because it can hold acid. You leave the room because the doorbell rings. Your child enters the kitchen and finds the bottle. What do you think your child assumes on seeing a full milk bottle?

The same goes for generic use of std::string as a multi-purpose container. Seeing a std::string, I assume:

its content is printable
it doesn't contain embedded '\0'
it may have some kind of printable encoding (ISO-8859-1, UTF-8, ...)

Seeing a std::wstring, I assume:

its content is printable too
it contains embedded '\0' bytes, by the nature of wide-character encodings
it may be encoded as UTF-16, UTF-32 or UCS2

If both are used as multi-purpose containers, the strongly typed character of C++ gets lost. If I wanted something like that, I could use PHP, Ruby etc.

Furthermore, if std::string were such an "I can hold everything" container, then the whole project would only need to be based on std::string, because it can also hold int, bool, date etc. This leads to variant-style usage where everything is a string with conversions like data.asString(), data.asInt(), data.asDate() etc. This is what I know from Genesys ACD databases, where each and every column is a whitespace-padded string, even if the column stands for an int, date, time or whatever.

Don't get me wrong. It would be fine to assume that std::string is a string in every case and has to be strictly encoded as UTF-8. But that's it! Any other content, such as binary data, is a dedicated data type, so it shouldn't be stored in a type that isn't expected to hold binary content in the first place. So in my opinion it makes more sense to separate those things into:

std::string -> UTF-8 encoded holder of pure printable string content (no embedded '\0' or binary)
std::wstring -> printable string content encoded as UTF-16, UTF-32 or UCS2 (no binary)
soci::binarydata -> holder of any kind of binary data, even if it is large text
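A minimal sketch of what this separation buys at compile time: with a distinct binary type, overload resolution can tell text from binary without inspecting the content. The `binary_data` struct and `describe` functions are hypothetical stand-ins for the proposed soci::binarydata, which is not an existing SOCI type.

```cpp
#include <string>
#include <vector>

namespace sketch
{
// Hypothetical stand-in for the proposed binary holder.
struct binary_data
{
    std::vector<unsigned char> bytes;
};

// The type alone selects the handling; no runtime guessing is needed.
inline const char* describe(const std::string&)  { return "text (UTF-8)"; }
inline const char* describe(const std::wstring&) { return "text (wide)"; }
inline const char* describe(const binary_data&)  { return "binary"; }
} // namespace sketch
```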

std::string is defined in terms of char*, which indicates text, whereas unsigned char* indicates some kind of binary data (bytes). The elements have the same size and each can hold the other's content, but the two types still mean different things.

BTW: Reading a non-Unicode string column from MSSQL (ISO-8859-1) with the text content "Über" as std::string will return 0xDC 0x62 0x65 0x72 (the single-byte ISO-8859-1 encoding), but what SOCI would expect, as I read your earlier answers in this thread, is 0xC3 0x9C 0x62 0x65 0x72 (the UTF-8 bytes of the same string).
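The byte values in this example can be reproduced with a short recoding sketch. The function `latin1_to_utf8` is a hypothetical helper, not SOCI code, and it covers ISO-8859-1 only: CP1252 differs from it in the 0x80 to 0x9F range and would need a lookup table.

```cpp
#include <string>

// Hypothetical helper: recodes ISO-8859-1 bytes to UTF-8.
// Each Latin-1 byte is the code point U+0000..U+00FF itself, so
// bytes below 0x80 are copied and the rest become two-byte sequences.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (char ch : in)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
        }
    }
    return out;
}
```

For the input bytes 0xDC 0x62 0x65 0x72, the first byte becomes 0xC3 0x9C and the ASCII bytes are left as they are.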

Knowing all this, it's worth thinking about and discussing more deeply:

encoding support (explicit or implicit) inside SOCI
strong data typing to avoid multi-purpose containers
correct handling of bulk operations with any data type (including correct ORM bulk operations)
the ability to detect data types inside the implementation/backends (bindings etc.) and possibly handle them differently

mloskot commented 11 years ago

It may not help directly, but it is important to agree on some constraints.

Yes, I have a similar understanding: if we want to do more than just forward bytes in or out, then indeed we must either always use UTF-8 or provide the encoding, so some implicit or explicit metadata would be necessary.

mloskot commented 11 years ago

Aleksander, that's the question I ask myself too. But if we want to do more than just forward bytes in or out, then I don't see any way to achieve std::string to std::wstring conversion other than the std::ctype::narrow and std::ctype::widen options. Enlightenment welcome :)
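For readers unfamiliar with that option, here is a minimal sketch of widening through the std::ctype<wchar_t> facet. The wrapper name `widen_with_ctype` is hypothetical. Note that widen maps each char independently, so it cannot decode a multibyte encoding such as UTF-8; for bytes outside the basic character set the result depends on the locale.

```cpp
#include <locale>
#include <string>

// Hypothetical wrapper around std::ctype<wchar_t>::widen.
// Converts char by char using the facet of the given locale.
std::wstring widen_with_ctype(const std::string& narrow,
                              const std::locale& loc = std::locale::classic())
{
    std::wstring wide(narrow.size(), L'\0');
    if (!narrow.empty())
    {
        std::use_facet<std::ctype<wchar_t> >(loc).widen(
            narrow.data(), narrow.data() + narrow.size(), &wide[0]);
    }
    return wide;
}
```

Because one input char always yields exactly one wchar_t, this is only adequate for single-byte encodings, which is why knowing the source encoding matters for anything beyond ASCII.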

hrabe commented 11 years ago

Support for std::string and std::wstring can be provided in parallel without collision. The ODBC drivers (UNICODE versions) do the necessary conversion if a column is reported as a wstring type but is bound as a string type. If UNICODE drivers are used (I can only speak for Windows), any string column will be reported as a wide type and comes primarily as wstring UTF-16 (UCS2) content. Binding an ordinary string to it will force the driver to convert automatically into a string in the matching encoding (ANSI or UTF-8). See fork https://github.com/injixo/soci/commit/9b0628252d8144b947b1c5cc896e4d8b9c734764

This has been tested using Windows 32bit ODBC UNICODE driver:

Oracle 11.2
PostgreSQL 9.1.1.0
MySQL 5.1.2 (5.2.5 doesn't work and crashes during utilization)
SQLite 0.993
MSSQL Server Native Client 2009.100.1617.00

The 64-bit tests are still pending, but I expect them to succeed too.

As long as you don't use row or rowset, you will get what you bind to. With row (rowset) you will see what the driver delivers, because the data type may be dt_wstring instead of the already known dt_string (both are still supported).

SOCI / soci

Document strings are always UTF-8 #166