Open mloskot opened 11 years ago
Is this just for MSSQL or in general? I see no reason to disallow arbitrary strings. If a user wants to use a differently encoded string, or indeed arbitrary binary data, and a given engine allows it, why would SOCI go out of its way to prevent that use case?
Thanks,
Aleksander
On Tue, Jun 25, 2013 at 4:34 PM, Mateusz Loskot notifications@github.com wrote:
As per recent discussion, we need to:
Handle only UTF-8 for multibyte encodings right now and, perhaps, throw an error if we can detect that the database uses anything else. Formalize this by documenting that std::string used by SOCI is supposed to always be in UTF-8.
— Reply to this email directly or view it on GitHubhttps://github.com/SOCI/soci/issues/166 .
@pfedor I agree with you and I think my comment on soci-users is aligned with this:
Regarding multi byte encoding, we just take what we get and hand over to client or database, using std::string as array of bytes.
but then I assumed that if user or backend needs to encode strings and encoding is not known, UTF-8 is assumed. I might have got confused myself.
I think binary data are a different story because they come from differently typed columns.
But for actual text data, I think using std::string in an unspecified encoding is pretty dangerous: you simply don't know what to do with it, so it can easily result in silent data loss. Of course, if SOCI refuses to return it in the first place if it's anything but UTF-8, it could still be seen as data loss but at least it wouldn't be silent (SOCI would throw an exception).
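To make the "throw instead of silently corrupting" idea concrete, here is a minimal sketch of what such a check could look like. The names is_valid_utf8 and throw_if_not_utf8 are hypothetical and not part of SOCI; the sketch only assumes that a backend has the raw bytes in a std::string before handing them to the user.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not SOCI API: returns true if `s` is well-formed UTF-8.
// Rejects overlong forms, UTF-16 surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t extra = 0;     // continuation bytes expected after the lead byte
        unsigned long min_cp = 0;  // smallest code point allowed for this length
        unsigned long cp = 0;      // decoded code point
        if ((c & 0xE0) == 0xC0)      { extra = 1; min_cp = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min_cp = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min_cp = 0x10000; cp = c & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical policy function: fail loudly rather than pass on unknown bytes.
void throw_if_not_utf8(const std::string& s)
{
    if (!is_valid_utf8(s))
        throw std::runtime_error("text value is not valid UTF-8");
}
```

Note that pure ASCII always passes this check, so only values that actually contain non-ASCII bytes in another encoding would trigger the exception.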
@vadz: "Of course, if SOCI refuses to return it in the first place if it's anything but UTF-8, it could still be seen as data loss but at least it wouldn't be silent (SOCI would throw an exception)"
This would explicitly force any use of std::string to fail with MSSQL Server and non-Unicode (ISO-8859-1) columns, because the driver always returns ISO-8859-1 bytes instead of UTF-8 (as said, MS doesn't handle UTF-8 at all, except within the internal conversion layer of IIS/ASPX). Enforcing this strictly by throwing an exception would rule out MSSQL Server and make SOCI useless for non-Unicode MS databases.
Correction: it would make it useless for the fields actually containing non-ASCII data, as ASCII is the same whether it's encoded in CP1252 (used by MS SQL Server) or UTF-8 (used by everybody else). And while this could be seen as still very bad, I'm not sure how receiving data with unknown encoding is really better, so the current situation is only better if you work with a single RDBMS only.
Of course, the ideal solution would be to recode the text data to/from UTF-8 as needed. But this is more difficult...
I've just replied to @hrabe's post "unsupported BLOB vector support @ ODBC" with links to previous discussions I had with @pfedor on that (and it was not the first time we discussed it; Maciej Sobczak also used to advocate a similar solution). There were several prototypes of binary support, first by Artur Bać in bytea, varbinary, then followed by my prototype soci-type-binary.
I think we really need to work out a consensus about std::string.
What are the drawbacks of using it as a universal carrier for bytes?
The way binary vs textual data is handled is based on usage: the user knows what data she binds to.
The way textual data with encoding is handled depends on the backend: some are configurable at run-time, some may have a fixed encoding for server/client communication.
In this case, treatment of textual data carried in std::string will depend on the run-time configuration, including conversions between std::string and std::wstring.
In other words, for textual data, there are two levels of treatment: first binary vs textual; if textual, what's the encoding.
At the first level, use of std::string as a container does not require any additional handling.
Only the encoding does, for reliable interpretation.
What am I missing?
I admit that I'm frequently confused myself (see this ticket :)).
I think you're right but I don't think it helps :-)
We can indeed use std::string for both binary and textual data. But in the latter case we must either always use UTF-8 for it or provide its encoding in some out-of-band way. Because, once again, a char* string without encoding information is just a data loss/corruption waiting to happen.
And I don't think we're going to replace all occurrences of std::string in the API with a std::pair<std::string, soci::encoding> or anything like that. Which is why I still think that UTF-8 should be used.
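For illustration only, this is roughly what the out-of-band alternative mentioned above would mean for callers. Everything here is hypothetical: SOCI has no soci::encoding type, and the enumerators and the helper function are made up for this sketch.

```cpp
#include <string>
#include <utility>

namespace sketch {

// Hypothetical stand-in for the "soci::encoding" mentioned in the comment above.
enum class encoding { utf8, latin1, unknown };

// Every text value would have to travel together with its encoding tag.
using encoded_string = std::pair<std::string, encoding>;

// Callers would need a check like this before interpreting the bytes as UTF-8.
inline bool safe_to_treat_as_utf8(const encoded_string& s)
{
    return s.second == encoding::utf8;
}

} // namespace sketch
```

The extra tag on every value is exactly the API cost that the "always UTF-8" convention avoids.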
Is there any reason soci even needs to know whether a particular string sent to a database is arbitrary binary data or text, let alone in which encoding?
Thanks,
Aleksander
I have concerns about finding a single approach that everyone accepts. Let me start with a real-world example: in your kitchen you have an empty milk bottle. You decide to store acid in it because it can hold acid. You leave the room because the doorbell rings. Your child enters the kitchen and finds the bottle. What do you think your child assumes on seeing a full milk bottle?
The same applies to generic use of the std::string type as a multi-purpose container.
Seeing a std::string, I assume:
Seeing a std::wstring, I assume:
If both are used as multi-purpose containers, then the strongly typed character of C++ gets lost. If I wanted something like that, I could use PHP, Ruby, etc.
Furthermore, if std::string were such an "I can hold everything" container, then the whole project would only need to be based on std::string, because it can also hold int, bool, date, etc. This leads to variant-typed usage where everything is a string with conversions like data.asString(), data.asInt(), data.asDate(), etc.
This is what I know from Genesys ACD databases, where each and every column is a whitespace-padded string, even if the column stands for an int, date, time or whatever.
Don't get me wrong. It would be fine to assume that std::string is a string in every case and has to be strictly encoded as UTF-8. But that's it! Any other content, such as binary, deserves a dedicated data type, so it doesn't have to be stored in a type that isn't expected to hold binary content in the first place. So in my opinion it makes more sense to separate these things into:
std::string -> UTF-8 encoded holder of pure printable string content (no embedded '\0' or binary)
std::wstring -> printable string content encoded as UTF-16, UTF-32 or UCS-2 (no binary)
soci::binarydata -> holder of any kind of binary data, even if it is large text
std::string is defined in terms of char* content, indicating text, whereas unsigned char* indicates some kind of binary data (bytes). The element size is the same and each can hold the other, but the two types mean different things anyway.
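A small sketch of how that separation could look in code, so that the static type decides how a value is bound. This is an assumption for illustration: binary_data, bind_kind and kind_of are hypothetical names (soci::binarydata above is a proposal, not an existing SOCI type).

```cpp
#include <string>
#include <vector>

namespace sketch {

// Hypothetical dedicated type for raw bytes, kept distinct from text types.
struct binary_data
{
    std::vector<unsigned char> bytes;
};

// How a backend could classify a bound value purely from its static type.
enum class bind_kind { text_utf8, text_wide, binary };

inline bind_kind kind_of(const std::string&)  { return bind_kind::text_utf8; }
inline bind_kind kind_of(const std::wstring&) { return bind_kind::text_wide; }
inline bind_kind kind_of(const binary_data&)  { return bind_kind::binary; }

} // namespace sketch
```

With overloads like these, passing binary content as text is a deliberate choice by the caller rather than something the library has to guess.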
BTW: Reading a non-Unicode string column from MSSQL (ISO-8859-1) with the text content "Über" as std::string will return:
0xDC 0x62 0x65 0x72 -> plain single-byte ISO-8859-1 values
but what SOCI expects, judging by the earlier answers in this thread, is:
0xC3 0x9C 0x62 0x65 0x72 -> UTF-8 bytes of the same string content.
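The byte sequences above can be reproduced with a short recoding sketch. It assumes strict ISO-8859-1 input, where every byte value equals its Unicode code point; CP1252 differs in the 0x80 to 0x9F range and would need a lookup table instead. The function name latin1_to_utf8 is hypothetical and not part of SOCI.

```cpp
#include <string>

// Recode ISO-8859-1 bytes to UTF-8. Each input byte is a code point in
// U+0000..U+00FF, so it needs at most two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (char ch : in)
    {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));  // ASCII is unchanged
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

For the input 0xDC 0x62 0x65 0x72 this yields 0xC3 0x9C 0x62 0x65 0x72, matching the two sequences quoted above.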
Knowing all this, it's worth thinking about and discussing more deeply:
It may not help directly, but it is important to agree on some constraints.
Yes, my understanding is similar: if we want to do more than just forward bytes in or out, we must either always use UTF-8 or provide the encoding, so some implicit or explicit metadata would be necessary.
Mateusz Loskot, http://mateusz.loskot.net
Aleksander, that's the question I ask myself too.
But, if we want to do more than just forwarding bytes in or out, then I don't see any alternative to achieve std::string to std::wstring conversion, apart from the std::ctype::narrow and std::ctype::widen options.
Enlightenment welcome :)
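For reference, a minimal sketch of what conversion through std::ctype looks like. The wrapper names widen_chars and narrow_chars are made up for this example; the facet calls themselves are standard. Note that std::ctype converts character by character, so it knows nothing about multi-byte encodings such as UTF-8 and only behaves predictably for characters the chosen locale can represent as a single char.

```cpp
#include <locale>
#include <string>

namespace sketch {

// Widen each char to wchar_t using the ctype facet of `loc`.
inline std::wstring widen_chars(const std::string& s,
                                const std::locale& loc = std::locale::classic())
{
    std::wstring out(s.size(), L'\0');
    if (!s.empty())
    {
        std::use_facet<std::ctype<wchar_t> >(loc)
            .widen(s.data(), s.data() + s.size(), &out[0]);
    }
    return out;
}

// Narrow each wchar_t to char; characters the locale cannot represent
// are replaced by `fallback`.
inline std::string narrow_chars(const std::wstring& ws, char fallback = '?',
                                const std::locale& loc = std::locale::classic())
{
    std::string out(ws.size(), '\0');
    if (!ws.empty())
    {
        std::use_facet<std::ctype<wchar_t> >(loc)
            .narrow(ws.data(), ws.data() + ws.size(), fallback, &out[0]);
    }
    return out;
}

} // namespace sketch
```

Because of the per-character limitation, this alone would not turn UTF-8 bytes into correct UTF-16 or UTF-32 for non-ASCII text.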
Support for std::string and std::wstring can be provided in parallel without collision. The ODBC drivers (UNICODE versions) do the necessary conversion if a column is reported as a wstring type but is bound as a string type. If UNICODE drivers are used (I can only speak for Windows), any string column is reported as a wide type and comes primarily as wstring UTF-16 (UCS-2) content. Binding an ordinary string against it forces the driver to convert automatically into a string with the matching encoding (ANSI or UTF-8). See fork https://github.com/injixo/soci/commit/9b0628252d8144b947b1c5cc896e4d8b9c734764
This has been tested using the Windows 32-bit ODBC UNICODE driver:
The 64-bit test is still pending, but I expect it to succeed too.
As long as you don't use row or rowset you will get what you bind to. With row (rowset) you will see what the driver delivers, because the data type may be dt_wstring instead of the already known dt_string (both are still supported).
As per the recent discussion "Support of MSSQL Server", we need to:
Handle only UTF-8 for multibyte encodings right now and, perhaps, throw an error if we can detect that the database uses anything else (this part is controversial).
Formalize this by documenting that std::string used by SOCI is supposed to always be in UTF-8.