bizstation / transactd

The high-speed and advanced NoSQL interface plugin for MySQL / MariaDB.
GNU General Public License v2.0
18 stars 1 forks source link

Multibyte varchar type values are corrupted. #27

Closed silverist closed 9 years ago

silverist commented 9 years ago

I insert multibyte string to varchar type column by using MultiByteToWideChar like as follows, but the values in the DB table are corrupted. Is there another way?

//note is convert result value.
after_length = MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, pll.note.c_str(), -1, NULL, 0);
MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, pll.note.c_str(), -1, note, after_length);

■Table settings

Default Referential Integrity: utf8_general_ci
"Variable_name"             "Value"
---------------------------------------------------------------
"character_set_client"      "utf8mb4"
"character_set_connection"  "utf8mb4"
"character_set_database"    "utf8"
"character_set_filesystem"  "binary"
"character_set_results"     "utf8mb4"
"character_set_server"      "utf8"
"character_set_system"      "utf8"
"character_sets_dir"        "C:\Program Files\MariaDB 10.0\share\charsets\"
bizstation commented 9 years ago

Please tell us followings:

silverist commented 9 years ago

They are as follows. "note" column data is corrupted.

■SHOW CREATE TABLE result

CREATE TABLE `lines_pl_ob` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `symbol` varchar(20) NOT NULL,
  `ss_time` date NOT NULL,
  `type` int(11) NOT NULL DEFAULT '0',
  `status` int(11) NOT NULL DEFAULT '0',
  `value` double NOT NULL DEFAULT '0',
  `note` varchar(50) DEFAULT NULL,
  `addtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updatetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=111 DEFAULT CHARSET=utf8

■The part of source code which sets values

    database* db = database::create();
    static const _TCHAR* uri = _T("tdap://root@localhost/ilte_line?dbfile=transactd_schema&pwd=r4nk0r4nk0");
    db->open(uri, TYPE_SCHEMA_BDF, TD_OPEN_NORMAL);
    if (db->stat() == 0)
    {
        //テーブルを開く
        table* tb = db->openTable(_T("lines_pl_ob"));

        tb->beginBulkInsert(BULKBUFSIZE);
        TCHAR* note = new TCHAR[100];
        std::setlocale(LC_ALL, "japanese");

        int after_length;

        for (auto pll : pl_line_list)
        {
            after_length = MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, pll.note.c_str(), -1, NULL, 0);
            MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, pll.note.c_str(), -1, note, after_length);
_tprintf(_T("%s\n"), note);

            tb->clearBuffer();
            tb->setFV(_T("symbol"), pll.symbol.c_str());
            tb->setFV(_T("ss_time"), pll.ss_time.c_str());
            tb->setFV(_T("type"), pll.type);
            tb->setFV(_T("status"), pll.status);
            tb->setFV(_T("value"), pll.value);
            tb->setFV(_T("note"), note);
//          tb->setFV(_T("addtime"), 100);
//          tb->setFV(_T("updatetime"), 100);
            tb->insert();

        }

        tb->commitBulkInsert();
        tb->close();
    }

    // close and release database object
    database::destroy(db);

■pl_line_list(insert data, instead of "The part of source code which gets and compares values")

symbol : ss_time             : type : status : value  : note
----------------------------------------------------------------
USDJPY : 2015-10-01 21:20:37 : 11   : 0      : 122.00 : 売り厚め
USDJPY : 2015-10-01 21:20:37 : 11   : 0      : 121.50 : 断続的に売り
USDJPY : 2015-10-01 21:20:37 : 11   : 0      : 121.60 : 断続的に売り
USDJPY : 2015-10-01 21:20:37 : 11   : 0      : 121.25 : 断続的に売り
USDJPY : 2015-10-01 21:20:37 : 11   : 0      : 121.30 : 断続的に売り
bizstation commented 9 years ago
after_length = MultiByteToWideChar(CP_UTF8, MB_PRECOMPOSED, pll.note.c_str(), -1, NULL, 0);

It seems to me that pll.note is UTF-8 string in above code. The charset of table is UTF-8 too, so you don't have to convert value. Simple do following:

tb->setFV(_T("note"), pll.note.c_str());
// table::setFV(const _TCHAR* fieldName, const char* value)

Apart from that, there is no problem if you use MultiByteToWideChar and pass WideChar string to setFV, because Transactd client convert charset automatically. The note value is correct in _tprintf(_T("%s\n"), note); line? Probably, second parameter should be 0 instead of MB_PRECOMPOSED if you use UTF-8. If pll.note is Shift-JIS string, the code will be:

after_length = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, pll.note.c_str(), -1, NULL, 0);
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, pll.note.c_str(), -1, note, after_length);

How did you check inserted values? Did you use getFV API in Transactd clients? Did you use SELECT SQL? Or other tool? (We want to make sure that data check procedure is not wrong.)

silverist commented 9 years ago

How did you check inserted values? Did you use getFV API in Transactd clients? Did you use SELECT SQL? Or other tool?

I check values on "Heidi SQL 9.1.0.4867" which comes with Maria DB 10.0. (show like the following snap shot.)

I try the following code, but "note" column is corrupted like the following snap shot.

tb->setFV(_T("note"), pll.note.c_str());
// table::setFV(const _TCHAR* fieldName, const char* value)

■insert data snap shot(on "Heidi SQL 9.1.0.4867" ) _201510271822

bizstation commented 9 years ago

In UTF-8, 売り厚め is encoded to following byte array (hex):

売 E5 A3 B2
り E3 82 8A
厚 E5 8E 9A
め E3 82 81

If this byte array will be decoded as "Shift-JIS", it will be 螢イ繧雁字繧....

It seems to me that the value was inserted correctly as UTF-8 byte array, but it is read as Shift-JIS by Heidi SQL. Please check Heidi SQL settings. Or, check with other tool (for example, execute SELECT note FROM lines_pl_ob SQL).

silverist commented 9 years ago

If this byte array will be decoded as "Shift-JIS", it will be 螢イ繧雁字繧....

It seems to me that the value was inserted correctly as UTF-8 byte array, but it is read as Shift-JIS by Heidi SQL.

I think so, too.

Just in case, I set "note" column encode to "Shift-JIS"(not change mysql character settings), then not corrupted on "Heidi SQL 9.1.0.4867" like the following snap shot. (Mysql convert from "Shift-JIS" to "utf8mb4"(UTF8) by the setting "character_set_client", or Transactd client convert, or else, I don't know well which it is caused by.)

■insert data snap shot(on "Heidi SQL 9.1.0.4867" ) _201510281204

bizstation commented 9 years ago

Probably this is problem in Heidi SQL. Heidi SQL does not convert UTF-8 string to Shift-JIS or UTF-16 on displaying. You use old version of Heidi SQL. Would you like to use latest version? Heidi SQL 9.3 is available.

Please check with Transactd client API. After tb->commitBulkInsert();, read data with code like followings:

tb->setKeyNum(0);
tb->setFV(_T("id"), 1); // The first record
tb->seek();
if (tb->stat() == 0)
    MessageBox(NULL, tb->getFVstr(_T("note")), _T("note"), 64));
else
    MessageBox(NULL, _T("seek error"), _T("note"), 16));

(I said "check with other tool" because I can not clarify that it is a problem of Transactd client or Heidi SQL. For example, execute SQL with default command line tool mysql.exe.)

silverist commented 9 years ago

I check "node" column with the following means after I insert them encoded by "UTF-8" and "Shift-JIS".

tb->setKeyNum(0);
tb->setFV(_T("id"), 1); // The first record
tb->seek();
if (tb->stat() == 0)
    MessageBox(NULL, tb->getFVstr(_T("note")), _T("note"), 64));
else
    MessageBox(NULL, _T("seek error"), _T("note"), 16));

The result are as follows.

encoded by "UTF-8"

■Transactd client API _utf8_mes_201510281724 ■command line tool mysql.exe on command prompt _utf8_mysql_201510281727 ■Heidi SQL 9.1.0.4867 _utf8_heidisql_9_1_201510281728 ■Heidi SQL 9.3.0.4984 _utf8_heidisql_9_3_201510281730

encoded by "Shift-JIS"

■Transactd client API _sjis_mes_201510281733 ■command line tool mysql.exe on command prompt _sjis_mysql_201510281737 ■Heidi SQL 9.1.0.4867 _sjis_heidisql_9_1_201510281738 ■Heidi SQL 9.3.0.4984 _sjis_heidisql_9_3_201510281739

bizstation commented 9 years ago

Can you run following code and tell the result?

#include <bzs/db/protocol/tdap/tdapSchema.h>
#include <bzs/db/protocol/tdap/client/database.h>

using namespace bzs::db::protocol::tdap::client;
using namespace bzs::db::protocol::tdap;

void dumpTableSchema(const tabledef* td)
{
    std::wstring s;
    _TCHAR tmp[100];
    s += _T("\nexec codepage\t");
    s += _ltot(nsdatabase::execCodePage(), tmp, 10);
    s +=  _T("\ntable charset\t");
    s += _ltot(td->charsetIndex, tmp, 10);
    s +=  _T("\nfield count\t");
    s += _ltot(td->fieldCount, tmp, 10);
    s +=  _T("\n");
    for (unsigned short i = 0; i < td->fieldCount; ++i)
    {
        fielddef* fd = &td->fieldDefs[i];
        s += _T("\nname\t");
        s += fd->name();
        s += _T("\ntype\t");
        s += getTypeName(fd->type);
        s +=  _T("\nlen\t");
        s += _ltot(fd->len, tmp, 10);
        s +=  _T("\ncharset\t");
        s += _ltot(fd->charsetIndex(), tmp, 10);
        s +=  _T("\n");
    }
    //_tprintf(s.c_str());
    OutputDebugString(s.c_str());
}

database* db = database::create();
static const _TCHAR* uri = _T("tdap://root@localhost/ilte_line?dbfile=transactd_schema&pwd=r4nk0r4nk0");
db->open(uri, TYPE_SCHEMA_BDF, TD_OPEN_NORMAL);
table* tb = db->openTable(_T("lines_pl_ob"));
dumpTableSchema(tb->tableDef());
bizstation commented 9 years ago

And, add this code to your inserting program to check pll.note.c_str() value.

MessageBoxA(NULL, pll.note.c_str(), "note", 64);
silverist commented 9 years ago

The results are as follows.

encoded by "UTF-8"

■dumpTableSchema

exec codepage   932
table charset   22
field count     9

name    id
type    AutoIncrement
len     4
charset 1

name    symbol
type    myVarChar
len     61
charset 22

name    ss_time
type    ZString
len     3
charset 1

name    type
type    Integer
len     4
charset 1

name    status
type    Integer
len     4
charset 1

name    value
type    Float
len     8
charset 1

name    note
type    myVarChar
len     151
charset 22

name    addtime
type    myTimeStamp
len     4
charset 1

name    updatetime
type    myTimeStamp
len     4
charset 1

■MessageBoxA ・pl_line_list[0] _utf8_mesa_pll_0_201510291903

・pl_line_list[1] ~ pl_line_list[4] _utf8_mesa_pll_1_201510291904

encoded by "Shift-JIS"

■dumpTableSchema

exec codepage   932
table charset   22
field count     9

name    id
type    AutoIncrement
len     4
charset 1

name    symbol
type    myVarChar
len     61
charset 22

name    ss_time
type    ZString
len     3
charset 1

name    type
type    Integer
len     4
charset 1

name    status
type    Integer
len     4
charset 1

name    value
type    Float
len     8
charset 1

name    note
type    myVarChar
len     151
charset 22

name    addtime
type    myTimeStamp
len     4
charset 1

name    updatetime
type    myTimeStamp
len     4
charset 1

■MessageBoxA ・pl_line_list[0] _sjis_mesa_pll_0_201510291849

・pl_line_list[1] ~ pl_line_list[4] _sjis_mesa_pll_1_201510291850

bizstation commented 9 years ago

Solution

Call following method on begining of your program:

nsdatabase::setExecCodePage(CP_UTF8);

It sets "execCodePage" to "UTF-8".

"execCodePage" means "the CodePage which is used in the program". Your execCodePage is 932 now, it means "Shift-JIS". In this case, strings which is passed to setFV as "char" are regarded as "Shift-JIS". Set it as "CP_UTF8" with above code, then you can pass "UTF-8" string correctly.

Charset converting in Transactd

Charset converting in "Multi-Byte Char" program

table::setFV(const _TCHAR* fieldName, const char* value) converts charset according to following steps:

  1. value is regarded as "execCodePage" string.
  2. If "execCodepage" is not same as "database (or table or column) codepage", convert value from "execCodePage" to "database codepage".
  3. Send it.

In your program...

  1. value is regarded as "Shift-JIS" string, despite it is "UTF-8" string.
  2. Convert value from "execCodePage (Shift-JIS)" to "database codepage (UTF-8")". The string is corrupted in this step...
  3. Send it.
silverist commented 9 years ago

I apply the above solution to my program, and "exec codepage" change "UTF-8" as follows, then "note" column becomes not corrupted when Insert data is encoded by "UTF-8". (It is the same too on transactd client 2.4.5)

Thank you for your support.

■dumpTableSchema

exec codepage   65001
table charset   22
field count     9

name    id
type    AutoIncrement
len     4
charset 1

name    symbol
type    myVarChar
len     61
charset 22

name    ss_time
type    ZString
len     3
charset 1

name    type
type    Integer
len     4
charset 1

name    status
type    Integer
len     4
charset 1

name    value
type    Float
len     8
charset 1

name    note
type    myVarChar
len     151
charset 22

name    addtime
type    myTimeStamp
len     4
charset 1

name    updatetime
type    myTimeStamp
len     4
charset 1