[linq2db] Different handling of \u2000 (and probably \u2001) character by native and managed providers

MaceWindu commented 6 years ago

It is a bit strange issue, because native provider also do strange things with those characters for some reason.

table column: ncharDataType nchar(20) NULL

Test inserts string and then read it back and compare.

Test have issue with following two characters: \u2000 and \u2001 https://www.fileformat.info/info/unicode/char/2000/index.htm https://www.fileformat.info/info/unicode/char/2002/index.htm

For native provider we receive decomposition of those characters: \2002 and \u2003 For managed provider we receive \u20 (space), which is also correct replacement.

Not sure how it should be fixed or should it be fixed at all, as from my PoV, database/provider shouldn't apply unicode normalization rules.

linq2db test: CharTypesTests.CharTrimming

MaceWindu commented 6 years ago

Also this test doesn't have issues with other databases we support for those characters - they always returned as is. So I would say - if possible - no codepoint changes should be done. But maybe it is server-side issue which cannot be fixed on provider's side.

senseibaka commented 6 years ago

On the reference driver, I tried running a simple query: select @input @input was set up as a DbType.String (unicode) with value '\u2000'. The value I get back is '\u2001'. Ribo output follows... What the driver sends:

PARAMS Token (0xD7); variable length.
  Param 1
    Length [4]:                 2
    Param data [2]:             "?"
        (as bytes) [2]:         0x0020

What the server responds with:

ROW Token (0xD1); variable length.
  Column 1
    Length [4]:                 2
    Row data [2]:               "?"
        (as bytes) [2]:         0x0220

So it seems that the server is in fact messing with the data.

I will look at making our driver behave like the reference driver

senseibaka commented 6 years ago

@MaceWindu I've not yet been able to reproduce this issue with our driver.

The only issues I encountered were to do with our handling (or lack thereof) of the null-terminator character, \0.

To view the test cases I've added related to this, check out the UnicodeStringTests.cs on the issue-51-unicode-whitespace branch

I've run these against the reference driver and behaviour appears consistent.

Any thoughts?

MaceWindu commented 6 years ago

I will provide more details on specific test case

table:

CREATE TABLE AllTypes
(
    ID                       int           IDENTITY,

    char20DataType           char(20)          NULL,
    ncharDataType            nchar(20)         NULL,
)

queries:

- write
INSERT INTO [AllTypes]
(
    [ncharDataType],
    [char20DataType]
)
VALUES
(
-- this is 'test09\u2000\u0020'
    N'test09  ',
    'test09     '
)

-- read
SELECT
    [t1].[ID],
    [t1].[char20DataType],
    [t1].[ncharDataType]
FROM
    [AllTypes] [t1]

I've debugged into provider and see that value read as TdsDataType.TDS_VARCHAR:

case TdsDataType.TDS_CHAR:
                case TdsDataType.TDS_VARCHAR:
                case TdsDataType.TDS_BOUNDARY:
                case TdsDataType.TDS_SENSITIVITY:
                    return stream.ReadNullableByteLengthPrefixedString(env.Encoding);

where encoding is Latin1 which doesn't support \u2000 and replace it with space

MaceWindu commented 6 years ago

I've created isolated test to help with this issue: DDL

CREATE TABLE AllTypes
(
    ID                       int           NOT NULL,
    ncharDataType            nchar(20)         NULL
)

test

var cmd = db.CreateCommand();
                cmd.CommandText = "INSERT INTO AllTypes(ID, ncharDataType) VALUES(123, @p)";
                var p = cmd.CreateParameter();
                p.ParameterName = "@p";
                p.DbType = System.Data.DbType.String;
                p.Value = "test09\u2000\u2001 ";
                cmd.Parameters.Add(p);
                cmd.ExecuteNonQuery();

                cmd = db.CreateCommand();
                cmd.CommandText = "SELECT ncharDataType FROM AllTypes WHERE ID = 123";
                using (var rd = cmd.ExecuteReader())
                {
                    rd.Read();
                    var value = rd.GetString(0);
                    Assert.AreEqual("test09\u2000\u2001 ".Replace('\u2000', '\u2002').Replace('\u2001', '\u2003').TrimEnd(' '), value);
                }

What you should observe:

space from end removed for reference provider and AdoNetCore
reference provider (or db?) replaces \u2000 -> \u2002, \u2001 -> \u2003 (which is correct in unicode terms, but actually shouldn't be done here)
AdoNetProvider returns ? (\x3f) replacement character, which indicates that codepage without requested codepoint were used somewhere

MaceWindu commented 6 years ago

Played with this test a bit and if I specify AnsiString for insert parameter, reference provider doesn't replace characters and return test09\u2000\u2001, which is somehow interesting, as even ANSI encoding is a vague term, it shouldn't support \u2000 and \u2001 from my PoV. AdoNetCore returns 'test09' here.

Didn't tried AseDbType values for insert parameter.

I would say even if there is different behavior between providers that should be fixed, I think we at linq2db also should retest it a bit more and maybe change parameters types we use for Sybase strings by default.

additional testcase: in our tests linq2db inserts data using literals, without parameters. Like N'test_string_here'

MaceWindu commented 6 years ago

Did some testing here: https://github.com/linq2db/linq2db/pull/1300/files#diff-8ecaf8e3054df43ad521ebf4f43225adR357

I would say reference driver has following issues:

it cannot pass \x0 character as part of string parameter (will not be surprised if it is protocol limitation)
it cannot pass \u2000 and \u2001 without change as a part of literal in any form, but can do it for some parameter types

Your driver have similar issue with \x0 character and encoding issue with non-latin characters in general.

I would say it would be nice if your driver will fix encoding issue to be on par with reference driver. Ideal case will be to work properly, but it could be server/protocol limitation in play here.

senseibaka commented 6 years ago

\x0 is equivalent to \0, isn't it? I've noticed that the reference driver treats it as a nul-terminator and trims the remainder of the string off before transmission. Here's the test which describes it (the cases are run against both ours and the reference driver; test case data in format description, input, expected output): https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/test/AdoNetCore.AseClient.Tests/Integration/Query/StringTests.cs#L216-L218

Originally we didn't handle \0, but in keeping with how the reference driver behaves, I implemented this fix: https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/src/AdoNetCore.AseClient/Internal/ParameterValueExtensions.cs#L34-L40

There are some oddities in how the database seems to handle whitespace characters, but this appears to not be an issue with the drivers, but the server (the cases are run against both ours and the reference driver; test case data in format description, input, expected output): https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/test/AdoNetCore.AseClient.Tests/Integration/Query/StringTests.cs#L226-L229

Note: unit tests are run against a utf8 charset database, using a charset=utf8 connection parameter. The exception is that as per spec, unichar/univarchar strings should ignore the connection's charset parameter and are utf-16 encoded.

What charset/encoding scheme does your test db/connection use? It'd be nice if I could write tests which work against the basic database charsets.

MaceWindu commented 6 years ago

Regarding \0 - I don't know why reference driver does it (doesn't make sense to me, especially if server can handle it), but it is up to you how to implement it - as reference driver or support it in parameters. I would say for things like that it would be nice to have ReferenceMode setting in your provider :).

Will check charset issue and report back, but I don't think we have different test environments for reference and your provider.

MaceWindu commented 6 years ago

We don't specify charset in connection string for tests, so it is like that:

Data Source=host;Port=5000;Database=TestDataCore;Uid=sa;Password=password;ConnectionTimeout=300;EnableBulkLoad=1;MinPoolSize=100;MaxPoolSize=200;

Database character set (sp_helpsort):

Character Set = 190, utf8
    Unicode 3.1 UTF-8 Character Set
    Class 2 Character Set
Sort Order = 45, altdict
    Alternate (lower-case first) dictionary ordering

Regarding 2000->2002 conversions. It was expected behavior, because I found that ASE has enable unicode normalization setting. Disabling it removes this conversion. My bad, need to read documentation :)

senseibaka commented 6 years ago

I would say for things like that it would be nice to have ReferenceMode setting in your provider :).

Interesting idea, I'll have a think about this -- probably the default would be true so as to do the least disruption.

Character Set = 190, utf8

Yep, it looks like our test envs use the same charset; but a different sort order, which I don't think would matter (currently we have bin_utf8).

Regarding 2000->2002 conversions. It was expected behavior, because I found that ASE has enable unicode normalization setting. Disabling it removes this conversion. My bad, need to read documentation :)

Ahah, I didn't know about that either, I'll probably disable and change tests to suit

senseibaka commented 6 years ago

I've figured out why our driver has issues if a charset isn't specified (manifests with the '?' character being returned).

Our driver was forcing a default connection character set (iso_1) if the connection string doesn't specify one. If the driver forces a character set, the server agrees. If the driver doesn't specify a character set, the server will tell the client which one to use (e.g. utf8).

So, the driver was shooting itself in the foot by specifying "iso_1" as a default, when it should just fact specify nothing.

For reference, here's the fix: https://github.com/DataAction/AdoNetCore.AseClient/pull/80/commits/bd77d77d760e85c5dceb9dfefac9de624a47c862#diff-a31732ebd4b72507b7387d9a9d60c0f4L256

senseibaka commented 6 years ago

Fix will be in the next release

MaceWindu commented 6 years ago

Fixed

DataAction / AdoNetCore.AseClient

[linq2db] Different handling of \u2000 (and probably \u2001) character by native and managed providers #51