Closed MaceWindu closed 6 years ago
Also this test doesn't have issues with other databases we support for those characters - they always returned as is. So I would say - if possible - no codepoint changes should be done. But maybe it is server-side issue which cannot be fixed on provider's side.
On the reference driver, I tried running a simple query: select @input
@input
was set up as a DbType.String
(unicode) with value '\u2000'
.
The value I get back is '\u2001'
.
Ribo output follows...
What the driver sends:
PARAMS Token (0xD7); variable length.
Param 1
Length [4]: 2
Param data [2]: "?"
(as bytes) [2]: 0x0020
What the server responds with:
ROW Token (0xD1); variable length.
Column 1
Length [4]: 2
Row data [2]: "?"
(as bytes) [2]: 0x0220
So it seems that the server is in fact messing with the data.
I will look at making our driver behave like the reference driver
@MaceWindu I've not yet been able to reproduce this issue with our driver.
The only issues I encountered were to do with our handling (or lack thereof) of the null-terminator character, \0
.
To view the test cases I've added related to this, check out the UnicodeStringTests.cs
on the issue-51-unicode-whitespace
branch
I've run these against the reference driver and behaviour appears consistent.
Any thoughts?
I will provide more details on specific test case
table:
CREATE TABLE AllTypes
(
ID int IDENTITY,
char20DataType char(20) NULL,
ncharDataType nchar(20) NULL,
)
queries:
- write
INSERT INTO [AllTypes]
(
[ncharDataType],
[char20DataType]
)
VALUES
(
-- this is 'test09\u2000\u0020'
N'test09 ',
'test09 '
)
-- read
SELECT
[t1].[ID],
[t1].[char20DataType],
[t1].[ncharDataType]
FROM
[AllTypes] [t1]
I've debugged into provider and see that value read as TdsDataType.TDS_VARCHAR:
case TdsDataType.TDS_CHAR:
case TdsDataType.TDS_VARCHAR:
case TdsDataType.TDS_BOUNDARY:
case TdsDataType.TDS_SENSITIVITY:
return stream.ReadNullableByteLengthPrefixedString(env.Encoding);
where encoding is Latin1 which doesn't support \u2000 and replace it with space
I've created isolated test to help with this issue: DDL
CREATE TABLE AllTypes
(
ID int NOT NULL,
ncharDataType nchar(20) NULL
)
test
var cmd = db.CreateCommand();
cmd.CommandText = "INSERT INTO AllTypes(ID, ncharDataType) VALUES(123, @p)";
var p = cmd.CreateParameter();
p.ParameterName = "@p";
p.DbType = System.Data.DbType.String;
p.Value = "test09\u2000\u2001 ";
cmd.Parameters.Add(p);
cmd.ExecuteNonQuery();
cmd = db.CreateCommand();
cmd.CommandText = "SELECT ncharDataType FROM AllTypes WHERE ID = 123";
using (var rd = cmd.ExecuteReader())
{
rd.Read();
var value = rd.GetString(0);
Assert.AreEqual("test09\u2000\u2001 ".Replace('\u2000', '\u2002').Replace('\u2001', '\u2003').TrimEnd(' '), value);
}
What you should observe:
Played with this test a bit and if I specify AnsiString for insert parameter, reference provider doesn't replace characters and return test09\u2000\u2001
, which is somehow interesting, as even ANSI encoding is a vague term, it shouldn't support \u2000 and \u2001 from my PoV. AdoNetCore returns 'test09' here.
Didn't tried AseDbType values for insert parameter.
I would say even if there is different behavior between providers that should be fixed, I think we at linq2db also should retest it a bit more and maybe change parameters types we use for Sybase strings by default.
additional testcase: in our tests linq2db inserts data using literals, without parameters. Like N'test_string_here'
Did some testing here: https://github.com/linq2db/linq2db/pull/1300/files#diff-8ecaf8e3054df43ad521ebf4f43225adR357
I would say reference driver has following issues:
Your driver have similar issue with \x0 character and encoding issue with non-latin characters in general.
I would say it would be nice if your driver will fix encoding issue to be on par with reference driver. Ideal case will be to work properly, but it could be server/protocol limitation in play here.
\x0
is equivalent to \0
, isn't it? I've noticed that the reference driver treats it as a nul-terminator and trims the remainder of the string off before transmission.
Here's the test which describes it (the cases are run against both ours and the reference driver; test case data in format description
, input
, expected output
): https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/test/AdoNetCore.AseClient.Tests/Integration/Query/StringTests.cs#L216-L218
Originally we didn't handle \0
, but in keeping with how the reference driver behaves, I implemented this fix:
https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/src/AdoNetCore.AseClient/Internal/ParameterValueExtensions.cs#L34-L40
There are some oddities in how the database seems to handle whitespace characters, but this appears to not be an issue with the drivers, but the server (the cases are run against both ours and the reference driver; test case data in format description
, input
, expected output
):
https://github.com/DataAction/AdoNetCore.AseClient/blob/d07c411bef412dcd7859023efa8b3c4ca7aff957/test/AdoNetCore.AseClient.Tests/Integration/Query/StringTests.cs#L226-L229
Note: unit tests are run against a utf8 charset database, using a charset=utf8 connection parameter. The exception is that as per spec, unichar/univarchar strings should ignore the connection's charset parameter and are utf-16 encoded.
What charset/encoding scheme does your test db/connection use? It'd be nice if I could write tests which work against the basic database charsets.
Regarding \0 - I don't know why reference driver does it (doesn't make sense to me, especially if server can handle it), but it is up to you how to implement it - as reference driver or support it in parameters. I would say for things like that it would be nice to have ReferenceMode setting in your provider :).
Will check charset issue and report back, but I don't think we have different test environments for reference and your provider.
We don't specify charset in connection string for tests, so it is like that:
Data Source=host;Port=5000;Database=TestDataCore;Uid=sa;Password=password;ConnectionTimeout=300;EnableBulkLoad=1;MinPoolSize=100;MaxPoolSize=200;
Database character set (sp_helpsort):
Character Set = 190, utf8
Unicode 3.1 UTF-8 Character Set
Class 2 Character Set
Sort Order = 45, altdict
Alternate (lower-case first) dictionary ordering
Regarding 2000->2002 conversions. It was expected behavior, because I found that ASE has enable unicode normalization
setting. Disabling it removes this conversion. My bad, need to read documentation :)
I would say for things like that it would be nice to have ReferenceMode setting in your provider :).
Interesting idea, I'll have a think about this -- probably the default would be true
so as to do the least disruption.
Character Set = 190, utf8
Yep, it looks like our test envs use the same charset; but a different sort order, which I don't think would matter (currently we have bin_utf8
).
Regarding 2000->2002 conversions. It was expected behavior, because I found that ASE has enable unicode normalization setting. Disabling it removes this conversion. My bad, need to read documentation :)
Ahah, I didn't know about that either, I'll probably disable and change tests to suit
I've figured out why our driver has issues if a charset isn't specified (manifests with the '?' character being returned).
Our driver was forcing a default connection character set (iso_1) if the connection string doesn't specify one. If the driver forces a character set, the server agrees. If the driver doesn't specify a character set, the server will tell the client which one to use (e.g. utf8).
So, the driver was shooting itself in the foot by specifying "iso_1" as a default, when it should just fact specify nothing.
For reference, here's the fix: https://github.com/DataAction/AdoNetCore.AseClient/pull/80/commits/bd77d77d760e85c5dceb9dfefac9de624a47c862#diff-a31732ebd4b72507b7387d9a9d60c0f4L256
Fix will be in the next release
Fixed
It is a bit strange issue, because native provider also do strange things with those characters for some reason.
table column:
ncharDataType nchar(20) NULL
Test inserts string and then read it back and compare.
Test have issue with following two characters: \u2000 and \u2001 https://www.fileformat.info/info/unicode/char/2000/index.htm https://www.fileformat.info/info/unicode/char/2002/index.htm
For native provider we receive decomposition of those characters: \2002 and \u2003 For managed provider we receive \u20 (space), which is also correct replacement.
Not sure how it should be fixed or should it be fixed at all, as from my PoV, database/provider shouldn't apply unicode normalization rules.
linq2db test:
CharTypesTests.CharTrimming