SQLServer - fix unicode support for text type - nvarchar(max) instead of varchar(max)

michalbundyra commented 1 month ago

Q	A
Type	bug
Fixed issues	#2323

Summary

Very old bug for SQLServer platform preventing usage unicode characters with "text" type.

Background

Here are previously reported issues and attempts to fix it:

1182
1351
2323
5237

Before it was unfortunately not classified as a bug - see https://github.com/doctrine/dbal/pull/5237#issuecomment-1029294556 but let me provide here a bit more light, why actually IT IS a bug.

First of all in SQLServer we have types like:

varchar (variable length)
char (fixed length)
text (deprecated in favour of varchar(max))

But these types does not accept support UTF-8, so therefore there are the following types:

nvarchar (variable length)
nchar (fixed length)
ntext (deprecated in favour of nvarchar(max))

At some point, when doctrine evolved SQLServer platform from MSSQL the change has been made, but not in all places - see https://github.com/doctrine/dbal/commit/8b29ffe5a53e4b823615e2ffc2b39a04aab549e9:

So it was recognised that we suppose to use nvarchar in some places, but not ntext in the other. And since then it is just there.

Later it was just updated to use varchar(max) instead of text (as text become deprecated): https://github.com/doctrine/dbal/pull/451

Current situation

Now, on version 4.0.x, we still have no ability to use text field with SQLServer for UTF-8 characters.

The other fields:

fixed length: https://github.com/doctrine/dbal/blob/cbd0e9adb46ec66f22c163e7003d0f4b2e38c8c7/src/Platforms/SQLServerPlatform.php#L875
variable length: https://github.com/doctrine/dbal/blob/cbd0e9adb46ec66f22c163e7003d0f4b2e38c8c7/src/Platforms/SQLServerPlatform.php#L890 do use n version of the columns by, but not the text one: https://github.com/doctrine/dbal/blob/cbd0e9adb46ec66f22c163e7003d0f4b2e38c8c7/src/Platforms/SQLServerPlatform.php#L910-L913

and there is no easy option to use it, but overriding the type to provide support of length=-1 on the type level (so migrations is not trying to create nvarchar(-1) but uses nvarchar(max); -1 is reported back from the schema when we use max length, and this is also the same behaviour for varchar field).

Additional considerations

As it would solve the current issue, it's not a perfect solution. It might be even considered as a BC break for some, especially when taking into account storage differences between varchar and nvarchar types.

Not everyone has a need to store unicode characters, but I haven't seen any issues reported that "(var)char should be used instead of n(var)char".

If we want to provide full flexibility we should be able to choose the exact type we want - either unicode or non-unicode, BUT the default solution should be at least consistent and not mix of these two.

fabiang commented 1 month ago

I initially thought that VARCHR/CHAR were deprecated in SQLServer, but reading the documentation again this is far from true:

Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding.

Only TEXT, NTEXT and IMAGE are deprecated and will be removed.

I've created a small test script:

CREATE DATABASE test2 COLLATE Latin1_General_100_BIN2
GO
USE test2
CREATE TABLE collation_test (
    id INT NOT NULL IDENTITY,
    name1 VARCHAR(100) NOT NULL,
    name2 NVARCHAR(100) NOT NULL
)

INSERT INTO collation_test (name1, name2) VALUES (
    'UmlautsÄÜÖß',
    'UmlautsÄÜÖß'
)

SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO

CREATE DATABASE test COLLATE Latin1_General_100_BIN2_UTF8
GO
USE test
CREATE TABLE collation_test (
    id INT NOT NULL IDENTITY,
    name1 VARCHAR(100) NOT NULL,
    name2 NVARCHAR(100) NOT NULL
)

INSERT INTO collation_test (name1, name2) VALUES (
    'UmlautsÄÜÖß',
    'UmlautsÄÜÖß'
)

SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO

Result:

|  |  |  |  |
|--+--+--+--+
|11|11|11|22|

|  |  |  |  |
|--+--+--+--+
|11|11|15|22|

But these types does not accept support UTF-8, so therefore there are the following types:

So even I follow the initial issue #2323 for years now and use NVARCHAR() everywhere, I think this shouldn't be merged, as it is based in outdated information.

michalbundyra commented 1 month ago

@fabiang thanks for your comment.

Just to summarise:

you think it should not be merged,
at the same point you use nvarchar fields yourself
the current inconsistency is ok? (doctrine text type is varchar(max), but string is n(var)char(length))

Is it right?

I am not sure if I can see your point - if varchar can store now unicode (15.x+ with the correct db collation), why you still use nvarchar? Wouldn't be better to update doctrine/dbal to use consistently just varchar for SQLServer and point users to use correct collation on the db if they want to store unicode?

What then about users who are stuck (for whatever reason) on version prior to 15.x?

fabiang commented 4 weeks ago

I still use NVARCHAR because I had outdated or wrong informations. As of SQLServer 2019 I'll use VARCHAR now with the correct collation set. Therefore I think doctrine/dbal should only use VARCHAR(length)/VARCHAR(MAX) now everywhere. And imho SQLServer < 15.x support should be dropped, instead of having a BC-break.

doctrine / dbal