Open michalbundyra opened 1 month ago
I initially thought that VARCHR/CHAR were deprecated in SQLServer, but reading the documentation again this is far from true:
Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding.
Only TEXT, NTEXT and IMAGE are deprecated and will be removed.
I've created a small test script:
CREATE DATABASE test2 COLLATE Latin1_General_100_BIN2
GO
USE test2
CREATE TABLE collation_test (
id INT NOT NULL IDENTITY,
name1 VARCHAR(100) NOT NULL,
name2 NVARCHAR(100) NOT NULL
)
INSERT INTO collation_test (name1, name2) VALUES (
'UmlautsÄÜÖß',
'UmlautsÄÜÖß'
)
SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO
CREATE DATABASE test COLLATE Latin1_General_100_BIN2_UTF8
GO
USE test
CREATE TABLE collation_test (
id INT NOT NULL IDENTITY,
name1 VARCHAR(100) NOT NULL,
name2 NVARCHAR(100) NOT NULL
)
INSERT INTO collation_test (name1, name2) VALUES (
'UmlautsÄÜÖß',
'UmlautsÄÜÖß'
)
SELECT LEN(ct.name1), LEN(ct.name2), DATALENGTH(ct.name1), DATALENGTH(ct.name2) FROM collation_test ct
GO
Result:
| | | | |
|--+--+--+--+
|11|11|11|22|
| | | | |
|--+--+--+--+
|11|11|15|22|
But these types does not accept support UTF-8, so therefore there are the following types:
So even I follow the initial issue #2323 for years now and use NVARCHAR() everywhere, I think this shouldn't be merged, as it is based in outdated information.
@fabiang thanks for your comment.
Just to summarise:
nvarchar
fields yourself text
type is varchar(max)
, but string
is n(var)char(length)
)Is it right?
I am not sure if I can see your point - if varchar
can store now unicode (15.x+ with the correct db collation), why you still use nvarchar
?
Wouldn't be better to update doctrine/dbal to use consistently just varchar
for SQLServer and point users to use correct collation on the db if they want to store unicode?
What then about users who are stuck (for whatever reason) on version prior to 15.x?
I still use NVARCHAR because I had outdated or wrong informations. As of SQLServer 2019 I'll use VARCHAR now with the correct collation set. Therefore I think doctrine/dbal should only use VARCHAR(length)/VARCHAR(MAX) now everywhere. And imho SQLServer < 15.x support should be dropped, instead of having a BC-break.
Summary
Very old bug for SQLServer platform preventing usage unicode characters with "text" type.
Background
Here are previously reported issues and attempts to fix it:
1182
1351
2323
5237
Before it was unfortunately not classified as a bug - see https://github.com/doctrine/dbal/pull/5237#issuecomment-1029294556 but let me provide here a bit more light, why actually IT IS a bug.
First of all in SQLServer we have types like:
varchar
(variable length)char
(fixed length)text
(deprecated in favour ofvarchar(max)
)But these types does not accept support UTF-8, so therefore there are the following types:
nvarchar
(variable length)nchar
(fixed length)ntext
(deprecated in favour ofnvarchar(max)
)At some point, when doctrine evolved SQLServer platform from MSSQL the change has been made, but not in all places - see https://github.com/doctrine/dbal/commit/8b29ffe5a53e4b823615e2ffc2b39a04aab549e9:
So it was recognised that we suppose to use
nvarchar
in some places, but notntext
in the other. And since then it is just there.Later it was just updated to use
varchar(max)
instead oftext
(astext
become deprecated): https://github.com/doctrine/dbal/pull/451Current situation
Now, on version 4.0.x, we still have no ability to use text field with SQLServer for UTF-8 characters.
The other fields:
n
version of the columns by, but not the text one: https://github.com/doctrine/dbal/blob/cbd0e9adb46ec66f22c163e7003d0f4b2e38c8c7/src/Platforms/SQLServerPlatform.php#L910-L913and there is no easy option to use it, but overriding the type to provide support of length=-1 on the type level (so migrations is not trying to create nvarchar(-1) but uses nvarchar(max); -1 is reported back from the schema when we use max length, and this is also the same behaviour for varchar field).
Additional considerations
As it would solve the current issue, it's not a perfect solution. It might be even considered as a BC break for some, especially when taking into account storage differences between
varchar
andnvarchar
types.Not everyone has a need to store unicode characters, but I haven't seen any issues reported that "
(var)char
should be used instead ofn(var)char
".If we want to provide full flexibility we should be able to choose the exact type we want - either unicode or non-unicode, BUT the default solution should be at least consistent and not mix of these two.