PomeloFoundation / Pomelo.EntityFrameworkCore.MySql

Entity Framework Core provider for MySQL and MariaDB built on top of MySqlConnector
MIT License
2.7k stars 383 forks source link

MySQL collation issue in EF Core #553

Closed viveknuna closed 6 years ago

viveknuna commented 6 years ago

My database is MySQL. I generate migration and update database, then I see the collation of table which is different on different machines. On my machine its latin1 - default collation and on a different machine, it is utf8 - default collation.

So due to this behavior, I'm getting an exception while inserting Japanese word (日本語) into a table which has a varchar(64) column. It works fine on another machine which has utf8 - default collation on this table.

EF Core Version: 2.0.1 MySql: 5.7.22 Pomelo.EntityFrameworkCore.MySql: 2.0.1

Robin-Clemens commented 6 years ago

Correct, infact now that you mention it. I saw a change in the 2.1.0 preview branch that probably needs adjustment.

on: https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCore.MySql/blob/f_2.1.0-spec-tests/src/EFCore.MySql/Storage/Internal/MySqlTypeMappingSource.cs

on line 332

var suffix = isAnsi ? " CHARACTER SET latin1" : " CHARACTER SET ucs2";

should (in my opinion) be removed. Let MySQL decide the default charset for columns instead.

for reference (MySQL 8.0) https://dev.mysql.com/doc/refman/8.0/en/charset-applications.html

For applications that store data using the default MySQL character set and collation (utf8mb4, utf8mb4_0900_ai_ci),

Edit: fixed links

viveknuna commented 6 years ago

I'm getting 404 - Page not found on clicking the above links.

Robin-Clemens commented 6 years ago

My apologies, seems I put the links in wrong. Fixed now.

viveknuna commented 6 years ago

So When this be fixed completely?

ajcvickers commented 6 years ago

@Giantaxe So if the application requests that Unicode strings are saved, but the database is not setup with a collation that allows Unicode characters, what is your expectation of what the behavior should be? Also, is there a typically used default collation on MySQL and if so, does it support Unicode characters?

Robin-Clemens commented 6 years ago

@ajcvickers On MySQL 5.7 the default character set was Latin1, even though Oracle recommended using UTF8MB4? (Well as far as i know anyway)

Then UTF8MB4 came around and became the default in MySQL 8.0 (Now in general release) with full support for unicode/emojis etc.

Explicitly defining latin1/ucs2 would make no sense in the case of MySQL. (atleast when you use the default character set MySQL gives you.)

That said, last time I worked with SQL Server (Quite a few years ago), those still use the Latin1 character set as default. So for that I would totally understand explicitly defining another character set when in unicode.

What I personally would recommend would be adhering to the database default character set (Be it latin1, UTF8, UTF8MB4 or anything else) and use a decorator/fluent expression to explicitly define another one if need be.

// Database was made in MySQL 5.7, defaulted to Latin 1.
// Needs to support Emoji here, make it UTF8MB4
[Column(type = "varchar(20) character set utf8mb4")]
public string EmojiMessage { get; set; }

So, in short If the database character set is UTF8MB4, use UTF8MB4 for columns unless explicitly defined. If the database character set is Latin1, use is Latin1 for columns unless explicitly defined.

The only issue remains then is to somehow set a default/custom character set when creating a database through migrations. Though one might argue you could just edit the initial migration and rewrite the create database statement with a character set defined.

This is purely my opinion though so other thoughts are appreciates as well.

caleblloyd commented 6 years ago

I tend to agree with @Giantaxe. I think that most MySQL users expect their string columns to be generated using their database default charset.

Our recommendation out of the gate in the https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCore.MySql#getting-started is to use utf8mb4.

I don't think we've included charset before, so I'm leaning towards keeping the default behavior the same. This actually might be a good Provider-level option though - allow the user to specify Unicode and ANSI charsets otherwise use the database default.

There's a pretty long discussion about it in #84.

AndriySvyryd commented 6 years ago

Using the default charset could be ok for non-key columns. Though this is the opposite of what @viveknuna asked for.

However key and index columns have a max bound on length: 767 and 3072 byte for pre-5.7.7 and for 5.7.7+ respectively. So if the server default charset is used we would need to assume the worst - that it's utf8mb4 and could use up to 4 bytes per character meaning that the max length for the string column in a key or index would be 191 and 768. And this would only be enough for a single string column, so if the user needs to specify a composite key or index involving a string column they would always have to specify the correct max length explicitly to accommodate the other columns.

caleblloyd commented 6 years ago

the max length for the string column in a key or index would be 191 and 768

This is actually a very real pain point, so there may be value in specifying the charset for Key and Index columns. Does the current behavior only specify for Key and Index columns or does it specify for all columns?

Also what is your opinion of implementing it by introducing options in MySqlDbContextOptionsBuilder for WithUnicodeCharSet and WithAnsiCharSet?

AndriySvyryd commented 6 years ago

@caleblloyd Currently the charset is always specified.

A more flexible set of options would be:

caleblloyd commented 6 years ago

I've got a similar design over in #569 now if you would like to comment on it! Once we nail down the design there I'll work on some tests.

moldovans commented 6 years ago

Is there someone who can explain, how to support accents and like é, à, û, by default when creating a CodeFirst migration and having a property, say "public string Message {get; set; }" for the MySQL database or similar?

mguinness commented 6 years ago

@serhioV What is the default collation on your database? The default is that columns will be unicode unless you use .IsUnicode(false) in your DbContext.

moldovans commented 6 years ago

@serhioV What is the default collation on your database? The default is that columns will be unicode unless you use .IsUnicode(false) in your DbContext.

linked with the https://github.com/HeidiSQL/HeidiSQL/issues/356, probably that is rather the MySQL client issue, than the EF one... however is interesting to be able to change the desired encoding to support different chars (Russian Cyrillic or Chinese, by eg)