backdrop / backdrop-issues

Issue tracker for Backdrop core.
144 stars 40 forks source link

Make database collation consistent and configurable #6711

Open izmeez opened 2 months ago

izmeez commented 2 months ago

Description of the need

I'm adding this as a feature request because it is a request to address existing long standing methods in Drupal and Backdrop that need to be updated.

While database character set is more widely understood and utf8mb4 is the accepted norm for Backdrop and other CMS's, collation is not as well understood or appreciated. Collation refers to sort method and order which is important for some languages. Early methods such as _general_ci used short cuts for performance because of limitations of older hardware. This is no longer an impediment and there has been more attention to accuracy in the newer version _unicode_ci and even newer versions _unicode_520_ci and _900_ai_ci. This is explained in a number of posts on the Internet including, https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci.

Meanwhile, Backdrop uses the older _general_ci as its fallback or default.

In Backdrop CMS there are 3 main issues:

  1. Backdrop install appears to use the server default for collation.

Even though Backdrop uses utf8mb4_general_ci collation as it's fallback or default, a standard install appears to use the server deafult. Mysql8 default collation is _900_ai_ci, although if the server default is set differently this will affect the outcome. bd-db-collation-mysql8-default_900_ai_ci-screenshot

  1. Configuring database collation.

The documentation at https://docs.backdropcms.org/documentation/database-configuration shows that the character set can be configured by adding the line $database_charset = 'utf8mb4'; to the settings.php file. Unfortunately, there is no equivalent for collation such as $database_collation = 'utf8mb4_unicode_ci';. To configure the database collation one must use the D7 style database array.

$databases['default']['default'] = array(
  'driver' => 'mysql',
  'database' => 'database_name',
  'username' => 'username',
  'password' => 'password',
  'host' => 'localhost',
  'port' => '',
  'charset' => 'utf8mb4',
  'collation' => 'utf8mb4_unicode_ci',
);

While this works during installation to create tables with the desired collation it does not change the collation for new tables from the server default. bd-db-collation-mysql8-set-unicode_ci-screenshot

  1. Backdrop convert database changes collation to _general_ci

Regardless of the first two points when Backdrop converts a database, as in the case of a D7 upgrade, the existing tables and the setting for new tables are converted to collation _general_ci. bd-D7-upgrade-db-converted-collation-utf8mb4_general_ci-screenshot

Proposed solution

A preferred solution would be to have the ability to configure the desired collation with a simple setting such as $database_collation = 'utf8mb4_unicode_ci'; in the settings.php file, and have it apply during installation and for new tables after install and also apply to all tables if or when the database requires converting.

For mysql8 this may be changed to $database_collation = 'utf8mb4_900_ai_ci';

Alternatives that have been considered

I am not sure if there a Drupal solution, or a solution used by other CMS's, that accomplishes this.