PostgreSQL-For-Wordpress / postgresql-for-wordpress

A maintained fork of https://wordpress.org/plugins/postgresql-for-wordpress/
GNU General Public License v2.0
208 stars 67 forks source link

Wrong textmb4 data type causing issues #119

Open hahnn opened 1 month ago

hahnn commented 1 month ago

There is an issue when translating the sources column in PostgreSQL as shown below.

In this case, the sources column should keep its text data type.

[1717320100.4105] Error running :
CREATE TABLE wp_rank_math_redirections (
                                id bigint(20) unsigned NOT NULL auto_increment,
                                sources text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
                                url_to text NOT NULL,
                                header_code smallint(4) unsigned NOT NULL,
                                hits bigint(20) unsigned NOT NULL default '0',
                                status varchar(25) NOT NULL default 'active',
                                created datetime NOT NULL default '0000-00-00 00:00:00',
                                updated datetime NOT NULL default '0000-00-00 00:00:00',
                                last_accessed datetime NOT NULL default '0000-00-00 00:00:00',
                                PRIMARY KEY  (id),
                                KEY status (status)
                        ) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
---- converted to ----
CREATE TABLE IF NOT EXISTS wp_rank_math_redirections (
                                id bigserial,
                                sources textmb4 COLLATE utf8mb4_bin NOT NULL,
                                url_to text NOT NULL,
                                header_code smallint  NOT NULL,
                                hits bigint  NOT NULL default '0',
                                status varchar(25) NOT NULL default 'active',
                                created timestamp NOT NULL DEFAULT now(),
                                updated timestamp NOT NULL DEFAULT now(),
                                last_accessed timestamp NOT NULL DEFAULT now(),
                                PRIMARY KEY  (id)
                        );
CREATE INDEX IF NOT EXISTS wp_rank_math_redirections_status ON wp_rank_math_redirections (status);
----> ERREUR:  le type « textmb4 » n'existe pas
LINE 3:     sources textmb4 COLLATE utf8mb4_bin NOT NULL,
                    ^
---------------------

That's the same issue there with the from_url column:

[1717320100.4263] Error running :
CREATE TABLE wp_rank_math_redirections_cache (
                                id bigint(20) unsigned NOT NULL auto_increment,
                                from_url text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
                                redirection_id bigint(20) unsigned NOT NULL,
                                object_id bigint(20) unsigned NOT NULL default '0',
                                object_type varchar(10) NOT NULL default 'post',
                                is_redirected tinyint(1) NOT NULL default '0',
                                PRIMARY KEY  (id),
                                KEY redirection_id (redirection_id)
                        ) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
---- converted to ----
CREATE TABLE IF NOT EXISTS wp_rank_math_redirections_cache (
                                id bigserial,
                                from_url textmb4 COLLATE utf8mb4_bin NOT NULL,
                                redirection_id bigint  NOT NULL,
                                object_id bigint  NOT NULL default '0',
                                object_type varchar(10) NOT NULL default 'post',
                                is_redirected smallint NOT NULL default '0',
                                PRIMARY KEY  (id)
                        );
CREATE INDEX IF NOT EXISTS wp_rank_math_redirections_cache_redirection_id ON wp_rank_math_redirections_cache (redirection_id);
----> ERREUR:  le type « textmb4 » n'existe pas
LINE 3:     from_url textmb4 COLLATE utf8mb4_bin NOT NULL,
                     ^
---------------------
mattbucci commented 1 month ago

to me this looks like utf8mb4 got trimmed down to mb4 by mistake rather than being entirely removed as it should be.

Postgres does not have utf8mb4 type, all utf8 are 4 bytes in postgres. only mysql has this oddity of 3 byte utf8

Current behavior

Mysql:

sources text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,

Expected:

sources text COLLATE utf8 NOT NULL,

or less accurately

sources text NOT NULL,

Actual:

sources textmb4 COLLATE utf8mb4_bin NOT NULL,
mattbucci commented 1 month ago

Here's the offending line, this comes from the old code from HAWK

https://github.com/PostgreSQL-For-Wordpress/postgresql-for-wordpress/blob/d8b83c88e29b70e8dc0fb37dc351049f52427f80/pg4wp/rewriters/CreateTableSQLRewriter.php#L21

it would be good to remove all these lines and replace them with a generic parser of some sort

        ' DEFAULT CHARACTER SET utf8mb4' => '',
        ' DEFAULT CHARACTER SET utf8'   => '',
        ' COLLATE utf8mb4_unicode_520_ci'   => '',
        ' COLLATE utf8_general_ci'  => '',
        ' CHARACTER SET utf8' => '',
        ' DEFAULT CHARSET=utf8' => '',

These can also be found in the Alter Table rewriter

First let's address the create table statements which are perfectly valid in mysql but don't work for postgres. utf8mb4 does not exist, neither does the utf8mb4_unicode_520_ci

DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci

at best can be translated to

WITH ENCODING 'UTF8'
LC_COLLATE = 'en_US.utf8'
LC_CTYPE = 'en_US.utf8';

We can replace any instances of DEFAULT CHARACTER SET utf8.*?(?=\s|$|\n) COLLATE utf8.*?(?=\s|$|\n) with this set of rules. Eventually someone will complain that we collate everything to en_US, so instead I think we should just remove it entirely and let users set it when creating the database. so we can just search for DEFAULT CHARACTER SET utf8.*?(?=\s|$|\n) Then search for COLLATE utf8.*?(?=\s|$|\n) and remove it

Looking at our previous replacements we also need to check for =utf8 and replace it so we can search for the following variations

DEFAULT CHARACTER=utf8.*?(?=\s|$|\n) 
DEFAULT CHARACTER = utf8.*?(?=\s|$|\n)