Prevent Unicode conversion to 8-bit code page in wp-db.php

srutzky commented 3 years ago

This fixes #417 .

Add upper-case "N" to quoted string parameter. Without this (and without the database in which Project Nami is installed having a UTF-8 default collation, and most do not), then Unicode characters get converted to the 8-bit code page associated with the default collation of the database in which Project Nami is installed.

In both approaches, UTF-8 mode does need to be enabled for the SQL Server driver / connection.

This change (adding a single character) will fix Project Nami running on versions of SQL Server prior to 2019 (which introduced the _UTF8 collations).

While I've not tested this change myself, minimal testing has been done as noted in the following forum thread (starting at the linked post):

https://www.sqlservercentral.com/forums/topic/can-post-please-be-stored-in-nvarchar#post-3834511

Still, it would probably be a good idea to do more extensive testing, just to be sure given the nature of this change.

Take care, Solomon... https://SqlQuantumLift.com/ https://SqlQuantumLeap.com/ https://SQLsharp.com/

srutzky commented 3 years ago

I just checked the schema and can't find any columns of the following datatypes:

DATE
TIME
DATETIMEOFFSET
UNIQUEIDENTIFIER
SQL_VARIANT
DATETIME (all "timestamps" are stored as DATETIME2)

There are only 11 date columns (technically 14, but 3 of them have a duplicate in GMT), and some appear to be non-updatable (e.g. "registered_date").

So that reduces the number of potential issues.

Looking at the schema I found other areas that should be tested.

Some dates that might not have been tested:

Email User signing up / registering (signups table)
User activating their registration (signups table)
WordPress follower registration (registration_log table, I think)
Updating a user (users table)
Creating / Updating a link (links table)
Updating blog metadata / appearance / layout / etc (blog_versions table)
Creating / updating blogs on a multi-site installation (blogs table)

Some string columns that might not have been tested:

Terms (are these Categories? they have a "slug" and a taxonomy that includes a "description") (terms and term_taxonomy tables)
Comments: "author" (comments table)
Links: "name", "description", and "notes" (links table)
Options? "name" (options table)
Posts: "name" (different from "title"), "excerpt", and "password" (posts table)
Users: "login", "password", "nicename", and "display_name" (users table)
Signups (WordPress users, I think): "title" (their blog title, I think) (signups table)

patrickebates commented 3 years ago

Terms are both Categories and Tags. Hopefully I'll have time this weekend to dig into this myself. Thanks for the research on this.

patrickebates commented 3 years ago

Regarding the consistent use of DATETIME2, it was the closest match I could find to the expected behavior of MySQL dates. Even then, there was still the problem of the normal default date in WP being invalid to SQL Server.

srutzky commented 3 years ago

@patrickebates Yer welcome 😺. And sounds good re: testing / DATETIME2 / etc.

To have everything in one place, here is the testing that was discussed (and Thom A. did and found to be successful) in that SqlServerCentral.com/forums thread (all non-Date tests assume using one or more supplementary characters):

Create / Update Post / Page. Check for:
1. Title
2. Content
3. Description
4. Post Slug
After post is created, change post date to a future date (or a different future date).
Create / Update Tags
Create / Update Categories

It was also suggested that we test filenames for media (are those stored in the links table?). For example, you can have (or at least upload from):

C:\some\path\👾.png
C:\some\path\test_𒍅𒍖.xlsx

Finally, there are some areas that I didn't include in the previous comments list of tables/columns:

Post / Comment / User meta — names or values
Email addresses
URLs

If everything else works, I don't see how these wouldn't. And it would be rare (though not impossible) to even have non-standard ASCII characters in those columns.

What about plugins? In that RegEx issue I worked on 2 months ago you stated that there are plugins that use their own SQL directly instead of the WP query objects. I just did a quick check of translations.php and don't see a generic "prepare" handling like this PR deals with, but I did find some areas of concern (not an exhaustive list):

https://github.com/ProjectNami/projectnami/blob/master/wp-includes/translations.php#L1241 (translate_replace_casting)
https://github.com/ProjectNami/projectnami/blob/master/wp-includes/translations.php#L1316 (translate_column_type — very minor)
https://github.com/ProjectNami/projectnami/blob/master/wp-includes/translations.php#L1460 (translate_incompat_data_type — not sure why varchar and nvarchar are included in this list, if I am reading it correctly that it translates = into LIKE because text and ntext can't be compared with =, <>; either way, that loop looks like a place that might need the N added if the comparisons can be to string literals)
Unsure about:
- https://github.com/ProjectNami/projectnami/blob/master/wp-includes/translations.php#L1950 (on_update_to_merge)
- https://github.com/ProjectNami/projectnami/blob/master/wp-includes/translations.php#L2138 (translate_if_not_exists_insert_merge)
- These might be just too much to deal with, not worth the trouble, and plugins using such statements might just be on their own 😉

patrickebates commented 3 years ago

Media is actually stored as posts. I believe the Links table was abandoned in Core shortly after we started work on PN. The functionality was moved to a plugin and the table left in place.

patrickebates commented 3 years ago

There's a function we added to resolve some issues with writing options that bypasses Prepare. What are your thoughts on changes it might need? https://github.com/ProjectNami/projectnami/blob/1ca8c1baaf856d01d972eeb0d13015bca29cedf2/wp-includes/option.php#L582 https://github.com/ProjectNami/projectnami/blob/1ca8c1baaf856d01d972eeb0d13015bca29cedf2/wp-includes/wp-db.php#L3738

srutzky commented 3 years ago

Hi @patrickebates

Media is actually stored as posts.

So, "comments", which are structurally quite similar to posts/pages have their own table, yet "media" which appears to be not nearly as similar to posts/pages are stored in the "posts" table? Sometimes I wonder if the folks who originally designed WordPress made an agreement with hardware vendors to get a kick-back on sales of CPU, RAM, and storage for systems running WordPress ;-). I mean, there are some parts of that data model that are sheer insanity.

There's a function we added to resolve some issues with writing options that bypasses Prepare. What are your thoughts on changes it might need?

As far as I can see, there don't appear to be any issues in either of those locations due to them using parameterized queries instead of string literals. As long as the parameters are declared as being NVARCHAR / Unicode string, then nothing more to do there.

patrickebates commented 3 years ago

The data model is beyond insanity. After getting this deep in Core, I think Spencer and I are qualified to declare that. I would go so far as to say that Spencer is one of only a handful of people currently with Automattic who knows just how twisted it is...

The parameterized query is a bit tricky as well. SqlSrv operates strictly at the ODBC level. There's no way to declare type on the parameters that I'm aware of. Can't name parameters either, and therefore can't pass them out of order or reuse a single parameter in multiple places in the query. Have to pass it multiple times in the correct position.

LarnuUK commented 3 years ago

There's a function we added to resolve some issues with writing options that bypasses Prepare. What are your thoughts on changes it might need?

https://github.com/ProjectNami/projectnami/blob/1ca8c1baaf856d01d972eeb0d13015bca29cedf2/wp-includes/option.php#L582

https://github.com/ProjectNami/projectnami/blob/1ca8c1baaf856d01d972eeb0d13015bca29cedf2/wp-includes/wp-db.php#L3738

Hmm, looking at that query, that looks like a MERGE antipattern. Checking if a row exists first, then inserting it if it wasn't found, and otherwise updating said row firstly results in more seeks/scans against the table, but also (and far worse), it could suffer a race condition.

This is actually made even worse by the NOLOCK hint there, as even if you did have an explicit table or row lock (which you would need to stop the race condition) that lock wouldn't stop the EXISTS query. Also, if a row was being inserted with the value in another transactions, and then the other transaction was rolled back the UPDATE statement would update no rows in this transaction (as the row no longer exists).

Bertrand writes a great article on this, but ideally that statement should probably be changed as it could easily result in (very) unexpected behaviour.

I would attempt this, but the parameter count will change, and my familiarity with PHP is very old; I haven't worked with it for about 15 years.

patrickebates commented 3 years ago

That lock hint is actually a holdover from some work done prior to fully converting it to parameters. Might need to be changed to an UPDLOCK at this point.

LarnuUK commented 3 years ago

Using UPDLOCK would certainly stop the race conditions. Changing the ordering, to attempt an UPDATE and then only INSERT if no rows are effected would be the ideal, however, that would require changes to the calling PHP as well, as it would result in fewer parameters being passed.

srutzky commented 3 years ago

@patrickebates Hi there. I was just checking in regarding the status of this PR. I realize that there was additional discussion from @LarnuUK re: the "MERGE antipattern" and NOLOCK vs UPDLOCK, but those two items are unrelated and could/should be dealt with separately. Is the delay a matter of waiting for resolution of those items, or a matter of testing, or something else? Just curious.

@LarnuUK Speaking of testing, I assume you have been running your installation with this change for the past 9 months? If so, have you run into any problems or potentially odd behavior? I don't have time or ability to do this testing myself so any feedback would help. Thanks!

LarnuUK commented 3 years ago

@srutzky yep, I have been on a couple of sites and have had no issues.

patrickebates commented 3 years ago

Going to push this with the update for WP 5.8.1

adam-czaplinski-vavatech commented 10 months ago

I think this fix was mistakenly reverted during merge 4b6b3d4 „Initial merge with WP 6.1.1” :(

LarnuUK commented 10 months ago

Looks like the file wp-includes/wp-db.php has been deprecated, @adam-czaplinski-vavatech and replaced by wp-includes/class-wpdb.php, however, the new file does have the change made from this merge code on line 1470:

$query = preg_replace( '/(?<!%)%s/', " N'%s'", $query ); // Quote the strings, avoiding escaped strings like %%s.

adam-czaplinski-vavatech commented 10 months ago

Thank you, yes, you are right. But the fix does not exist in the current master nor latest versions, does it? Now I think another merge - commit 467da7e „Initial merge with WP 6.2.0” reverts this fix.

LarnuUK commented 10 months ago

Yes, the commit to merge 6.2.0 completely removes the line, @adam-czaplinski-vavatech :

LarnuUK commented 10 months ago

I've posted an Issue, https://github.com/ProjectNami/projectnami/issues/515, to note the regression.

srutzky commented 10 months ago

I've posted an Issue, #515, to note the regression.

@LarnuUK and @adam-czaplinski-vavatech : I am looking into this and should be able to submit a pull request in just a moment.

ProjectNami / projectnami

Prevent Unicode conversion to 8-bit code page in wp-db.php #422