craftcms / cms

Build bespoke content experiences with Craft.
https://craftcms.com

[4.x]: Page with non-ASCII characters in the slug returns 200 #16043

Closed Romanavr closed 3 weeks ago

Romanavr commented 3 weeks ago

What happened?

Description

More context for this problem can be found in the following issue: https://github.com/putyourlightson/craft-blitz/issues/730

The main problem for me is that it's possible to visit a page using a URL that contains non-ASCII characters, even though the slug is ASCII only (plain English) and 'limitAutoSlugsToAscii' => true is set.

As I understand it, Craft checks for non-ASCII characters and replaces them while looking for the matching route. I would like to have an option to disable this behaviour, as it causes problems in my case (with the Blitz plugin). At the very least, it would make sense to do this when limitAutoSlugsToAscii is set to true.

Steps to reproduce

  1. Create any page with an English slug that contains only ASCII characters, e.g. https://putyourlightson.com/plugins (plugins)
  2. Visit this page with some characters replaced by non-ASCII ones, for example https://putyourlightson.com/plugi%C5%84s (plugińs)

Expected behavior

The page should return a 404 and say that no such page was found.

Actual behavior

The page returns 200 and loads the original content.

Craft CMS version

Pro 4.12.0

PHP version

8.2.13

Operating system and version

Ubuntu

Database type and version

MySQL 8.0.39

Image driver and version

No response

Installed plugins and versions

i-just commented 3 weeks ago

Hi, thanks for reaching out!

First of all, let me clarify that the only purpose of limitAutoSlugsToAscii is to ensure that auto-generated slugs contain only ASCII characters. Once that’s done, there’s nothing stopping a user from amending the slug to contain non-ASCII characters.

That said, I see the behaviour you’re referring to. It comes from your database (from MySQL), not Craft, and it varies depending on the selected charset and collation for your tables.

When a web request is made and it reaches Craft, Craft will first figure out where to route it.

In this case, we’re talking about having an entry with a slug of plugins. So, when UrlManager->_getMatchedElementRoute() is reached, it’ll eventually work its way down to attempting to find an element by URI. To do so, it’ll perform a query that attempts to find an element that has the elements_sites.uri set to the requested path. Querying the elements_sites table where uri is plugins and then performing the same request for uri plugińs can return the exact same element depending on what charset and collation you’re using.

For example, if I use charset utf8mb4 and collation utf8mb4_0900_ai_ci, querying where uri is plugins and then where uri is plugińs returns the exact same element. If I change my collation to, e.g. utf8mb4_pl_0900_ai_ci, querying where uri is plugins will return an entry, but querying where uri is plugińs will not.
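To make the collation effect concrete, here is a rough Python sketch of how an accent- and case-insensitive comparison (the "ai_ci" part of the collation name) can treat plugins and plugińs as equal, next to a strict byte-wise comparison that does not. This only approximates the idea: MySQL's real collations use the Unicode Collation Algorithm with language-specific tailoring (which is how the Polish collation keeps ń distinct from n), and the function names below are made up for illustration; none of this is Craft or MySQL code.

```python
import unicodedata


def fold_accents_and_case(text: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    Assumption: stripping combining marks after NFD decomposition is only a
    stand-in for what an "_ai_ci" collation does; it is not MySQL's algorithm.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (e.g. the acute accent that turns "n" into "ń").
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches_loosely(stored_uri: str, requested_path: str) -> bool:
    """Hypothetical helper: compare the way an accent-insensitive lookup might."""
    return fold_accents_and_case(stored_uri) == fold_accents_and_case(requested_path)


def matches_exactly(stored_uri: str, requested_path: str) -> bool:
    """Hypothetical helper: strict comparison of the UTF-8 bytes."""
    return stored_uri.encode("utf-8") == requested_path.encode("utf-8")


if __name__ == "__main__":
    stored = "plugins"
    requested = "plugińs"  # decoded form of plugi%C5%84s
    print(matches_loosely(stored, requested))
    print(matches_exactly(stored, requested))
```

With the loose comparison both paths resolve to the same stored URI, which mirrors the 200 response described in the issue; with the strict comparison the second path finds nothing, which is the 404 the reporter expected.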

I hope this helps clarify things!

I’ll close this now, but feel free to reply if you run into any further issues.

Romanavr commented 2 weeks ago

Absolutely, thank you very much for the detailed explanation!