False positive match for certain unicode characters

firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

False positive match for certain unicode characters #2283

Closed kkmuffme closed 2 weeks ago

kkmuffme commented 4 months ago

Bug Description

Reproduction steps

PHP >= 7.3

put regex:

a[–]b

for text:

a–b

Expected Outcome

No match. The pattern should match only when the "u" flag is selected.

See https://github.com/php/php-src/issues/14306
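The expected behavior can be reproduced outside PHP. The following is a minimal sketch that uses Python's bytes-mode `re` as a stand-in for PHP's preg functions without the "u" flag; that equivalence is an assumption made for illustration, not something stated in the report.

```python
import re

pattern = "a[–]b"   # U+2013 EN DASH inside a character class
subject = "a–b"

# Byte mode: roughly how preg sees the strings without "u" (assumption).
# The class [–] becomes a set of three separate bytes (E2 80 93),
# and a character class consumes only one of them.
byte_match = re.search(pattern.encode("utf-8"), subject.encode("utf-8"))

# Code-point mode: roughly what the "u" flag enables.
text_match = re.search(pattern, subject)

print(byte_match)              # None
print(text_match is not None)  # True
```

In byte mode the class consumes the first byte of the dash, and the next byte is not `b`, so the match fails.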

Browser

Chrome 124

OS

Win 11

damianwadley commented 4 months ago

Not a bug, see my reply in the php/php-src issue.

kkmuffme commented 4 months ago

I think you misunderstood the issue. It's NOT a bug in PHP (which is why I had already closed the php-src issue before your reply, once I realized that was the case).

But it is a bug in regex101, because regex101 shows a match even WITHOUT the "u" flag (while PHP does not). This difference in behavior is the bug.

damianwadley commented 4 months ago

Ah, I see what you mean... I'm guessing perhaps the WASM version implicitly supports non-ASCII strings? But I'm not sure what flavor library is involved here, or if it's a custom build specifically for the site.

Sorry @working-name, does seem there is a problem here after all 😓

firasdib commented 4 months ago

Could this be because the website uses UTF-16 while PHP uses UTF-8?
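A rough way to test this hypothesis is to match on 16-bit code units instead of bytes. The sketch below assumes that, with "u" off, the site's engine treats every UTF-16 code unit as one character; the thread only guesses at this. The helper names `to_code_units` and `match_as_code_units` are hypothetical.

```python
import re
import struct

def to_code_units(s: str) -> str:
    """Map each UTF-16 code unit of s to one Python character."""
    raw = s.encode("utf-16-le")
    units = struct.unpack(f"<{len(raw) // 2}H", raw)
    # Characters outside the BMP become two lone surrogates here.
    return "".join(chr(u) for u in units)

def match_as_code_units(pattern: str, subject: str) -> bool:
    """Match with one 'character' per UTF-16 code unit (assumed behavior)."""
    return re.search(to_code_units(pattern), to_code_units(subject)) is not None

print(match_as_code_units("a[–]b", "a–b"))    # True
print(match_as_code_units("a[ä]b", "aäb"))    # True
print(match_as_code_units("a[💩]b", "a💩b"))  # False
```

Under that assumption, characters that fit in one code unit match even without "u", while the emoji splits into two units and does not.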

kkmuffme commented 4 months ago

Possibly. I guess the solution is similar to what already happens when you use a[💩]b: when the "u" flag is not set I see 2 ? boxes, and when I select "u" it shows the emoji correctly. This is a 4-byte emoji in UTF-8, while – is 3 bytes.

Please reopen the issue.

kkmuffme commented 4 months ago

Just tested, and this bug exists for ALL multi-byte characters, even basic ones. E.g. /a[ä]b/ for text aäb: regex101 shows a match, in PHP it's not a match.

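REASON WHY: PHP preg works on single bytes unless the "u" flag is provided. In UTF-8, ä is 2 bytes, – is 3 bytes and 💩 is 4 bytes, so without "u" a class like [–] is a set of 3 separate bytes and can only consume one of them.

In UTF-16, 💩 does not fit in a single 16-bit code unit (it needs a surrogate pair), so it works correctly without "u". But – and ä each fit in one UTF-16 code unit, so they are treated as a single character, which leads to these false positives.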
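For reference, the encoded size of each test character can be checked directly. This short sketch counts UTF-8 bytes and UTF-16 code units; the `sizes` dictionary is only an illustrative name.

```python
sizes = {}
for ch in ("ä", "–", "💩"):
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit
    sizes[ch] = (utf8_bytes, utf16_units)
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 code unit(s)")

# U+00E4: 2 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+2013: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+1F4A9: 4 UTF-8 bytes, 2 UTF-16 code unit(s)
```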

From the PHP manual, on the "u" modifier: "Pattern and subject strings are treated as UTF-8"

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Solution: without the "u" flag, convert pattern and text to ISO-8859-1 (one byte per character), and with the "u" flag to UTF-8?
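One possible reading of this suggestion is to reinterpret the UTF-8 bytes as ISO-8859-1 so that every byte becomes exactly one character. The sketch below illustrates that reading only; it is not regex101's implementation, and `as_single_bytes` and `emulate_match` are hypothetical names.

```python
import re

def as_single_bytes(s: str) -> str:
    """Re-read the UTF-8 bytes of s as ISO-8859-1: one character per byte."""
    return s.encode("utf-8").decode("latin-1")

def emulate_match(pattern: str, subject: str, u_flag: bool) -> bool:
    """Emulate byte-wise matching when the "u" flag is off (assumption)."""
    if not u_flag:
        pattern, subject = as_single_bytes(pattern), as_single_bytes(subject)
    return re.search(pattern, subject) is not None

print(emulate_match("a[ä]b", "aäb", u_flag=False))  # False
print(emulate_match("a[ä]b", "aäb", u_flag=True))   # True
```

Both strings must be re-encoded on every run, and any match offsets would have to be mapped back to the original text.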

firasdib commented 2 weeks ago

Text conversion is going to be very costly, performance-wise. I think we'll have to live with this limitation.