ampproject / amp-toolbox-php

AMP Optimizer PHP library
Apache License 2.0
73 stars 25 forks source link

Bug: Transformer\MinifyHtml breaks cyrillic letter 'х' followed by whitespace #420

Closed Likhachev closed 2 years ago

Likhachev commented 3 years ago
$unoptimizedHtml = "
А а Б б В в Г г Д д Е е Ё ё
Ж ж З з И и Й й К к Л л М м
Н н О о П п Р р С с Т т У у
Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ъ ъ
Ы ы Ь ь Э э Ю ю Я я";

$transformationEngine = new TransformationEngine();
$errorCollection      = new ErrorCollection; 

$optimizedHtml = $transformationEngine->optimizeHtml(
    $unoptimizedHtml, 
    $errorCollection 
);

Output

А а Б б В в Г г Д д Е е Ё ё
Ж ж З з И и Й й К к Л л М м
Н н О о П п Р р С с Т т У у
Ф ф Х � Ц ц Ч ч Ш ш Щ щ Ъ ъ
Ы ы Ь ь Э э Ю ю Я я

Works only with space right after the letter:

ххххх => хххх� but ххххха => ххххха

The problem in \AmpProject\Optimizer\Transformer\MinifyHtml::normalizeWhitespace

Temporary workaround - turn off whitespace collapsing

$configurationData = [
    Transformer\MinifyHtml::class => [
        Configuration\MinifyHtmlConfiguration::COLLAPSE_WHITESPACE => false,
    ],
];
ediamin commented 3 years ago

@Likhachev Thanks for reporting the bug. I've found that whitespace around Cyrillic х has causes issue when we try to use the vertical tab \v in the preg_replace function. I'll post an update with possible fix.

swissspidy commented 2 years ago

Seeing similar issues with Persian strings like داروی جامع امام رضا + کاربرد و نحوه مصرف دارو:

Screenshot 2021-12-01 at 21 16 35

Appearance is fine again when not using the MinifyHtml transformer.

schlessera commented 2 years ago

This issue should have been solved with #421. @swissspidy can you confirm by testing with 0.9.0+?

swissspidy commented 2 years ago

Can confirm that this is now fixed!

schlessera commented 2 years ago

Fixed via #421