elementor / wp2static

WordPress static site generator for security, performance and cost benefits
https://wp2static.com
The Unlicense

Not properly working with unicode websites #220

Closed dimobelov closed 5 years ago

dimobelov commented 5 years ago

Fix: https://github.com/leonstafford/wp2static/blob/e7bc17116859a10c0b8b2c1c95f0215dda3b6ca3/library/StaticHtmlOutput/HTMLProcessor.php#L611

    {
        $processed_html = $this->xml_doc->saveHtml();

        // process the resulting HTML as text
        $processed_html = $this->detectEscapedSiteURLs($processed_html);
        $processed_html = $this->detectUnchangedURLs($processed_html);

        // decode HTML entities (e.g. &#1055;) back into UTF-8 characters
        $processed_html = html_entity_decode($processed_html, ENT_QUOTES, 'UTF-8');

        return $processed_html;
    }
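
To illustrate what the added `html_entity_decode()` call does, here is a minimal standalone sketch. It does not use the plugin's `HTMLProcessor` class; the variable names are made up for the example, and the input string is the `<title>` line from the homepage snippet quoted in this thread. The second half shows a side effect to be aware of when decoding a whole document: markup-significant entities such as `&lt;` and `&amp;` are decoded too.

```php
<?php
// Title line as it appears in the generated index.html without the fix.
$before = '<title>wpnotes | &#1055;&#1086;&#1088;&#1077;&#1076;&#1085;&#1080;&#1103;&#1090; WordPress &#1089;&#1072;&#1081;&#1090;</title>';

// Numeric character references are turned back into UTF-8 text.
$after = html_entity_decode($before, ENT_QUOTES, 'UTF-8');
echo $after, PHP_EOL;

// Side effect: escaped markup characters are decoded as well,
// so text that was safely escaped becomes raw "<" and "&".
$escaped = '<p>1 &lt; 2 &amp; 3</p>';
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
echo $decoded, PHP_EOL;
```

If that side effect matters for a given site, one option (an assumption, not something proposed in this thread) would be to decode only the numeric references for non-ASCII characters.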

leonstafford commented 5 years ago

oh, great, thanks Dimo!

Do you have some before/after source code I could add to a test for this?

leonstafford commented 5 years ago

as in, the HTML source that was previously not being converted properly

dimobelov commented 5 years ago

Snippet from homepage index.html. Without fix:

    <meta name="viewport" content="width=device-width, initial-scale=1">

    <title>wpnotes | &#1055;&#1086;&#1088;&#1077;&#1076;&#1085;&#1080;&#1103;&#1090; WordPress &#1089;&#1072;&#1081;&#1090;</title>

    <meta name="description" content="&#1055;&#1086;&#1088;&#1077;&#1076;&#1085;&#1080;&#1103;&#1090; WordPress &#1089;&#1072;&#1081;&#1090; on wpnotes&hellip;">
    <meta property="og:locale" content="en_US">
    <meta property="og:type" content="website">
    <meta property="og:title" content="wpnotes | &#1055;&#1086;&#1088;&#1077;&#1076;&#1085;&#1080;&#1103;&#1090; WordPress &#1089;&#1072;&#1081;&#1090;">
    <meta property="og:description" content="&#1055;&#1086;&#1088;&#1077;&#1076;&#1085;&#1080;&#1103;&#1090; WordPress &#1089;&#1072;&#1081;&#1090; on wpnotes&hellip;">
    <meta property="og:url" content="https://dimobelov.gitlab.io/wpstatic/">
    <meta property="og:site_name" content="wpnotes">
    <meta name="twitter:card" content="summary">
    ...

With fix:

    <meta name="viewport" content="width=device-width, initial-scale=1">

    <title>wpnotes | Поредният WordPress сайт</title>

    <meta name="description" content="Поредният WordPress сайт on wpnotes…">
    <meta property="og:locale" content="en_US">
    <meta property="og:type" content="website">
    <meta property="og:title" content="wpnotes | Поредният WordPress сайт">
    <meta property="og:description" content="Поредният WordPress сайт on wpnotes…">
    <meta property="og:url" content="https://dimobelov.gitlab.io/wpstatic/">
    <meta property="og:site_name" content="wpnotes">
    <meta name="twitter:card" content="summary">
    <meta name="twitter:title" content="wpnotes | Поредният WordPress сайт">
    <meta name="twitter:description" content="Поредният WordPress сайт on wpnotes…">
    ...

leonstafford commented 5 years ago

Perfect, thanks!

A few other fixes, cleanups and tests being added at the moment, I'll get this into the next release.

dimobelov commented 5 years ago

Btw, it's good to double-decode the output HTML. This is a known issue with Unicode.

    $processed_html = html_entity_decode($processed_html, ENT_QUOTES, 'UTF-8');
    // and again
    $processed_html = html_entity_decode($processed_html, ENT_QUOTES, 'UTF-8');
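
A second pass only changes anything when the input contains double-encoded entities, i.e. an entity whose leading `&` was itself escaped as `&amp;`. The sketch below uses a made-up input string to show that case; whether wp2static's output actually contains such sequences is not established in this thread.

```php
<?php
// Hypothetical double-encoded input: "&#1055;" and "&hellip;" with their "&" escaped.
$double = '&amp;#1055;&amp;hellip;';

// First pass only undoes the outer "&amp;" layer.
$once = html_entity_decode($double, ENT_QUOTES, 'UTF-8');

// Second pass decodes the entities that the first pass exposed.
$twice = html_entity_decode($once, ENT_QUOTES, 'UTF-8');

echo $once, PHP_EOL;
echo $twice, PHP_EOL;
```

On input that is only singly encoded, the second call leaves the string unchanged.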

leonstafford commented 5 years ago

@dimobelov, I'm not having any joy with testing this:

HTMLProcessorUnicodeSupport
 ✘ Unicode output data set "unicode characters in source"
   │
   │ Failed asserting that two strings are equal.
   │ --- Expected
   │ +++ Actual
   │ @@ @@
   │  '<!DOCTYPE html>\n
   │ -<html lang="en-US"><head></head><title>wpnotes | Поредният WordPress сайт</title><body></body></html>\n                              
   │ +<html lang="en-US"><head></head><title>wpnotes | Ð&#159;оÑ&#128;едниÑ&#143;Ñ&#130; WordPress Ñ&#129;айÑ&#130;</title><body></body></html>\n
   │  '
   │
   │ /home/leon/example.com/site/web/app/plugins/static-html-output-plugin/provisioning/tests/HTMLProcessor/unicodeSupportTest.php:48      

Even with multiple decodings... Any ideas?

leonstafford commented 5 years ago

OK, some progress with adding <meta charset="utf-8"/> to the test inputs.
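
The `Ð&#159;` and `Ñ&#128;` sequences in the failing test output are consistent with UTF-8 bytes being read as single-byte ISO-8859-1 characters, which would fit with the charset declaration helping. The sketch below reproduces just that byte arithmetic in plain PHP; `latin1BytesToUtf8()` is a hypothetical helper written for this illustration, not part of wp2static or of PHP's DOM extension.

```php
<?php
// Hypothetical helper: treat every byte as one ISO-8859-1 character
// and re-encode that character as UTF-8.
function latin1BytesToUtf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

$utf8 = "\u{041F}";                    // Cyrillic "П", stored as bytes D0 9F
$misread = latin1BytesToUtf8($utf8);   // "Ð" (U+00D0) followed by U+009F (decimal 159)

echo bin2hex($utf8), PHP_EOL;
echo bin2hex($misread), PHP_EOL;
```

No number of `html_entity_decode()` passes can undo this, because the damage happens when the input bytes are interpreted, before any entities are written.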