ampproject / amp-toolbox-php

AMP Optimizer PHP library
Apache License 2.0
74 stars 23 forks source link

UTF8 Encoding/escaping issue #539

Closed tapan-thapa closed 11 months ago

tapan-thapa commented 11 months ago

Hello,

My raw HTML (AMP HTML) is having application/ld+json schema with "Hindi" characters which is getting converted as below in output.

{
  "@type":"WebPage",
  "name":"Umesh Pal Murder Case: \u0909\u092e\u0947\u0936 \u092a\u093e\u0932 \u0939\u0924\u094d\u092f\u093e\u0915\u093e\u0902\u0921 \u092e\u0947\u0902 \u0906\u091c \u0926\u093e\u0916\u093f\u0932 \u0939\u094b\u0917\u0940 \u091a\u093e\u0930\u094d\u091c\u0936\u0940\u091f, \u0907\u0928 9 \u0906\u0930\u094b\u092a\u093f\u092f\u094b\u0902 \u0915\u0947 \u0939\u094b \u0938\u0915\u0924\u0947 \u0939\u0948\u0902 \u0928\u093e\u092e | Newstrack in Hindi",
  "description":"Umesh Pal Murder Case: \u092a\u094d\u0930\u092f\u093e\u0917\u0930\u093e\u091c \u092a\u0941\u0932\u093f\u0938 \u0907\u0938 \u092e\u093e\u092e\u0932\u0947 \u092e\u0947\u0902 \u0906\u091c \u092f\u093e\u0928\u0940 \u0936\u0941\u0915\u094d\u0930\u0935\u093e\u0930 26 \u092e\u0908 \u0915\u094b \u0905\u0926\u093e\u0932\u0924 \u092e\u0947\u0902 \u091a\u093e\u0930\u094d\u091c\u0936\u0940\u091f \u0926\u093e\u0916\u093f\u0932 \u0915\u0930\u0928\u0947 \u091c\u093e \u0930\u0939\u0940 \u0939\u0948\u0964 \u0907\u0938 \u091a\u093e\u0930\u094d\u091c\u0936\u0940\u091f \u092e\u0947\u0902 \u0909\u0928 9 \u0906\u0930\u094b\u092a\u093f\u092f\u094b\u0902 \u0915\u0947 \u0928\u093e\u092e \u092d\u0940 \u0936\u093e\u092e\u093f\u0932 \u0939\u094b \u0938\u0915\u0924\u0947 \u0939\u0948\u0902, \u091c\u094b \u092b\u093f\u0932\u0939\u093e\u0932 \u092a\u0941\u0932\u093f\u0938 \u0915\u0940 \u0917\u093f\u0930\u092b\u094d\u0924 \u092e\u0947\u0902 \u0939\u0948\u0902\u0964\u0026amp;nbsp;"
}

Can we do something so that i get proper UTF-8 back as available in raw HTML?

westonruter commented 11 months ago

Is there a specific reason other than wanting to not see encoded characters in the output?

I believe the code responsible for this is is the MinifyHtml transformer:

https://github.com/ampproject/amp-toolbox-php/blob/56c812508f5ebe538036d75cf0c21de094b316d3/src/Optimizer/Transformer/MinifyHtml.php#L254-L278

Note how the json_encode() call lacks JSON_UNESCAPED_UNICODE.

tapan-thapa commented 11 months ago
  1. We are into the news business and every other publisher which i know is having "unescaped unicode" content in the view source. I am in the impression that if i don't send "unescaped unicode" content in the body, our SEO ranking may be impacted but i am not sure on this.

  2. Thanks for pointing out to the right part of the code. After adding "JSON_UNESCAPED_UNICODE" in the json_encode, my issue is resolved.

Can we consider this as a permanent change in this library? Or you have any specific reason not to use "JSON_UNESCAPED_UNICODE" in the json_encode function?

westonruter commented 11 months ago

I don't believe there is any reason why JSON_UNESCAPED_UNICODE wasn't used.

@thelovekesh Would you please add that flag and check if it has any unintended side effects?

tapan-thapa commented 11 months ago

Just to update, I am running production with this flag since last 2 days and not faced any production issue.

thelovekesh commented 11 months ago

@westonruter I have tested the AMP plugin with the JSON_UNESCAPED_UNICODE flag and it's working as expected.

westonruter commented 11 months ago

@tapan-thapa Curious about your tech stack. How are you using this package in your project?

tapan-thapa commented 11 months ago

@westonruter We have a "Codeigniter" based project in which i am using this library via composer. I am giving fully baked AMP HTML to this library and getting optimised AMP HTML in return.