GravityPDF / querypath

A fork of QueryPath: a PHP library for HTML(5)/XML querying (CSS 4 or XPath) and processing (like jQuery) with PHP8.3 support

ZeroWidth character with UTF8 encoding causes weird output #5

Closed: jakejackson1 closed this 1 year ago

jakejackson1 commented 1 year ago

From querypath created by nebojsac: technosophos/querypath#200

tl;dr: A zero-width character becomes &acirc;&#128;&#139; in the output

Hi there,

I was able to create a minimal test case to repeat the issue with PHPUnit:

<?php
    public function testQueryPathIssue()
    {
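        // "\xE2\x80\x8B" is U+200B ZERO WIDTH SPACE, written as escaped UTF-8 bytes so it stays visible
        $html = "<html><head></head><body>\xE2\x80\x8BHello!</body>";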
        require_once(APP . 'Vendor' . DS. 'QueryPath'. DS . 'qp.php');
        $qpOptions = [
            'convert_from_encoding' => 'UTF-8',
            'convert_to_encoding' => 'UTF-8',
            'strip_low_ascii' => FALSE,
        ];
        $qp = htmlqp($html, NULL, $qpOptions);
        // result is:
        /*
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head></head>
<body>&acirc;&#128;&#139;Hello!</body>
</html>
         */

    }

As you can see, it replaces the zero-width character with &acirc;&#128;&#139;. Is this normal? It gives similar results with curly quote marks such as ’ and ‘.

It's not a show-stopping issue. We're working around it by running this on the HTML before passing it to QueryPath:

        $html = str_replace("\xE2\x80\x8B", "", $html);
        $html = str_replace(["‘", "’"], ["&lsquo;", "&rsquo;"], $html);
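
For reuse, the same workaround could be wrapped in a small helper. This is only a sketch: `sanitizeForQueryPath` is a hypothetical name, not part of QueryPath, and the characters are written as escaped UTF-8 bytes so that invisible ones cannot get lost when the code is copied. It maps the left quote to &lsquo; and the right quote to &rsquo;.

```php
<?php
// Hypothetical helper (not part of QueryPath): pre-clean HTML before parsing.
function sanitizeForQueryPath(string $html): string
{
    return strtr($html, [
        "\xE2\x80\x8B" => '',         // U+200B ZERO WIDTH SPACE: drop it
        "\xE2\x80\x98" => '&lsquo;',  // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x99" => '&rsquo;',  // U+2019 RIGHT SINGLE QUOTATION MARK
    ]);
}

echo sanitizeForQueryPath("<body>\xE2\x80\x8BIt\xE2\x80\x99s here</body>"), PHP_EOL;
// <body>It&rsquo;s here</body>
```

The cleaned string would then be passed to htmlqp() in place of the raw HTML.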

Maybe that helps someone else. Is this an issue with QueryPath, PHP, or the encoding? The issue does not happen if I remove the convert_from_encoding and convert_to_encoding parameters.
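
One possible reading of the output, which is an assumption and not something confirmed in this thread: the three UTF-8 bytes of U+200B (E2 80 8B) are each being treated as a separate single-byte character somewhere along the way. Byte 0xE2 is â (&acirc;) in ISO-8859-1, and 0x80 and 0x8B are 128 and 139, which lines up with &acirc;&#128;&#139;. The sketch below only demonstrates that byte arithmetic; `bytesAsLatin1Entities` is a made-up helper and does not call QueryPath.

```php
<?php
// U+200B ZERO WIDTH SPACE encoded as UTF-8 is the three bytes E2 80 8B.
$zeroWidthSpace = "\xE2\x80\x8B";

// Hypothetical helper: treat every byte as its own single-byte character
// and write it as a decimal numeric character reference.
function bytesAsLatin1Entities(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $out .= '&#' . ord($byte) . ';';
    }
    return $out;
}

echo bytesAsLatin1Entities($zeroWidthSpace), PHP_EOL;
// &#226;&#128;&#139;   (&#226; is the same character as &acirc;)
```

If that reading is right, the string is being decoded with a single-byte charset at some step, which would fit with the problem going away when the two encoding options are removed.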