UTF-8 Encoding not being respected

jakejackson1 commented 1 year ago

From querypath created by sylus: technosophos/querypath#94

Hey @technosophos,

Before going into my issue, I just wanted to say I love your work on QueryPath!

As for the issue, I was wondering whether you have any advice on what I could be doing wrong, and why QueryPath seems to ignore the fact that a string is valid UTF-8.

<?php
      // Parse the HTML using QueryPath
      $qp_options = array(
        'convert_from_encoding' => 'UTF-8',
        'convert_to_encoding' => 'UTF-8',
        'strip_low_ascii' => FALSE,
      );

      //Taxonomy
      $this->qp = htmlqp($dbRow->BreadCrumbHTML, NULL, $qp_options);
      $taxonomy = $this->qp->top()->find('ul li:last')->text();

Where the content of $dbRow->BreadCrumbHTML is:

<ul><li style="display:inline;"><a href="/fr/index.html">Accueil</a></li> &gt; <li><a href="/fr/roads_trans/index.html">Routes et transports</a></li>  &gt; <li>Vélo</li></ul>

and the string I get returned for $taxonomy is:

"VÃ©lo"

If I don't use QueryPath and just get the whole text, the UTF-8 is maintained. I also checked that mb_convert_encoding is being called: it works, and the UTF-8 encoding is still intact at that point in xdebug (PHP 5.3.9). Would you have any sage advice on particular routes to debug this further?

jakejackson1 commented 1 year ago

I found a temporary fix by using the following function and wrapping an HTML stub around $taxonomy... Curious why this makes things work?

<?php
  protected function wrapHTML() {
    // We add surrounding <html> and <head> tags.
    $html = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">';
    $html .= '<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
    $html .= $this->html;
    $html .= '</body></html>';
    $this->html = $html;
  }

jakejackson1 commented 1 year ago

When you use htmlqp(), QueryPath (libxml, actually) tries to repair anything that looks broken in an HTML document. Since you are passing a fragment of HTML, it tries to repair it by creating the <html><head/><body/></html> parts. My guess is that when libxml2 does this, it changes the character encoding to ISO-8859-1, which is its preferred character set.

When you wrap it, you keep that fixing stuff from firing.

There are several ways of working around this, but the method you have discovered works just fine.
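
To make the fragment-versus-wrapped difference concrete, here is a minimal sketch that uses DOMDocument directly rather than QueryPath. The helper name last_li_text() and the shortened sample markup are made up for illustration, and the snippet assumes the file is saved as UTF-8 and that the dom extension is loaded. What the bare fragment produces depends on the libxml2 build, so only the wrapped case is something I would rely on.

```php
<?php
// Sketch only: last_li_text() is a hypothetical helper, not part of QueryPath.
// Assumes this file is saved as UTF-8 and ext-dom is available.
$fragment = '<ul><li>Accueil</li><li>Vélo</li></ul>';

function last_li_text(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Return the text of the last <li> in the parsed document.
    $items = $doc->getElementsByTagName('li');
    return $items->item($items->length - 1)->textContent;
}

// Same fragment, wrapped in a stub that declares the charset up front.
$wrapped = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    . '</head><body>' . $fragment . '</body></html>';

echo "bare fragment: ", last_li_text($fragment), "\n";
echo "wrapped:       ", last_li_text($wrapped), "\n";
```

If the two lines differ on your setup, the missing charset declaration is the likely culprit rather than anything QueryPath does afterwards.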

jakejackson1 commented 1 year ago

I found a workaround at http://php.net/manual/en/domdocument.loadhtml.php#95251. It works well for me.

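  $doc = new DOMDocument();
  // Prefix an XML encoding declaration so libxml parses the markup as UTF-8.
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  // Remove the processing-instruction node that the prefix leaves behind.
  // Iterate over a copy, because childNodes is a live list.
  foreach (iterator_to_array($doc->childNodes) as $item) {
    if ($item->nodeType == XML_PI_NODE) {
      $doc->removeChild($item);
    }
  }
  // Record the encoding on the document itself.
  $doc->encoding = 'UTF-8';

  $qp = qp($doc);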

jakejackson1 commented 1 year ago

I found a way; it works fine with UTF-8:

mb_convert_encoding(htmlqp($path,"body")->find("h2")->text(),"ISO-8859-1","UTF-8");
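
The direction of that conversion can look backwards, so here is a small sketch of what it appears to be undoing. It uses only mb_convert_encoding (mbstring), assumes the file is saved as UTF-8, and the variable names are purely illustrative. The assumption is that the text coming out of the parser has had its UTF-8 bytes read as ISO-8859-1 and then been re-encoded as UTF-8.

```php
<?php
// Sketch only: reproduces the garbling without QueryPath (requires mbstring).
// Assumes this file is saved as UTF-8.
$original = "Vélo";

// Misread the UTF-8 bytes as ISO-8859-1 and re-encode them as UTF-8.
// This is the kind of double encoding the thread describes.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Converting from UTF-8 back to ISO-8859-1 recovers the original bytes,
// which are themselves valid UTF-8.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo $garbled, "\n";   // VÃ©lo
echo $restored, "\n";  // Vélo
```

Note that this only round-trips cleanly when the text was garbled in exactly this way; applying it to text that is already correct UTF-8 would damage it.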

jakejackson1 commented 1 year ago

This should definitely be something that's put in QueryPath. I've got a lot of UTF-8 encoded data that I'm working with that I have to perform the workaround on.

To answer your question: The encoding declaration is the only requirement here. I'm able to reproduce this against a lot of data points. $doc->encoding is not required whenever I use this.

jakejackson1 commented 1 year ago

Fascinating. That's something I should add to QueryPath. Do you need both the encoding declaration in loadHTML and the $doc->encoding at the end? Or does just the last one do the trick?

jakejackson1 commented 1 year ago

Really helps! It finally works with Chinese in UTF-8. Thanks for your work.

jakejackson1 commented 1 year ago

Are people experiencing this problem using html5() instead of html()?

jakejackson1 commented 1 year ago

@technosophos html5() seems to work:

header("Content-Type: text/plain");
$kw = "Водка";
$html = "<div>Водка</div>";
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$doc->encoding = 'UTF-8';

$qp = htmlqp($html);
echo "html(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

$qp = htmlqp($doc);
echo "html() on DOMDocument: ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

$qp = html5qp($html);
echo "html5(): ";
print_r($qp->find('div:contains('.$kw.')')->size()."\n");

GravityPDF / querypath

UTF-8 Encoding not being respected #18