From querypath created by nebojsac: technosophos/querypath#200
tl;dr: Zero-width character becomes `&#8203;` in output
Hi there,
I was able to create a minimal test case that reproduces the issue with PHPUnit:
<?php
public function testQueryPathIssue()
{
    // The body starts with a zero-width character, written as an escape so it stays
    // visible (assumed to be U+200B, matching the &#8203; in the output below).
    $html = "<html><head></head><body>\u{200B}Hello!</body>";
    require_once(APP . 'Vendor' . DS . 'QueryPath' . DS . 'qp.php');
    $qpOptions = [
        'convert_from_encoding' => 'UTF-8',
        'convert_to_encoding' => 'UTF-8',
        'strip_low_ascii' => FALSE,
    ];
    $qp = htmlqp($html, NULL, $qpOptions);
    // result is:
    /*
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head></head>
    <body>&#8203;Hello!</body>
    </html>
    */
}
As you can see, it replaces the zero-width character with `&#8203;`. Is this normal?
Something similar happens with typographic quote marks, such as this character: ’
It's not a show-stopping issue. We're working around it by running this on the HTML before passing it to QueryPath:
Maybe that helps someone else. Is this an issue with QueryPath, PHP, or the encoding?
The issue does not happen if I remove the `convert_from_encoding` and `convert_to_encoding` parameters.
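For anyone looking for a starting point, a pre-processing step of this kind could be sketched roughly as follows. This is only an illustration and not necessarily the workaround referred to above: `stripZeroWidthCharacters` is a hypothetical helper (it is not part of QueryPath), and it assumes the zero-width characters carry no meaning in your content, so removing them before parsing is acceptable.

```php
<?php
// Hypothetical helper, not part of QueryPath: remove zero-width characters
// from a UTF-8 string before handing it to an HTML parser.
function stripZeroWidthCharacters(string $html): string
{
    // U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER,
    // U+200D ZERO WIDTH JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM).
    $clean = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $html);

    // preg_replace() returns null on failure (for example, invalid UTF-8);
    // in that case fall back to the untouched input.
    return $clean ?? $html;
}

// "\u{200B}" is the zero-width space written as an escape (PHP 7.0+).
$html = "<html><head></head><body>\u{200B}Hello!</body>";
echo stripZeroWidthCharacters($html), PHP_EOL;
// prints: <html><head></head><body>Hello!</body>
```

Note that U+200C and U+200D are meaningful in some scripts and in emoji sequences, so for such content the character class would need to be narrowed to U+200B and U+FEFF only. This sketch also does nothing for typographic quotes such as ’, which would need separate handling.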