flychen50 / phpquery

Automatically exported from code.google.com/p/phpquery

HEAD encoding problem #80

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Some files get wrongly re-encoded to UTF-8 even if they are already UTF-8.
This is a problem with PHP's DOM extension. I found an easy workaround for the issue:

see here (in the comments):
http://de2.php.net/manual/en/domdocument.loadhtml.php

$html = str_replace("<head>", '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>', $html);

The meta tag must be the *very first* element in HEAD.
After that, a DOM can be created from the fixed HTML string.
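
A minimal sketch of the workaround described above, wrapped in functions. The names injectUtf8Meta and loadUtf8Html are hypothetical and not part of phpQuery. Unlike the one-liner, the sketch touches only the first literal <head> tag, and it leaves markup without one unchanged.

```php
<?php
// Hypothetical helper (not part of phpQuery): put a UTF-8 content-type
// meta tag directly after the first literal "<head>" tag.
function injectUtf8Meta(string $html): string
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>';
    $pos = strpos($html, '<head>');
    if ($pos === false) {
        return $html; // no plain <head> tag found; leave the markup untouched
    }
    return substr_replace($html, '<head>' . $meta, $pos, strlen('<head>'));
}

// Hypothetical helper: build a DOMDocument from the fixed HTML string.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(injectUtf8Meta($html));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}
```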

I don't know whether you should include this, but there should be a big warning about
this issue.

Also, some websites have BOMs (byte order marks); those should be removed as well.

Original issue reported on code.google.com by robbie.w...@gmail.com on 18 Nov 2008 at 11:16
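
The BOM removal mentioned in the report could look roughly like the sketch below. It assumes the markup is UTF-8 and handles only the UTF-8 BOM (bytes EF BB BF). The function name stripUtf8Bom is hypothetical and not part of phpQuery.

```php
<?php
// Hypothetical helper (not part of phpQuery): drop a leading UTF-8 BOM.
function stripUtf8Bom(string $html): string
{
    $bom = "\xEF\xBB\xBF";
    // Compare the first three bytes only; a BOM elsewhere is left alone.
    if (strncmp($html, $bom, strlen($bom)) === 0) {
        return substr($html, strlen($bom));
    }
    return $html;
}
```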

GoogleCodeExporter commented 9 years ago
You're right, the content-type meta tag should come before any encoded character. To
fix this in all cases, the content-type must be moved to the front of HEAD. Adding a
content-type when there isn't one is already implemented.

The BOM is a separate issue and should be discussed separately.

Original comment by tobiasz....@gmail.com on 19 Nov 2008 at 9:42

GoogleCodeExporter commented 9 years ago
Fixed in r318. The meta content-type is now automatically repositioned as the first
HEAD element.

Original comment by tobiasz....@gmail.com on 7 Dec 2008 at 1:31
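
For readers without access to r318, repositioning could be sketched as below. This is not the code from that revision, only an illustration under simple assumptions: the regexes are approximations, and the function name moveContentTypeFirst is hypothetical.

```php
<?php
// Hypothetical sketch (not the r318 code): move an existing
// <meta http-equiv="Content-Type" ...> tag to the front of HEAD.
function moveContentTypeFirst(string $markup): string
{
    $metaPattern = '/<meta\b[^>]*http-equiv\s*=\s*["\']?content-type["\']?[^>]*>/i';
    $headPattern = '/<head\b[^>]*>/i';

    // Do nothing unless both the meta tag and an opening head tag exist.
    if (!preg_match($metaPattern, $markup, $m) || !preg_match($headPattern, $markup)) {
        return $markup;
    }
    $meta = $m[0];

    // Remove the first matching meta tag from its current position.
    $markup = preg_replace($metaPattern, '', $markup, 1);

    // Re-insert it right after the first opening head tag. A callback keeps
    // any "$" or "\" in the meta tag from being read as a backreference.
    return preg_replace_callback(
        $headPattern,
        function (array $h) use ($meta) {
            return $h[0] . $meta;
        },
        $markup,
        1
    );
}
```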

GoogleCodeExporter commented 9 years ago

Original comment by tobiasz....@gmail.com on 7 Dec 2008 at 1:37

GoogleCodeExporter commented 9 years ago
Hey all, I found a bug with the fix to this issue. The function charsetFixHTML
searches for <head>, but what if the <head> tag has attributes? I ran into this,
and it was breaking the DOM load process. It should really use the following
regex to add the meta tag after the head tag:

$markup = preg_replace('/(<head[^>]*>)/i',"$1".$metaContentType,$markup);

Original comment by johneth...@gmail.com on 25 Oct 2010 at 8:26
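
One caveat with the pattern above: <head[^>]*> also matches tags such as <header>, because [^>]* can consume the trailing "er", and preg_replace without a limit rewrites every match. A hedged variant with a word boundary and a limit of 1 is sketched below. The function name insertMetaAfterHead is hypothetical, and $metaContentType stands for the same meta string as in the comment.

```php
<?php
// Hypothetical variant of the regex fix: insert the meta tag after the
// first opening head tag, with or without attributes or extra spaces.
function insertMetaAfterHead(string $markup, string $metaContentType): string
{
    return preg_replace_callback(
        '/<head\b[^>]*>/i',            // \b stops the pattern from matching <header>
        function (array $m) use ($metaContentType) {
            return $m[0] . $metaContentType;
        },
        $markup,
        1                              // only the first opening head tag
    );
}
```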

GoogleCodeExporter commented 9 years ago
The problem still occurs when the head tag contains a space, like this: <head >.
I used this quick fix:
$html = preg_replace('/(<head[^>]*>)/i', '<head>', $html );

Original comment by printe...@gmail.com on 11 Sep 2012 at 6:34

GoogleCodeExporter commented 9 years ago
The fix from John works great. Thanks

Original comment by tie...@dtn.com.vn on 4 Apr 2013 at 3:06

GoogleCodeExporter commented 9 years ago
I also had to use johneth's solution to make it handle Greek text:

$markup = preg_replace('/(<head[^>]*>)/i',"$1".$metaContentType,$markup);

Original comment by impressi...@gmail.com on 8 Jan 2014 at 2:02

GoogleCodeExporter commented 9 years ago
1: If a meta tag appears before the head tag, errors occur. I use this instead:

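// Normalize the first opening head tag (with or without attributes) to a plain <head>.
$markup = preg_replace('/<head[^>]*?>/sim','<head>',$markup,1,$num);
$headStart = strpos($markup, '<head>');
if($headStart === false)
{
    return;
}
// Everything before <head> is discarded and replaced with a fixed XHTML prologue.
$htmlStart = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>';
// 6 is strlen('<head>'): keep only what follows the opening head tag.
$markup = $htmlStart.$metaContentType.substr($markup, $headStart+6);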

2: The short form <meta charset="gb2312" /> is not matched by the regex.

Original comment by feilala...@gmail.com on 6 Mar 2014 at 5:39
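
A rough sketch of a check that recognizes both declaration styles raised in point 2, the http-equiv form and the short <meta charset="..."> form. The function name findDeclaredCharset is hypothetical and not part of phpQuery, and the regex is an approximation, not a full HTML parser.

```php
<?php
// Hypothetical helper (not part of phpQuery): return the charset declared
// in the first meta tag that carries one, lowercased, or null if none.
function findDeclaredCharset(string $markup): ?string
{
    // Matches charset=... both as a standalone attribute and inside
    // content="text/html; charset=...".
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?\s*([a-z0-9_\-]+)/i';
    if (preg_match($pattern, $markup, $m)) {
        return strtolower($m[1]);
    }
    return null;
}
```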