www.newsmax.com/m may suck, but it should parse

KawaiiBASIC / classilla

Automatically exported from code.google.com/p/classilla

www.newsmax.com/m may suck, but it should parse #184

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Reported by Walt. The Newsmax mobile site parses on TenFourFox, but Classilla 
reports an error:

XML Parsing Error: not well-formed
Location: http://www.newsmax.com/m
Line Number 107, Column 71:
<img src="C:\inetpub\wwwroot\ProdCMSV3_0\ga.aspx?utmac=UA-31221-1&utmn=2095532446&utmr=-&utmp=%2fCMSTemplates%2fNewsmax%2fMobileSiteCMS%2fDefault.aspx%3faliaspath%3d%252fmobilehome%252fDefault&guid=ON" />
----------------------------------------------------------------------^

This is clearly awful XML, but it should parse.

Original issue reported on code.google.com by classi...@floodgap.com on 21 Jan 2012 at 1:37

Attachments:

newsmax.xml

GoogleCodeExporter commented 9 years ago

It looks like the parser is treating the bare & in the URL as the start of an 
entity reference. That is correct for XML, but the server identifies the page 
as HTML, and parsed as HTML this would work. So we are sniffing the document 
type wrong.
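
To illustrate the difference, here is a minimal sketch using Python's standard library rather than Classilla's own parsers (expat and the Mozilla HTML parser). The `SNIPPET` string is a shortened stand-in for the failing line, and `parses_as_xml` and `ImgSrcCollector` are hypothetical helper names invented for this example. A strict XML parser rejects the unescaped ampersand, while a lenient HTML parser keeps the attribute.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the failing <img> line (assumption, not the full URL).
SNIPPET = '<img src="ga.aspx?utmac=UA-31221-1&utmn=2095532446&guid=ON" />'


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A bare '&' that does not start a valid entity reference is not well-formed.
        return False


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


collector = ImgSrcCollector()
collector.feed(SNIPPET)

print("accepted as XML:", parses_as_xml(SNIPPET))
print("img sources seen as HTML:", collector.sources)
```

Escaping the ampersands as `&amp;` would make the same line acceptable to the XML parser, which is what the site should have done.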

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:47

GoogleCodeExporter commented 9 years ago

Trying...
Connected to newsmax.com.
Escape character is '^]'.
GET /m HTTP/1.0
Host: www.newsmax.com
Connection: close

HTTP/1.1 200 OK
Cache-Control: no-cache,private, no-store, must-revalidate
Content-Length: 8077
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.0
X-AspNet-Version: 2.0.50727
Set-Cookie: CMSPreferredCulture=en-US; expires=Mon, 21-Jan-2013 01:45:20 GMT; path=/
Set-Cookie: ASP.NET_SessionId=d1u3e245ilhwhc550zbeoham; path=/; HttpOnly
X-Powered-By: ASP.NET
X-UA-Compatible: IE=7
Date: Sat, 21 Jan 2012 01:45:19 GMT
Connection: close

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:47

GoogleCodeExporter commented 9 years ago

<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" 
"http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

Original comment by classi...@floodgap.com on 21 Jan 2012 at 1:48

GoogleCodeExporter commented 9 years ago

Current suspect: htmlparser/src/nsParser.cpp:DetermineParseMode

We'll throw a breakpoint in there when we're ready to debug this.

Original comment by classi...@floodgap.com on 21 Jan 2012 at 2:51

GoogleCodeExporter commented 9 years ago

Actually, MIME type detection is not failing, because newsmax declares itself 
as XML:

<!-- Mobile Meta Tags -->
    <meta http-equiv="Content-type" content="application/xhtml+xml; charset=utf-8" />

The only way around this is to relax the parser. Yuck.
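
As a rough sketch of the mismatch (Python standard library only, not how Classilla detects types), the HTTP header and the meta tag can be compared side by side. The helper names `mime_type_of` and `MetaContentType` are hypothetical; the two input strings are copied from the response header and the meta tag quoted in this issue.

```python
from email.message import Message
from html.parser import HTMLParser


def mime_type_of(content_type_value):
    """Return the bare MIME type from a Content-Type value, dropping parameters."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_type()


class MetaContentType(HTMLParser):
    """Record the MIME type declared by <meta http-equiv="Content-type">."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and equiv == "content-type":
            self.declared = mime_type_of(attributes.get("content", ""))


# Values quoted earlier in this issue.
HTTP_HEADER = "text/html; charset=utf-8"
META_TAG = ('<meta http-equiv="Content-type" '
            'content="application/xhtml+xml; charset=utf-8" />')

meta = MetaContentType()
meta.feed(META_TAG)

print("HTTP header says:", mime_type_of(HTTP_HEADER))
print("meta tag says:   ", meta.declared)
```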

Original comment by classi...@floodgap.com on 1 Feb 2012 at 2:37

GoogleCodeExporter commented 9 years ago

Altering expat so that it keeps parsing past XML_TOK_INVALID leads to 
"success", but leaves holes in the page.

Maybe the simplest way is just to force application/xhtml+xml to be parsed as 
HTML. This is wrong, but no more wrong than other hacks we do.

Original comment by classi...@floodgap.com on 1 Feb 2012 at 3:26

GoogleCodeExporter commented 9 years ago

We forced application/xhtml+xml to be parsed as HTML, and now the site works.

Let's see if this breaks anything.

Original comment by classi...@floodgap.com on 19 Feb 2012 at 5:07

Changed state: Started

GoogleCodeExporter commented 9 years ago

It breaks about: (since about: needs to be parsed as xhtml). Maybe we add an 
exception for this.

Original comment by classi...@floodgap.com on 4 Mar 2012 at 2:47

GoogleCodeExporter commented 9 years ago

Implemented better solution from issue 189: fudge content types in 
HttpChannel::ProcessNormal(). Since about: is loaded from jar:, it will not get 
its content type changed, and is parsed as proper XHTML. Since the newsmax page 
loads from the network, its content type will be changed and it is parsed as 
HTML.
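
The idea can be sketched in Python as follows. This is not the Classilla code: the real change lives in C++ inside HttpChannel::ProcessNormal(), where only HTTP loads pass through at all. The function name `fudge_content_type`, the explicit scheme check, and the jar: URL below are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Schemes treated as "from the network" in this sketch (assumption).
NETWORK_SCHEMES = {"http", "https"}


def fudge_content_type(url, content_type):
    """Downgrade application/xhtml+xml to text/html for network loads only."""
    mime, sep, params = content_type.partition(";")
    from_network = urlsplit(url).scheme.lower() in NETWORK_SCHEMES
    if from_network and mime.strip().lower() == "application/xhtml+xml":
        # Keep any parameters (such as charset) exactly as they were sent.
        return "text/html" + sep + params
    return content_type


# Network load: the type is rewritten, so the page goes to the HTML parser.
print(fudge_content_type("http://www.newsmax.com/m",
                         "application/xhtml+xml; charset=utf-8"))

# Hypothetical jar: URL standing in for about:, left alone and parsed as XHTML.
print(fudge_content_type("jar:file:///hypothetical/path.jar!/about.xhtml",
                         "application/xhtml+xml"))
```

Doing the rewrite at the channel level rather than in the parser is what keeps internal XHTML documents strict while network pages get the lenient path.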

Original comment by classi...@floodgap.com on 5 Mar 2012 at 12:37

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 19 Oct 2012 at 4:49

Changed state: Verified