egonSchiele / HandsomeSoup

Easy HTML parsing for Haskell
http://egonschiele.github.com/HandsomeSoup
BSD 3-Clause "New" or "Revised" License

Content missing for some websites. #24

Open bobjflong opened 10 years ago

bobjflong commented 10 years ago

Loading the following URL with fromUrl doesn't seem to return any actual content, just metadata (the HTTP headers). Other pages on that site seem fine. Any idea what's up?

λ let url = "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true"

λ runX $ fromUrl url
[NTree (XTag "/" [NTree (XAttr "http-Content-Length") [NTree (XText "386810") []],NTree (XAttr "http-Transfer-Encoding") [NTree (XText "chunked") []],NTree (XAttr "http-Set-Cookie") [NTree (XText "_session_id2=82884d397b7fcd985680433233ba3154; path=/; expires=Fri, 22-Aug-2014 04:20:14 GMT; HttpOnly") []],NTree (XAttr "http-X-Runtime") [NTree (XText "1.612029") []],NTree (XAttr "http-Cache-Control") [NTree (XText "max-age=0, private, must-revalidate") []],NTree (XAttr "http-ETag") [NTree (XText "\"d5ff33fa33ea6cd6c3f85076da8e4132\"") []],NTree (XAttr "http-X-UA-Compatible") [NTree (XText "IE=Edge,chrome=1") []],NTree (XAttr "http-X-Request-Id") [NTree (XText "0VR3CZ02NQRRFSJK9KT3") []],NTree (XAttr "http-Vary") [NTree (XText "User-Agent,Accept-Encoding") []],NTree (XAttr "http-Status") [NTree (XText "200 OK") []],NTree (XAttr "http-Content-Type") [NTree (XText "text/html; charset=utf-8") []],NTree (XAttr "transfer-Encoding") [NTree (XText "UTF-8") []],NTree (XAttr "transfer-MimeType") [NTree (XText "text/html") []],NTree (XAttr "http-Server") [NTree (XText "Server") []],NTree (XAttr "http-Date") [NTree (XText "Thu, 21 Aug 2014 22:20:14 GMT") []],NTree (XAttr "transfer-Version") [NTree (XText "HTTP/1.1") []],NTree (XAttr "transfer-Message") [NTree (XText "OK") []],NTree (XAttr "transfer-Status") [NTree (XText "200") []],NTree (XAttr "transfer-URI") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []],NTree (XAttr "source") [NTree (XText "http://www.goodreads.com/book/show/301538.The_Darkness_That_Comes_Before?from_search=true") []]]) []]

egonSchiele commented 10 years ago

That's weird. Those are all the HTTP headers, and the content length is 386810, which suggests the whole page is being sent. Not sure where the body of the response is going.
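
In the output above, the root node "/" carries the headers as XAttr entries and its child list is empty. A minimal sketch for checking that directly is to count the root's children. The helper name countRootChildren is made up for illustration, and it parses an in-memory string so it can be tried without a network; swapping `parseHtml html` for `fromUrl url` would be the case from this issue.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup (parseHtml)

-- Hypothetical helper: count the nodes directly under the document root ("/").
-- Zero children means no body was parsed, even if the root has attributes.
countRootChildren :: String -> IO Int
countRootChildren html = length <$> runX (parseHtml html >>> getChildren)
```

For example, in GHCi:

λ countRootChildren "<html><body><span>hi</span></body></html>"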

bobjflong commented 10 years ago

I tried a workaround like this, but it looks like something is wrong with parsing that document:

λ import Network.HTTP

λ html <- simpleHTTP (getRequest url) >>= getResponseBody
-- html looks correct

λ runX $ parseHtml html >>> css "span"
[]
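
The cause of the empty result isn't established in this thread. One way to narrow it down might be to parse the same string with HXT warnings switched on, so that anything the parser complains about gets reported instead of suppressed. This is only a sketch: parseVerbose and spanTexts are hypothetical names, and whether warnings reveal anything for this particular page is an assumption.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup (css)

-- Hypothetical helper: parse an in-memory HTML string with HXT's lenient
-- HTML parser, leaving warnings enabled so parse problems are reported.
parseVerbose :: String -> IOSArrow b XmlTree
parseVerbose = readString [withParseHTML yes, withWarnings yes]

-- Hypothetical helper: collect the text nodes found inside <span> elements.
spanTexts :: String -> IO [String]
spanTexts html = runX (parseVerbose html >>> css "span" //> getText)
```

Trying it first on a small known-good string, then on the downloaded html, would show whether the selector or the document is the problem:

λ spanTexts "<html><body><span>hi</span></body></html>"
λ spanTexts html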