yannham closed 7 years ago
Markup.ml drops frame elements in body. This behavior is compliant with the HTML5 specification (search for the text "frame", including quotes, in that section; sorry, the link is the closest anchor I could find):
8.2.5.4.7 The "in body" insertion mode
-> A start tag whose tag name is one of: "caption", "col", "colgroup", "frame", "head",
"tbody", "td", "tfoot", "th", "thead", "tr"
Parse error. Ignore the token.
I inspected the DOM in Chrome, and Chrome likewise dropped the frame elements from both the body and the table.
Do you mean iframe? frame is only allowed inside frameset. AFAIK there is also no frames tag in HTML.
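To make the body/frameset distinction concrete, here is a minimal sketch, assuming the lambdasoup package is installed. The helper name has_element and both HTML strings are made up for illustration; they are not from the test page in this issue. It uses $? rather than $, so a missing element gives None instead of raising.

```ocaml
(* Requires the lambdasoup package. Both documents are made-up test inputs. *)
open Soup

(* [has_element html selector] parses [html] in HTML mode and reports
   whether [selector] matches anything. [$?] returns an option, so a
   missing element does not raise Failure the way [$] does. *)
let has_element html selector =
  match parse html $? selector with
  | Some _ -> true
  | None -> false

(* A frame placed directly in body: the HTML5 rules quoted above say
   the token is ignored. *)
let in_body =
  "<html><head><title>t</title></head>\
   <body><frame src=\"a.html\"><p>x</p></body></html>"

(* A frame placed inside frameset, where it is allowed. *)
let in_frameset =
  "<html><head><title>t</title></head>\
   <frameset><frame src=\"a.html\"></frameset></html>"

let () =
  Printf.printf "frame in body: %b\n" (has_element in_body "frame");
  Printf.printf "frame in frameset: %b\n" (has_element in_frameset "frame")
```

If the parser behaves as described in this thread, the first check should come back false and the second true.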
If you will be parsing bad HTML, you might want to do this for easier debugging:
let report location error =
  prerr_endline (Markup.Error.to_string ~location error) in
let page =
  Markup.(file "index.html" |> fst |> parse_html ~report |> signals)
  |> Soup.from_signals
in
ignore (page $ "frame")
(* ... *)
This shows the errors:
line 45, column 9: misnested tag: 'frame' in 'table'
line 50, column 5: misnested tag: 'frame' in 'body'
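If you would rather collect the errors than print them as they occur, the same ~report hook can append to a list. This is only a sketch, assuming the markup and lambdasoup packages; parse_with_errors is a hypothetical helper name, and the input string is a made-up example rather than the page from this issue.

```ocaml
(* Requires the markup and lambdasoup packages.
   [parse_with_errors] is a hypothetical helper, not part of either library. *)
let parse_with_errors html =
  let errors = ref [] in
  (* Called by Markup.ml for every parse error it recovers from. *)
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  let page =
    Markup.string html
    |> Markup.parse_html ~report
    |> Markup.signals
    |> Soup.from_signals
  in
  (* [Soup.from_signals] consumes the whole stream, so every error has
     been reported by this point. Reverse to restore document order. *)
  (page, List.rev !errors)

let () =
  let _page, errors =
    parse_with_errors "<html><body><frame src=\"a.html\"></body></html>"
  in
  List.iter prerr_endline errors
```

Returning the messages alongside the page makes it easy to log them or to fail a test when the list is non-empty.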
Indeed, showing Markup.ml errors reveals several other problems with the markup:
- title element should be inside head (outside head, it silently creates a head element, and then there is an explicit head element, which is an error).
- img or frames tag at the top level of a table.
If you want to disregard HTML rules, by the way, you may be able to get by with parsing as XML:
let page =
  Markup.(file "index.html" |> fst |> parse_xml |> signals)
  |> Soup.from_signals
in
But in HTML mode, what gets parsed corresponds to what browsers accept and users actually see (modulo any bugs lurking in Markup.ml).
I see, it makes sense now! I don't know why I assumed it would be parsed as XML by default. Thanks for the answer.
Sure. Amendment: to parse HTML as XML you should probably translate HTML entities as well:
Markup.(parse_xml ~entity:xhtml_entity)
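Putting the XML route together with the entity translation, a minimal sketch might look like the following. It assumes the markup and lambdasoup packages; parse_as_xml is a hypothetical helper name and the input string is a made-up example. Note that XML mode expects well-formed input, so the frame element is written as self-closing here.

```ocaml
(* Requires the markup and lambdasoup packages.
   [parse_as_xml] is a hypothetical helper, not part of either library. *)
let parse_as_xml text =
  Markup.string text
  (* [xhtml_entity] lets the XML parser resolve names such as &eacute;. *)
  |> Markup.parse_xml ~entity:Markup.xhtml_entity
  |> Markup.signals
  |> Soup.from_signals

let () =
  let open Soup in
  let page =
    parse_as_xml
      "<html><body><frame src=\"a.html\"/><p>caf&eacute;</p></body></html>"
  in
  match page $? "frame" with
  | Some frame ->
    (* [attribute] returns an option; fall back to a placeholder. *)
    let src = Option.value (attribute "src" frame) ~default:"(no src)" in
    print_endline ("frame found, src = " ^ src)
  | None -> print_endline "frame not found"
```

Because no HTML insertion-mode rules apply in XML mode, the frame element is expected to stay in the tree wherever it appears.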
Oh, and if you want a simple command-line tool for showing Markup.ml errors (i.e. everything in the syntax section of the HTML spec), @fxfactorial made valentine for this purpose a while back :)
Seems cool, I'll take a look!
I ran into a strange problem when trying to write a small scraping library with lambdasoup. On my simple HTML test file, lambdasoup doesn't seem to be able to select the frame element. The page seems to be at least valid XML (though I may not respect some HTML markup usage constraints).
let page = Soup.read_file "index.html" |> Soup.parse;;
page $ "frame";;
gives, in utop:
Exception: Failure "Soup.($): cannot select 'frame'".
while selecting anything else, like img, form, frameS, ul, li, div, etc., works fine. I'm using OCaml 4.03.0 with lambdasoup 0.6.1. You can find my test page here: yago.gb2n.org/test-lambdasoup.html
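When investigating a report like this one, it can help to count matches for several selectors at once instead of hitting the exception from $. The sketch below assumes the lambdasoup package; count_matches is a hypothetical helper and the HTML string is a made-up stand-in for the linked test page, which is not reproduced here.

```ocaml
(* Requires the lambdasoup package. The document is a made-up example. *)
open Soup

(* [$$] selects every match and never raises; [count] gives how many. *)
let count_matches page selector =
  page $$ selector |> count

let () =
  let page =
    parse
      "<html><body><frame src=\"a.html\"><img src=\"b.png\">\
       <ul><li>x</li></ul></body></html>"
  in
  List.iter
    (fun selector ->
       Printf.printf "%s: %d\n" selector (count_matches page selector))
    ["frame"; "img"; "ul"; "li"]
```

A count of zero for one selector alongside non-zero counts for the others points at the parser having dropped that element, rather than at a problem with the selector itself.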