aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

<frame> markup seems not to be detected #9

Closed yannham closed 7 years ago

yannham commented 7 years ago

I ran into a strange problem while writing a small scraping library with lambdasoup. On my simple HTML test file, lambdasoup doesn't seem to be able to select the frame element. The page appears to be at least valid XML, although it may violate some HTML rules about where elements are allowed.

let page = Soup.read_file "index.html" |> Soup.parse;;
page $ "frame";;

In utop, this raises:

Exception: Failure "Soup.($): cannot select 'frame'".

Selecting anything else (img, form, frameS, ul, li, div, etc.) works fine. I'm using OCaml 4.03.0 with lambdasoup 0.6.1. You can find my test page here: yago.gb2n.org/test-lambdasoup.html

aantron commented 7 years ago

Markup.ml drops frame elements in body. This behavior is compliant with the HTML5 specification (search for text "frame", including quotes, in that section – sorry, the link is the closest anchor I could find):

8.2.5.4.7 The "in body" insertion mode
-> A start tag whose tag name is one of: "caption", "col", "colgroup", "frame", "head",
   "tbody", "td", "tfoot", "th", "thead", "tr"
     Parse error. Ignore the token.

I inspected the DOM in Chrome, and Chrome likewise dropped the frame elements from both the body and the table.

Do you mean iframe? frame is only allowed inside frameset. AFAIK there is also no frames tag in HTML.
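
To make the distinction concrete, here is a minimal sketch that checks where a frame survives parsing. It assumes the lambdasoup package is installed; the helper has_element and the three test documents are hypothetical and are not the page from the report.

```ocaml
(* Requires the lambdasoup package (which depends on Markup.ml). *)

(* True if [selector] matches at least one element in the parsed [html].
   [has_element] is a hypothetical helper, not part of Lambda Soup. *)
let has_element selector html =
  match Soup.select_one selector (Soup.parse html) with
  | Some _ -> true
  | None -> false

(* Hypothetical test documents. *)
let frame_in_body =
  "<!DOCTYPE html><html><head></head>\
   <body><frame src=\"a.html\"></body></html>"

let frame_in_frameset =
  "<!DOCTYPE html><html><head></head>\
   <frameset><frame src=\"a.html\"></frameset></html>"

let iframe_in_body =
  "<!DOCTYPE html><html><head></head>\
   <body><iframe src=\"a.html\"></iframe></body></html>"

let () =
  List.iter
    (fun (label, selector, html) ->
      Printf.printf "%s: %b\n" label (has_element selector html))
    [ ("frame in body", "frame", frame_in_body);
      ("frame in frameset", "frame", frame_in_frameset);
      ("iframe in body", "iframe", iframe_in_body) ]
```

Using Soup.select_one rather than ($) avoids the Failure exception when nothing matches, which is convenient when the markup is not trusted.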

If you will be parsing bad HTML, you might want to do this for easier debugging:

let report location error =
  prerr_endline (Markup.Error.to_string ~location error) in
let page =
  Markup.(file "index.html" |> fst |> parse_html ~report |> signals)
  |> Soup.from_signals
in
ignore (page $ "frame")
(* ... *)

This shows the errors:

line 45, column 9: misnested tag: 'frame' in 'table'
line 50, column 5: misnested tag: 'frame' in 'body'
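
The same idea can be wrapped in a small function that collects the errors instead of printing them, so they can be inspected alongside the parsed page. This is only a sketch: parse_with_errors is a hypothetical name, the input is an inline string rather than index.html, and it assumes the lambdasoup and markup packages are available.

```ocaml
(* Requires the lambdasoup and markup packages. *)

(* Parses [html] in HTML mode and returns the soup together with every
   error Markup.ml reported, formatted as text, in the order reported.
   [parse_with_errors] is a hypothetical helper name. *)
let parse_with_errors html =
  let errors = ref [] in
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  let soup =
    Markup.string html
    |> Markup.parse_html ~report
    |> Markup.signals
    |> Soup.from_signals
  in
  (* Building the soup consumes the whole signal stream, so all errors
     have been reported by the time we read the list. *)
  (soup, List.rev !errors)

let () =
  (* Hypothetical input with a frame placed directly in body. *)
  let html =
    "<!DOCTYPE html><html><head><title>t</title></head>\
     <body><frame src=\"a.html\"><p>hi</p></body></html>"
  in
  let soup, errors = parse_with_errors html in
  List.iter prerr_endline errors;
  match Soup.select_one "frame" soup with
  | Some _ -> print_endline "frame kept"
  | None -> print_endline "frame dropped"
```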

Indeed, showing Markup.ml errors reveals several other problems with the markup:

aantron commented 7 years ago

By the way, if you want to disregard HTML rules, you may be able to get by with parsing as XML:

let page =
  Markup.(file "index.html" |> fst |> parse_xml |> signals)
  |> Soup.from_signals
in
(* ... *)

But in HTML mode, what gets parsed corresponds to what browsers accept and users actually see (modulo any bugs lurking in Markup.ml).

yannham commented 7 years ago

I see, it makes sense now! I don't know why I assumed it would be parsed as XML by default. Thanks for the answer.

aantron commented 7 years ago

Sure. Amendment: to parse HTML as XML you should probably translate HTML entities as well:

Markup.(parse_xml ~entity:xhtml_entity)
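
Put together, the XML route might look like the sketch below. It assumes the lambdasoup and markup packages; parse_as_xml is a hypothetical helper name, and the inline document is a made-up, well-formed stand-in for the test page (XML mode needs a single root element and closed tags).

```ocaml
(* Requires the lambdasoup and markup packages. *)

(* Parses [text] as XML, translating HTML named entities such as &nbsp;.
   [parse_as_xml] is a hypothetical helper name. *)
let parse_as_xml text =
  Markup.string text
  |> Markup.parse_xml ~entity:Markup.xhtml_entity
  |> Markup.signals
  |> Soup.from_signals

let () =
  (* Hypothetical input: a frame directly inside body, written as XML. *)
  let page =
    parse_as_xml
      "<html><body><frame src=\"a.html\"/><p>a&nbsp;b</p></body></html>"
  in
  match Soup.select_one "frame" page with
  | Some _ -> print_endline "frame found"
  | None -> print_endline "no frame element"
```

Since XML mode applies no HTML nesting rules, the frame is not dropped here, but the result no longer reflects what a browser would build from the same markup.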

aantron commented 7 years ago

Oh, and if you want a simple command-line tool for showing Markup.ml errors (i.e. everything in the syntax section of the HTML spec), @fxfactorial made valentine for this purpose a while back :)

yannham commented 7 years ago

Seems cool, I'll take a look!