aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License
383 stars 31 forks source link

Doctypes other than HTML5 aren't emitted correctly #37

Closed dmbaturin closed 4 years ago

dmbaturin commented 4 years ago

I'm sorry for not giving #32 proper testing!

Your implementation only works for the trivial HTML5 doctype. This makes it impossible to keep the real original doctype for any legacy pages.

Example:

$ cat templates/main.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
  <head>
...

$ utop
utop # #require "lambdasoup";;
utop # Soup.read_file "templates/main.html" |> Soup.parse |> Soup.pretty_print;;
- : string =
"<!DOCTYPE html><html lang=\"en\">\n <head>\n  
aantron commented 4 years ago

This is because Markup's HTML serializer follows the spec here, which allows these doctypes only:

That's obviously too strict. I think this is best solved by actually stripping off the doctype if it is present, rather than sending it to Markup, and outputting it directly from Lambda Soup. If someone later wants legacy doctype support in HTML in Markup, we can move the support there and make Lambda Soup use that, as an internal change.

aantron commented 4 years ago

Also note that XHTML should probably be parsed using the XML parser in Markup, and serialized with the XML serializer. That should have precluded most of these issues, which arise from parsing XML with an HTML5 parser and serializing it with an HTML5 serializer.

I understand that the XML parser might not be as convenient, because you would have to use Soup.from_signals and Soup.signals to plug it into Lambda Soup.

aantron commented 4 years ago

Given that, how would you like to proceed?

Are you facing a mixture of documents, some of which are HTML5 and some XML, and need some kind of easy way of Markup/Lambda Soup automatically figuring out which parser and serializer to use? I haven't considered that before, but it might be the right thing to do.

dmbaturin commented 4 years ago

It's an interesting question. It's not a problem I'm having personally, so it's mostly a theoretical issue. I suppose maybe we should leave it as is until a person who actually has that problem comes along.

aantron commented 4 years ago

Can you test again with opam pin add markup --dev-repo?

I modified Markup's HTML writer to output whatever doctype is there, instead of trying to output only HTML5 doctypes. This seems reasonable, given that there is Markup.html5 already available for the user to force an HTML5 doctype, so we should trust whatever doctype is present in the signal stream.

I also added a test to Lambda Soup for round-tripping the doctype from this issue. It passes with Markup master and fails with Markup 1.0.0 release.

See the two commits linked above if curious about the code.

If this works for you, I will release it in Markup 1.1.0. Lambda Soup won't need a release.

Note that Markup master now requires Dune 2.7.0, due to its dune files using the syntax for the new Bisect_ppx integration. If that's too high a constraint, I can downgrade it. I was hoping that the next release of Markup wouldn't be so soon, and there would be time for Dune 2.7.0 to become pretty old and established :)

dmbaturin commented 4 years ago

I've just tested it and it seems to work fine. Thanks for the fix!

I actually have both markup and lambdasoup dependencies specified explicitly in soupault.opam due to some previous situation, so I don't mind. I don't mind dune 2.7.0 either.

aantron commented 4 years ago

I'll give it a day in case anything else comes to mind, as the doctype is proving to be a rich source of problems for Lambda Soup and Markup.ml :P

dmbaturin commented 4 years ago

Well, I'm working on some other improvements for soupault 2.1.0 too, I wasn't planning to release it right today either. ;)

One visual oddity: traditionally people put a newline between the doctype and the <html> tag when writing HTML by hand, but markup.ml doesn't. That's purely cosmetic of course.

aantron commented 4 years ago

Can you open an issue on Markup.ml about this and any other such things? I think we can fix this in the pretty printer (I am assuming based on the code at the beginning of this issue, that you are using it).

aantron commented 4 years ago

And ping me if/when you would like a release. I'll wait a bit to hopefully include several changes, and then do it.

dmbaturin commented 4 years ago

https://github.com/aantron/markup.ml/issues/58

I need to sit with the Markup.ml code to actually understand it so that I can contriubute in the future...

dmbaturin commented 4 years ago

@aantron I've completed my work on changes I want in soupault 2.1.0, so I'm ready to release it whenever the new Markup release is available.

aantron commented 4 years ago

I plan to do https://github.com/aantron/markup.ml/issues/59 and retag 1.0.0 and release it in the first half of the coming week.

dmbaturin commented 4 years ago

Yeah, I've seen that issue. Whenever the re-tagged markup 1.0.0 is available from the OPAM repos, I'll set the dependency to markup { >= 1.0.0 } and tag the release.