danneu / html-parser

a lenient html5 parser written in Elm
MIT License

DOCTYPE should not be parsed as Text #9

Open simonh1000 opened 1 year ago

simonh1000 commented 1 year ago

I got this string from a rich-text paste from the macOS TextEdit app. It's basically a full HTML page. The problem is that the very first line gets converted to Text "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">\n"

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<meta name="Generator" content="Cocoa HTML Writer">
<meta name="CocoaVersion" content="2299.7">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 18.0px Helvetica}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica}
p.p3 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px}
</style>
</head>
<body>
<p class="p1">This is a title</p>
<p class="p2">This is some text</p>
</body>
</html>

danneu commented 1 year ago

Thank you for reporting this!

Yes, this doctype format is definitely missing from the parser: https://github.com/danneu/html-parser/blob/45e2e52f9d459047ad1f86a062a2127563073dfe/src/Html/Parser.elm#L162-L198

Not sure how I missed that, but it does need to be fixed.
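
For illustration, a more lenient doctype parser could capture whatever sits between `<!DOCTYPE` and the closing `>` instead of matching one fixed format. The sketch below is an assumption, not the library's actual code: it assumes the `elm/parser` package, and `doctypeContents` is a hypothetical name. It only recognises the all-uppercase and all-lowercase spellings of the keyword, so mixed-case input would still need extra handling.

```elm
module DoctypeSketch exposing (doctypeContents)

import Parser exposing ((|.), (|=), Parser)


{-| Hypothetical helper, not part of html-parser: captures the text between
"<!DOCTYPE" (or "<!doctype") and the closing ">", trimmed of whitespace.
-}
doctypeContents : Parser String
doctypeContents =
    Parser.succeed String.trim
        -- Accept only two spellings of the keyword; other casings fail.
        |. Parser.oneOf
            [ Parser.symbol "<!DOCTYPE"
            , Parser.symbol "<!doctype"
            ]
        -- Take everything up to, but not including, the closing ">".
        |= Parser.getChompedString (Parser.chompWhile (\c -> c /= '>'))
        |. Parser.symbol ">"
```

Run with `Parser.run doctypeContents` on the first line of the pasted document, this should return the contents starting at `html PUBLIC`, which a caller could then inspect rather than seeing the whole line as a Text node.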

danneu commented 1 year ago

Right now the parser expects an HTML5 doctype to exist and then simply consumes it. So if it were extended to support other doctypes (like HTML 4 and XHTML), you'd probably want to be able to inspect which doctype was found.

Maybe I can change:

type alias Document =
    { legacyCompat : Bool
    , root : Node
    }

to something like:

type alias Document =
    { doctype: Html5 | Html5Legacy | Other String
    , root : Node
    }

or even:

type alias Document =
    { doctype: Html5 | Html5Legacy | Html4 | Xhtml | Other String
    , root : Node
    }
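
Elm does not allow a union to be written inline in a record field, so the idea above would need a named custom type. The following is a minimal sketch under stated assumptions: `Doctype` and `classify` are hypothetical names, `Node` is a placeholder so the module compiles on its own, and the string checks (including treating `about:legacy-compat` as the legacy HTML5 doctype) are rough heuristics rather than a complete classification.

```elm
module DocumentSketch exposing (Doctype(..), Document, Node(..), classify)


type Doctype
    = Html5
    | Html5Legacy
    | Html4
    | Xhtml
    | Other String


{-| Placeholder only; the library's real Node type has more variants. -}
type Node
    = Text String


type alias Document =
    { doctype : Doctype
    , root : Node
    }


{-| Hypothetical: guess the doctype from the text found between
"<!DOCTYPE" and ">". Anything unrecognised is kept verbatim in Other.
-}
classify : String -> Doctype
classify raw =
    let
        lower =
            String.toLower (String.trim raw)
    in
    if lower == "html" then
        Html5

    else if String.contains "about:legacy-compat" lower then
        Html5Legacy

    else if String.contains "//dtd xhtml" lower then
        Xhtml

    else if String.contains "//dtd html 4" lower then
        Html4

    else
        Other raw
```

Keeping `Other String` as a catch-all means an unknown doctype is preserved for the caller to inspect instead of being dropped or turned into text.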