Open simonh1000 opened 1 year ago
Thank you for reporting this!
Yes, this doctype format is definitely missing from the parser: https://github.com/danneu/html-parser/blob/45e2e52f9d459047ad1f86a062a2127563073dfe/src/Html/Parser.elm#L162-L198
Not sure how I missed that, but it does need to be fixed.
Right now the parser expects an HTML5 doctype to exist and then consumes it. So if it were extended to support other doctypes (like HTML 4 and XHTML), you'd probably want to be able to inspect what the doctype was.
Maybe I can change:
    type alias Document =
        { legacyCompat : Bool
        , root : Node
        }
to something like:
    type alias Document =
        { doctype : Html5 | Html5Legacy | Other String
        , root : Node
        }
or even:
    type alias Document =
        { doctype : Html5 | Html5Legacy | Html4 | Xhtml | Other String
        , root : Node
        }
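Note that Elm does not allow an inline union inside a record field, so the variants would need their own custom type. Here is a minimal sketch of how that might look. The names `Doctype` and `describe` and the module name are hypothetical, and `Node` is only a placeholder standing in for the parser's real node type:

```elm
module DoctypeSketch exposing (Doctype(..), Document, Node(..), describe)

-- Hypothetical custom type holding the doctype variants proposed above.
-- `Other` keeps the raw doctype string so callers can still inspect it.


type Doctype
    = Html5
    | Html5Legacy
    | Html4
    | Xhtml
    | Other String



-- Placeholder only: the parser's actual Node type has more variants.


type Node
    = Text String


type alias Document =
    { doctype : Doctype
    , root : Node
    }



-- Example of inspecting the doctype after parsing.


describe : Doctype -> String
describe doctype =
    case doctype of
        Html5 ->
            "HTML5"

        Html5Legacy ->
            "HTML5 (legacy-compat)"

        Html4 ->
            "HTML 4"

        Xhtml ->
            "XHTML"

        Other raw ->
            "other: " ++ raw
```

With a shape like this, the existing `legacyCompat : Bool` would map onto `Html5` versus `Html5Legacy`, and unrecognised doctypes would end up in `Other` rather than leaking into the node tree as text.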
I got this string from a rich text paste from the macOS TextEdit app. It's basically a full HTML page. The problem is that the very first line gets converted to:
Text "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">\n"