danneu / html-parser

a lenient html5 parser written in Elm
MIT License
3 stars 0 forks source link

elm-html-parser

A lenient html5 parser implemented with Elm.

A lenient alternative to hecrj/elm-html-parser.

Experimental: Also contains undocumented, unpublished, work-in-progress node tree traversal, query, and transformation in Loc.elm using a Zipper data-structure.

Usage

import Html.Parser

"<p class=greeting>hello <strong>world</strong></p>"
|> Html.Parser.run Html.Parser.allCharRefs
-- Ok
--     [ Element "p" [ ("class", "greeting") ]
--          [ Text "hello "
--          , Element "strong" [] [ Text "world" ]
--          ]
--     ]

Rendering:

Goals

Features / Quirks

Differences from existing packages

Currently, there is only one html parser published to Elm packages: hecrj/elm-html-parser.

@hecjr has said that following the html5 spec is a goal of their parser, so their parser is stricter by design and rejects invalid html5.

Development

git clone and npm install.

Technical notes

(pre-v2.0.1) Parsing text

Note: This talks about the text parser pre-v2.0.1. Scroll to the next subheader to read about what changed.

One source of parser complexity is text.

Text in lenient html is basically "anything that wasn't parsed by the other parsers."

This means that you can't have a simple parser like:

parser : Parser Node
parser =
    oneOf
        [ element
        , comment
        , text
        ]

Because how would you define the text parser that doesn't underconsume ("parse anything until '<'") nor overconsume?

The best way I can think of accomplishing this with elm/parser is to, inside a loop, try all of your other parsers and then, if they all fail, consume a single character before looping again.

Something like this:

parser : Parser (List Node)
parser =
    loop [] <|
        \acc ->
            oneOf
                [ element |> map (\node -> Loop (node :: acc))
                , comment |> map (\node -> Loop (node :: acc))
                , chompIf (\_ -> True)
                    |> map (Text << String.fromChar)
                    |> map (\node -> Loop (node :: acc))
                , succeed ()
                    |> map (\_ -> (Done (List.reverse acc)))
                ]

It's not nice and simple anymore. And since it's not possible to make an exhaustive text parser, I've had to repeat this kind of logic in various places.

(v2.0.1) Parsing text

The text parser was changed from v2.0.0 to v2.0.1 to be stand-alone meaning that if you apply the text parser, it will return a text node that consumes text up until the next non-text node can be parsed, which is pretty useful and DRYs up some text parsing.

The basic idea is to use lookahead like this:

text : Parser Node
text =
    loop "" <| \acc ->
        oneOf
            [ -- Positive cases for text
              -- We want to be as greedy as possible here to avoid lookahead (slow).
              succeed (\s -> Loop (acc ++ s))
                |= characterReference
            , chompOneOrMore (\c -> c /= '<' && c /= '&')
                |> getChompedString
                |> map (\s -> Loop (acc ++ s))

            -- Now we use lookahead to enumerate negative cases.
            -- Notice how the "match until '<'" case above tries
            -- to avoid ever reaching here.
            , succeed (Done acc)
                |. lookAhead (openTag cfg) -- Starts with "<"
            , succeed (Done acc)
                |. lookAhead anyCloseTag -- Starts with "<"
            , succeed (Done acc)
                |. lookAhead comment -- Starts with "<"

            -- Finally, if we get here, we always consume one character.
            , succeed (\s -> Loop (acc ++ s))
                |= justOneChar
            , succeed (Done acc)
            ]

Between v2.0.1 -> v2.0.2, I realized how important it is to enumerate positive cases so that lookahead is minimized. For example, adding the chompOneOrMore bit above improved my benchmark 700x.

Special thanks