Refactor interpreter into working over a separate HIR and AST

CosmicHorrorDev commented 11 months ago

Having been working with the interpreter more it's clear that the current design is pretty painful to work with in almost every way. It currently works by doing a pass over the html5ever tokenstream that stores a pretty minimal amount of state while computing the elements for output on the fly. In theory this is very efficient because we can stream our output and we retain a pretty minimal amount of past state, but it also translates to a soup of conditionals that are incredibly difficult to refactor without regressing other behavior.

This also expresses itself where most of the bugs for things lie in not properly keeping track of enough history. Take for instance this markdown snippet

1. First list item

    Some nested text

2. Second list item

This should render as follows:

First list item

Some nested text
Second list item

but the actual HTML for this is pretty complex

<ol>
    <li>
        <p>First list item</p>
        <p>Some nested text</p>
    </li>
    <li>
        <p>Second list item</p>
    </li>
</ol>

There's quite a lot involved in rendering this right: you've got to keep track of which list item you're in, the first element in the list item (<li>) gets the line with the list prefix while the rest get indented further on separate lines, etc. Unfortunately this shows as we do a pretty sloppy job of rendering most things that are nested even semi-complexly

I think the better way forward would be to first convert the raw HTML to an HIR (higher intermediate representation or HTML intermediate representation :grin:) which is just a nice full structured format describing the things we care about in the HTML doc (basically a more complete version of Element). Then from there we can more easily do the analysis we need to correctly render things above into our output AST which represents how thing are laid out in inlyne's eyes

kokoISnoTarget commented 7 months ago

So what things would we need to do? (Because this sounds like a major rewrite) And what would the HIR contain?

CosmicHorrorDev commented 7 months ago

So for starters the AST would likely remain a VecDeque<crate::Element>. Each element in the HIR would contain all of the crate::interpreter::html::Attrs that we recognize along with the inner runs of text and other elements. For something like

<p>In a paragraph <a href="https://example.org">https://example.org</a></p>

would get represented as something like

use crate::interpreter::html::{attr::Align, Attr, TagName};

enum HirOrText {
    Hir(Hir),
    Text(String),
}

vec!Hir {
    tag: TagName::Paragraph,
    attrs: vec![],
    contents: vec![
        HirOrText::Text("In a paragraph ".into()),
        HirOrText::Hir(Hir {
            tag: TagName::Anchor,
            attrs: vec![Attr::Href("https://example.org".into())],
            contents: vec![HirOrText::Text("https'//example.org".into())]
        }),
        ...
    ],
}

That would be the main change as the existing code for the AST can be re-written to run over the HIR without having to change much of the logic there (although it opens the door to more freely change things in the future)

nicoburns commented 6 months ago

You may be interested in the architecture of https://github.com/DioxusLabs/blitz which parses HTML into a dom-like structure before doing style then layout then rendering passes. It's a more ambitious project as we aim to actually render arbitrary HTML (including CSS but excluding JavaScript), but it might make sense for inlyne too.

(once Blitz is a bit more developed you could also straight-up build inlyne of it / help contribute if that's an approach you were interested in)

CosmicHorrorDev commented 6 months ago

@kokoISnoTarget is currently working on things with #318, so I'll leave the immediate direction of where things go up to them, but yes I am very interested in sharing more of the nitty gritty HTML bits

kokoISnoTarget commented 6 months ago

(once stylo-dioxus is a bit more developed you could also straight-up build inlyne of it / help contribute if that's an approach you were interested in)

It isn't ready, so nothing I would decide on now. But sounds promising.

nicoburns commented 5 months ago

We've just landed inline / inline-block support in Blitz (https://github.com/DioxusLabs/blitz). It's still quite early / immature, and we don't have feature parity with inlyne (missing things like text selection, table layout, and hot reloading) but it's now in a state that inlyne devs could look into using / contributing to / building on it if you were interested. This is a screenshot of blitz rendering it's own README (the markdown example in the repository):

We have a full HTML style/layout engine so this example is using comrak to compile the README.md file to HTML and applying a stylesheet from https://github.com/sindresorhus/github-markdown-css (theming is just CSS).

CosmicHorrorDev commented 5 months ago

Thanks for the update! Things are looking very nice

I think my only big concern with migrating to using something like blitz is that I don't know if there will be a nice layer of abstraction to describe the parts that inlyne specifically needs in terms of both 1) the low-level of control we need, and 2) the loss of control over a large chunk of our dependencies

Both of those are pretty crucial to trying to maintain being a lightweight markdown viewer (although that's still more of a future goal than the current state). Maybe it's better at this point for us to steal different things from each other as features land. It's also totally possible that blitz's development will far outpace ours and maybe in the future it'd be silly for us to not use it :shrug:

Inlyne-Project / inlyne

Refactor interpreter into working over a separate HIR and AST #175