erusev / parsedown

Better Markdown Parser in PHP
https://parsedown.org
MIT License
14.74k stars 1.12k forks source link

Enhancement/types refactor for 2.0 #685

Closed aidantwoods closed 5 years ago

aidantwoods commented 5 years ago

Split Parsedown's Blocks and Inlines into separable components backed by an interface—this should for a good basis for the new "extension" model for adding custom behaviours.

The general idea will be that "extensions" can implement their own Blocks and Inlines, as well as their own Configurables to allow the setter-like behaviour we have currently. The core Parsedown class now really just strings everything together, only really depending on the basic Paragraph and Plaintext components directly (we could remove the direct dependence if there is a good case for it though)—everything else can be swapped out by the user, so they could go so far as to pick and choose different Blocks and Inlines from various different sources if they so wish.

Perhaps another change worth mentioning is that parsing of particular components is now completely decoupled from the HTML generation. This means that there is no need (and you should not try!) to traverse the HTML structure produced by a particular component if you want to defer to it. Instead each component can present it's own API which makes sense uniquely to that component, and other code that wants to defer parsing can use that API—but not need to depend on the exact structure HTML that will be produced (which is really very subject to change!). For an example of this, inline Images defer some of their parsing to the Link component https://github.com/aidantwoods/parsedown/blob/8a48dcfe315bc7e37028626646291538dce24d10/src/Components/Inlines/Image.php#L45-L51 and then store that produced Link for later use, where the basic attributes are retrieved https://github.com/aidantwoods/parsedown/blob/8a48dcfe315bc7e37028626646291538dce24d10/src/Components/Inlines/Image.php#L62 and used to generate the image (without directly depending on the HTML that the Link ends up producing). This should hopefully allow a lot of reusability without binding so close to the internal behaviour that things like bugfixes can break dependents (as is the case at the moment).

Includes some general type safety related work (via static analysis).


Closes #440 Fixes #686 Closes #692 Fixes #104 Fixes #694

antonkomarev commented 5 years ago

It will be good to have an ability to parse markdown file and has complete structure of the document as objects without rendering. I need it to analyze data from the markdown (have done some workaround today from v1 and then found your PR).

aidantwoods commented 5 years ago

It will be good to have an ability to parse markdown file and has complete structure of the document as objects without rendering.

@antonkomarev 0398672 adds the ability to get the blocks prior to producing state renderables, but there isn't really a good way to traverse deeper without applying the state to the renderables to build the entire AST of renderables. This to say you can't quite traverse the markdown objects (Components), but you could traverse the object representation of Renderables prior to the HTML string being built if that's what you were after?

antonkomarev commented 5 years ago

I'm trying to traverse thru the markdown list to create models from each line.

Here is some pretty rough example code how it could work with your new implementation. I've just added parse method which returns Renderables.

class ListImporter
{
    private $parentId = null;

    public function __construct(string $raw)
    {
        $elements = (new Parsedown)->parse($raw);

        /** @var Element $element */
        foreach ($elements as $element) {
            if ($element->name() === 'ul') {
                $this->readRootList($element);
            }
        }
    }

    private function readRootList(Element $list) {
        foreach ($list->contents() as $listItem) {
            if ($listItem->name() === 'li') {
                $this->readListItem($listItem);
            }
            $this->parentId = null;
        }
    }

    private function readList(Element $list) {
        foreach ($list->contents() as $listItem) {
            if ($listItem->name() === 'li') {
                $this->readListItem($listItem);
            }
        }
    }

    private function readListItem(Element $listItem)
    {
        foreach ($listItem->contents() as $content) {
            if ($content instanceof Container) {
                $this->readContainer($content);
            } elseif ($content instanceof Element) {
                if ($content->name() === 'p') {
                    $this->readParagraph($content);
                }
                elseif ($content->name() === 'ul') {
                    $this->readList($content);
                }
            }
        }
    }

    private function readParagraph(Element $content)
    {
        foreach ($content->contents() as $node) {
            if ($node instanceof Container) {
                $this->readContainer($node);
            }
        }
    }

    private function readContainer(Container $container)
    {
        foreach ($container->contents() as $item) {
            $title = $item->getStringBacking();

            $listItem = ListItem::query()->create([
                'parent_id' => $this->parentId,
                'title' => $title,
            ]);

            if (is_null($this->parentId)) {
                $this->parentId = $listItem->getKey();
            }
        }
    }
}
antonkomarev commented 5 years ago

And instead of all this bunch of code I was wanted to write something like:

$elements = (new Parsedown)->parse($text);

foreach ($elements as $element) {
    if ($element instanceof UnorderedList) {
        $listItem = ListItem::create([
            'title' => $element->text(),
        ]);

        foreach ($element->items() as $item) {
            ListItem::create([
                'parent_id' => $listItem->getKey(),
                'title' => $item->text(),
            ]);
        }
    }
}

But I suppose it's not planned feature.

aidantwoods commented 5 years ago

I haven't paid too much attention to usability around traversing through elements yet, this is all still WIP :) perhaps you could open this as it's own issue and can discuss a bit more about this specifically there?

aidantwoods commented 5 years ago

cc @PhrozenByte if you'd be so kind as to review this :)

aidantwoods commented 5 years ago

@PhrozenByte not sure if you had any time to take a look at this yet? No worries if not :)

PhrozenByte commented 5 years ago

@PhrozenByte not sure if you had any time to take a look at this yet? No worries if not :)

Oh, sorry, completely missed your review request :disappointed: Will look into this as soon as possible, hope I'll find some time on Saturday.

aidantwoods commented 5 years ago

Remaining progress:

I'm currently taking a look at implementing ParsedownExtra's features in terms of this new codebase—once I can do that I think this'll be pretty much ready to go (sans any fixes/concerns). I think I'll need to add something here to cope with abbreviation expansions (since they don't have a strict marker as such), my current line of thought for this a some concept of unmarked inlines can can be parsed before text is consumed as plaintext.

There is probably some room for improvement on the niceties of importing behaviour from an "extension" or many.

I'm not yet decided whether ParsedownExtra's features should live in core gated under optional non-default behaviour (as alluded to in https://github.com/erusev/parsedown/issues/266#issuecomment-399073269), or whether it should remain a separate extension. On the one hand, it's nice to have a single code-base (and people always seem to report ParsedownExtra issues here anyway), though it's also nice to have a reference extension.

aidantwoods commented 5 years ago

I'm going to merge this into the 2.0.x branch so further changes to this can be made in dedicated PRs (after over 200 commits here things are getting a little lost 😄). I'll open a separate PR to merge 2.0.x into master when changes are finalised. Please feel free to still add review comments to this PR though :)