erusev / parsedown

Better Markdown Parser in PHP
https://parsedown.org
MIT License

Document traversing #694

Closed antonkomarev closed 5 years ago

antonkomarev commented 5 years ago

This is a description of a possible feature that was initially described in #685.

I'm trying to traverse a markdown list to create a model from each list item, but there is no easy way to do so. I wanted to write something like this pseudo-code:

$elements = (new Parsedown)->parse($text);

foreach ($elements as $element) {
    if ($element instanceof UnorderedList) {
        foreach ($element->items() as $item) {
            ListItem::create([
                'text' => $item->text(),
            ]);
        }
    }
}

This feature will certainly have a lot of edge cases.

aidantwoods commented 5 years ago

At the moment Element and any sub-types we might create (e.g. UnorderedList) all implement Renderable. In this context things are quite general, so there isn't really any reason why a ul must contain no text itself, for example. This is especially so as 2.0 looks to welcome combining new blocks and inlines (Components) from various sources.

That said, 2.0 also decouples rendering from parsing, so the exact types of markdown object that were parsed are known in a structured format. I think it would be a better goal to traverse the Component structure rather than the eventual Renderables: you could then use the API that each particular markdown object presents in order to get information about it. The Renderables really only serve to provide an object representation of HTML (and so have to remain quite general), whereas the Component structure could provide an extremely specific API, since it knows a lot more about what it is designed to produce. What do you think?

To compare, here are the currently defined Components: https://github.com/aidantwoods/parsedown/tree/enhancement/types/src/Components (see the Blocks and Inlines folders it contains), whereas here are the currently defined Renderables: https://github.com/aidantwoods/parsedown/tree/enhancement/types/src/Html/Renderables, which are far fewer in number (intentionally, as these provide fairly general but distinct rendering capabilities).

At the moment it isn't possible to see "deeper" than the top layer of Components, but perhaps this can be opened up if there is interest?

antonkomarev commented 5 years ago

You are right, it would be better to traverse Components rather than Renderables, if only because Renderables are designed for rendering and not for finding things.

aidantwoods commented 5 years ago

Some thoughts on this:

Because different Components have (potentially wildly) different ideas about what their "Contents" look like, I'm not sure there's much to gain from requiring them to define their contents as part of the overarching interface. For example, for an Emphasis it's just a collection of Inlines, but for a Table it's a collection of header cells (which are themselves collections of Inlines) plus a series of rows, each of which is a collection of cells, each of which is a collection of Inlines.
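To make the difference in shape concrete, here is a minimal sketch using hypothetical stand-in classes. None of these are Parsedown's actual Components; the names `PlainText`, `inlines()`, `headerCells()`, `rows()` and `describe()` are assumptions made up for illustration. It only shows that there is no single return type a shared "contents" method could sensibly have.

```php
<?php
// Hypothetical stand-ins, NOT Parsedown's real classes. They only illustrate
// how differently shaped the "contents" of two components can be.

interface Component {}

final class PlainText implements Component
{
    private $text;
    public function __construct(string $text) { $this->text = $text; }
    public function text(): string { return $this->text; }
}

final class Emphasis implements Component
{
    private $inlines;
    public function __construct(array $inlines) { $this->inlines = $inlines; }
    /** @return Component[] a flat list of inlines */
    public function inlines(): array { return $this->inlines; }
}

final class Table implements Component
{
    private $headerCells;
    private $rows;
    public function __construct(array $headerCells, array $rows)
    {
        $this->headerCells = $headerCells;
        $this->rows = $rows;
    }
    /** @return Component[][] one list of inlines per header cell */
    public function headerCells(): array { return $this->headerCells; }
    /** @return Component[][][] rows, then cells, then inlines */
    public function rows(): array { return $this->rows; }
}

// A consumer has to check the concrete type and use that type's own accessors.
function describe(Component $c): string
{
    if ($c instanceof Emphasis) {
        return 'emphasis with ' . \count($c->inlines()) . ' inline(s)';
    }
    if ($c instanceof Table) {
        return 'table with ' . \count($c->headerCells()) . ' column(s) and '
            . \count($c->rows()) . ' row(s)';
    }
    return 'component without exposed contents';
}

$emphasis = new Emphasis([new PlainText('hi')]);
$table = new Table(
    [[new PlainText('a')], [new PlainText('b')]],
    [[[new PlainText('1')], [new PlainText('2')]]]
);
```

Here `describe($emphasis)` and `describe($table)` each go through a different accessor, and `PlainText` exposes no nested contents at all.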

So I think it probably makes most sense to have the ability to expose the contents as a kind of optional "feature" that is part of the public API for the specific Component. Here's how we might expose the items inside a list:

diff --git a/src/Components/Blocks/TList.php b/src/Components/Blocks/TList.php
index 25d17ab..91e0163 100644
--- a/src/Components/Blocks/TList.php
+++ b/src/Components/Blocks/TList.php
@@ -272,6 +272,20 @@ final class TList implements ContinuableBlock
         return null;
     }

+    /**
+     * @return array{0: Block[], 1: State}[]
+     */
+    public function items(State $State)
+    {
+        return \array_map(
+            /** @return array{0: Block[], 1: State} */
+            function (Lines $Lines) use ($State) {
+                return Parsedown::blocks($Lines, $State);
+            },
+            $this->Lis
+        );
+    }
+
     /**
      * @return Handler<Element>
      */
@@ -288,13 +302,14 @@ final class TList implements ContinuableBlock
                         : []
                     ),
                     \array_map(
-                        /** @return Element */
-                        function (Lines $Lines) use ($State) {
-                            list($StateRenderables, $State) = Parsedown::lines(
-                                $Lines,
-                                $State
-                            );
-
+                        /**
+                         * @param array{0: Block[], 1: State} $Item
+                         * @return Element
+                         * */
+                        function ($Item) {
+                            list($Blocks, $State) = $Item;
+
+                            $StateRenderables = Parsedown::stateRenderablesFrom($Blocks);
                             $Renderables = $State->applyTo($StateRenderables);

                             if (! $this->isLoose
@@ -309,7 +324,7 @@ final class TList implements ContinuableBlock

                             return new Element('li', [], $Renderables);
                         },
-                        $this->Lis
+                        $this->items($State)
                     )
                 );
             }

There's probably more information than you'd like to deal with here: namely passing in a State object, and receiving some back. This is needed because some things are State dependent at the parsing stage (e.g. reference links), and you receive State back because some things update state as parsing completes (e.g. reference definitions).

From your example, it looks like you want to get the "text" from inside a list item. While this makes sense in simple examples, markdown lists can in principle contain blocks (e.g. more lists, code blocks, block quotes, ...). If you wanted just text, though, you can utilise the State to remove everything other than simple Paragraph parsing, and then just merge the paragraphs together, e.g. roughly something like:

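if ($Block instanceof TList) {
    // no block types enabled, so item contents should parse as plain Paragraphs
    $Items = $Block->items(new State([new BlockTypes([], [])]));

    // each item is a [Block[], State] pair: keep the blocks and flatten them into one list
    $Paragraphs = \array_merge([], ...\array_map(function ($I) { return $I[0]; }, $Items));

    // merge paragraphs into single newline-separated string
    $text = \implode("\n", \array_map(function ($P) { return $P->text(); }, $Paragraphs));
}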

You could of course retain more complex objects though.
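For instance, to keep one string per list item (closer to the original request) rather than one merged string, something along these lines might work. This is only a sketch: `FakeParagraph`, `FakeState` and `textPerItem()` are hypothetical stand-ins so that it runs on its own, and it assumes the items arrive as the `[Block[], State]` pairs that `items()` returns in the diff above.

```php
<?php
// Hypothetical stand-ins for illustration only; in Parsedown 2.0 the pairs
// would come from TList::items() as proposed in the diff above.
final class FakeParagraph
{
    private $text;
    public function __construct(string $text) { $this->text = $text; }
    public function text(): string { return $this->text; }
}

final class FakeState {}

/**
 * @param array $items list of [blocks, state] pairs, one pair per list item
 * @return string[] one newline-joined string per list item
 */
function textPerItem(array $items): array
{
    return \array_map(
        function (array $item) {
            list($blocks) = $item; // the State half of the pair is not needed here
            $texts = [];
            foreach ($blocks as $block) {
                // Only blocks exposing text() contribute; anything else is skipped.
                if (\method_exists($block, 'text')) {
                    $texts[] = $block->text();
                }
            }
            return \implode("\n", $texts);
        },
        $items
    );
}

$items = [
    [[new FakeParagraph('first item')], new FakeState()],
    [[new FakeParagraph('second item'), new FakeParagraph('continued')], new FakeState()],
];
$result = textPerItem($items);
```

Each entry of `$result` could then be handed to something like the `ListItem::create(['text' => ...])` call from the original pseudo-code.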

aidantwoods commented 5 years ago

Commit 8449ebd4dbe71276d1d3ae9227f0c82d474928db expands the public API of components to include some info about their contents, plus metadata in certain circumstances (e.g. the infostring of a code block, the start number of a list, the heading level...). You can take a look at the tests added alongside it to see some examples of using each of these public APIs (though in general, of course, you shouldn't rely on the structural assumptions that the tests can get away with).