Parsedown: get all image links

erusev / parsedown

Better Markdown Parser in PHP

https://parsedown.org

MIT License

Parsedown: get all image links #726

Closed MarkMessa closed 4 years ago

MarkMessa commented 4 years ago

Is it possible to get all image links parsed by Parsedown?
I'm considering something like:

$Parsedown = new Parsedown();
$file = file_get_contents('filename.txt');
echo $Parsedown->text($file);

# output
image1.png
image2.png

filename.txt

![][image1]
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam porttitor nulla id luctus hendrerit.

![](image2.png)
Integer sed ultricies ante, sed mattis mauris. Donec et nisl sapien. 

[image1]: image1.png

taufik-nurrohman commented 4 years ago

Hook into the inlineImage method and capture every src value in a public property:

class ParsedownGetImageSrc extends Parsedown {
    public $imageSrcData = [];
    public function inlineImage($Excerpt) {
        if ($Inline = parent::inlineImage($Excerpt)) {
            if (isset($Inline['element']['attributes']['src'])) {
                $this->imageSrcData[] = $Inline['element']['attributes']['src'];
            }
        }
        return $Inline;
    }
}

$parser = new ParsedownGetImageSrc;
$text = $parser->text(' ... ');

# All image `src` data now stored in `imageSrcData`
echo json_encode($parser->imageSrcData);

MarkMessa commented 4 years ago

OK, this seems to work fine. Thanks!

php > require 'Parsedown.php';
php > require 'ParsedownGetImageSrc.php';
php > $parser = new ParsedownGetImageSrc;
php > 
php > $parser->text('Lorem ![](filename1.ext) ipsum.
php ' Dolor ![][image] sit amet.
php ' [image]: filename2.ext');
php > 
php > echo json_encode($parser->imageSrcData);
["filename1.ext","filename2.ext"]

MarkMessa commented 4 years ago

@tovic

Since your extension carries the overhead of running the whole of Parsedown, I was thinking of a lighter alternative such as regex:

\!\[.*\]\((\S+)\s*.*\) to match ![alt](filename.ext 'title')
\[.+\]\:\s(\S+)(?:\s".*")? to match [image1]: image1.png "some title"

Any comment?

taufik-nurrohman commented 4 years ago

Your regex will fail on cases like these:

![a](b)

aaa ![a](b) bbbb

    ![a](b)

~~~
![a](b)
~~~

aaa `![a](b)` bbb
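
To make the failure concrete, here is a minimal sketch that runs the inline-image regex proposed above over input similar to these cases. The sample file names are hypothetical and only serve to tell the matches apart; only the first line is a real image, the other three sit inside code.

```php
<?php
// Inline-image regex as proposed earlier in the thread, wrapped in PHP delimiters.
$pattern = '/\!\[.*\]\((\S+)\s*.*\)/';

// Hypothetical sample input: one real image, then three code contexts.
$markdown = implode("\n", [
    'aaa ![a](real.png) bbbb',      // real inline image
    '',
    '    ![a](indented.png)',       // indented code block
    '',
    '~~~',
    '![a](fenced.png)',             // fenced code block
    '~~~',
    '',
    'aaa `![a](span.png)` bbb',     // code span
]);

preg_match_all($pattern, $markdown, $matches);
$found = $matches[1];

// The regex has no notion of code context, so all four are reported.
print_r($found);
```

The regex returns all four names, although a Markdown parser would treat only `real.png` as an image.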

MarkMessa commented 4 years ago

It also fails with escaped references:

![a](b)

![c](d)

\![a](b)

~~~
![a](b)
~~~

`![a](b)`

Any idea how to fix that?

taufik-nurrohman commented 4 years ago

Not possible without parsing it. The other solution is to parse the Markdown syntax to HTML and search for <img> tags with DOMDocument and such. That way you don’t need to extend the Parsedown class.
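
A rough sketch of that DOMDocument approach follows. The helper name `extractImageSources` is hypothetical, and the `$html` string is a stand-in for what `(new Parsedown)->text($markdown)` might return, so the example runs without Parsedown installed.

```php
<?php
// Hypothetical helper (not part of Parsedown): collect the src of every <img>
// element found in an HTML fragment.
function extractImageSources(string $html): array
{
    $sources = [];
    if (trim($html) === '') {
        return $sources; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $sources[] = $img->getAttribute('src');
        }
    }
    return $sources;
}

// Stand-in for rendered Markdown; the code span stays plain text.
$html = '<p><img src="image1.png" alt="" /> Lorem <code>![a](b)</code></p>'
      . '<p><img src="image2.png" alt="" /></p>';

print_r(extractImageSources($html));
```

Because the code span is already rendered as text inside `<code>`, only the two real images are returned.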

MarkMessa commented 4 years ago

> Not possible without parsing it.

Parsing the document against the full Parsedown syntax to get just the image links is somewhat inefficient.

> The other solution is to parse the Markdown syntax to HTML and search for <img> tags with DOMDocument and such.

Again, this seems inefficient. There is a lot of overhead in creating a full HTML version and then searching it for tags. It would be better to search for image links directly in the Markdown source.

taufik-nurrohman commented 4 years ago

Then just match every image URL. You should be able to find a suitable pattern somewhere on the internet, for example:

/^https?:\/\/\S+\.(?:gif|jpe?g|png|svg)$/
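
As a quick check of what this pattern accepts, here is a small sketch that assumes it is applied to one candidate string at a time (the `^` and `$` anchors mean it cannot find a URL embedded in a longer line). The candidate strings are hypothetical.

```php
<?php
$pattern = '/^https?:\/\/\S+\.(?:gif|jpe?g|png|svg)$/';

// Hypothetical candidate strings, each tested on its own.
$candidates = [
    'https://example.com/photo.jpeg',
    'http://example.com/icons/logo.svg',
    'https://example.com/page.html', // URL, but not an image extension
    'image1.png',                    // local path, no scheme
];

$results = [];
foreach ($candidates as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate) === 1;
}

var_dump($results);
```

Only the first two candidates match; the local path is rejected because it has no `http(s)://` scheme.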

MarkMessa commented 4 years ago

This way you will get the URLs from <img>, but also from <a>, which is not what I want.
Besides, it will fail in the following cases:

# local file path instead of url
![title](filename.ext)

# escaped reference
\![a](b)

# code block
~~~
![a](b)
~~~

# code span
`![a](b)`

Note: The current accepted answer is already fine for me. The overhead is just an observation, not a real bottleneck.