This is a good start, I think.
I feel like we could get rid of the multiple HTML parsers and use a simple regex to find images and return their URL and position in the buffer. Then download all the images and enumerate the buffer backwards, replacing each src with the base64-encoded data. That way we really only have to parse the HTML once.
We just get back a list of images with start and end points in the buffer.
Well, maybe not a simple regex, but it's not too hard; I've done it in other projects.
For reference, I actually have a base64 plugin for Python Markdown that replaces local image src attributes with embedded base64: https://github.com/facelessuser/pymdown-extensions/blob/master/pymdownx/b64.py
It uses minimal effort to find images and target their source, and something like it could be leveraged to easily find the src attributes and return them with start and end points. I can probably code it up and post it here.
@rwols Here you go; it returns the src along with its start and end points:
```python
import re

IMAGES = ('.png', '.jpg', '.gif')
EXTERNAL = ('http://', 'https://')

RE_TAG_HTML = re.compile(
    r'''(?xus)
    (?:
        (?P<comments>(\r?\n?\s*)<!--[\s\S]*?-->(\s*)(?=\r?\n)|<!--[\s\S]*?-->)|
        (?P<open><(?P<tag>img))
        (?P<attr>(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*)
        (?P<close>\s*(?:\/?)>)
    )
    '''
)

RE_TAG_LINK_ATTR = re.compile(
    r'''(?xus)
    (?P<attr>
        (?:
            (?P<name>\s+src\s*=\s*)
            (?P<path>"[^"]*"|'[^']*')
        )
    )
    '''
)


def image_parser(text):
    """Retrieve image source attributes with external png, jpg, and gif sources."""

    images = []
    for m in RE_TAG_HTML.finditer(text):
        if m.group('comments'):
            # Skip anything matched inside an HTML comment.
            continue
        start = m.start('attr')
        m2 = RE_TAG_LINK_ATTR.search(m.group('attr'))
        if m2:
            # Strip the surrounding quotes from the src value.
            src = m2.group('path')[1:-1]
            if src[:8].lower().startswith(EXTERNAL) and src[-4:].lower().endswith(IMAGES):
                # Offsets of the URL (without quotes) in the original buffer.
                s = start + m2.start('path') + 1
                e = start + m2.end('path') - 1
                images.append((src, s, e))
    return images


if __name__ == "__main__":
    html = '<p><img src="http://www.fmwconcepts.com/misc_tests/FFT_tests/lena_roundtrip/lena.jpg"/></p>'
    print(image_parser(html))
```
Output:
[('http://www.fmwconcepts.com/misc_tests/FFT_tests/lena_roundtrip/lena.jpg', 13, 84)]
No HTML parsers needed. Then you can just slice at those points and rebuild your new string with the base64 data.
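Something like this rough sketch would do the splicing; `fetch_image_bytes` here is a hypothetical helper that downloads a URL and returns its raw bytes:

```python
import base64
import os

MIME_MAP = {'.png': 'image/png', '.jpg': 'image/jpeg', '.gif': 'image/gif'}


def embed_images(minihtml, images, fetch_image_bytes):
    """Splice base64 data URIs into the buffer at the spans found by image_parser.

    `images` is the (src, start, end) list returned by image_parser, and
    `fetch_image_bytes` is a hypothetical callable that returns the bytes for a
    URL. Working backwards keeps the earlier offsets valid while we splice.
    """
    for src, start, end in reversed(images):
        mime = MIME_MAP.get(os.path.splitext(src)[1].lower(), 'image/png')
        data = base64.b64encode(fetch_image_bytes(src)).decode('ascii')
        minihtml = minihtml[:start] + 'data:{};base64,{}'.format(mime, data) + minihtml[end:]
    return minihtml
```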
Forgot to filter out comments. Updated to do so.
Ended up factoring out all the HTML parsers. Since we know the positions, we just build up the new buffer with that knowledge. That removes a lot of the extra classes and keeps us from parsing the HTML multiple times.
```python
import base64
import functools
import re
import urllib.parse

import sublime

from . import frontmatter  # the plugin's local frontmatter helper module

IMAGES = ('.png', '.jpg', '.gif')
EXTERNAL = ('http://', 'https://')

RE_TAG_HTML = re.compile(
    r'''(?xus)
    (?:
        (?P<avoid>
            <\s*(?P<script_name>script|style)[^>]*>.*?</\s*(?P=script_name)\s*> |
            (?:(\r?\n?\s*)<!--[\s\S]*?-->(\s*)(?=\r?\n)|<!--[\s\S]*?-->)
        )|
        (?P<open><\s*(?P<tag>img))
        (?P<attr>(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*)
        (?P<close>\s*(?:\/?)>)
    )
    '''
)

RE_TAG_LINK_ATTR = re.compile(
    r'''(?xus)
    (?P<attr>
        (?:
            (?P<name>\s+src\s*=\s*)
            (?P<path>"[^"]*"|'[^']*')
        )
    )
    '''
)


def image_parser(text):
    """Retrieve image source attributes with external png, jpg, and gif sources."""

    images = []
    for m in RE_TAG_HTML.finditer(text):
        if m.group('avoid'):
            # Skip script/style blocks and HTML comments.
            continue
        start = m.start('attr')
        m2 = RE_TAG_LINK_ATTR.search(m.group('attr'))
        if m2:
            src = m2.group('path')[1:-1]
            if src[:8].lower().startswith(EXTERNAL) and src[-4:].lower().endswith(IMAGES):
                s = start + m2.start('path') + 1
                e = start + m2.end('path') - 1
                images.append((src, s, e))
    return images


class _ImageResolver:
    def __init__(self, minihtml, resolver, done_callback, images_to_resolve):
        self.minihtml = minihtml
        # Start with everything up to the first src value.
        self.finalhtml = [minihtml[:images_to_resolve[0][1]]]
        self.done_callback = done_callback
        self.images_to_resolve = images_to_resolve
        self.resolved = 0
        total_images = len(images_to_resolve) - 1
        for index, value in enumerate(images_to_resolve):
            url, start, end = value
            next_index = index + 1
            if next_index > total_images:
                next_start = len(minihtml)
            else:
                next_start = images_to_resolve[next_index][1]
            # Each resolved image appends its data URI plus the slice of the
            # original minihtml between this src and the next one.
            resolver(url, functools.partial(self.on_image_resolved, url, end, next_start))

    def on_image_resolved(self, url, start, end, data):
        path = urllib.parse.urlparse(url).path.lower()
        if path.endswith(".jpg") or path.endswith(".jpeg"):
            mime = "image/jpeg"
        elif path.endswith(".png"):
            mime = "image/png"
        elif path.endswith(".gif"):
            mime = "image/gif"
        else:
            # last effort
            mime = "image/png"
        self.finalhtml.append("data:{};base64,{}".format(mime, base64.b64encode(data).decode("ascii")))
        self.finalhtml.append(self.minihtml[start:end])
        self.resolved += 1
        if self.resolved == len(self.images_to_resolve):
            self.finalize()

    def finalize(self):
        self.done_callback(''.join(self.finalhtml))


def format_frontmatter(values):
    """Format values as frontmatter."""
    return frontmatter.dump_frontmatter(values)


def blocking_resolver(url, done):
    import urllib.request
    # HTTPResponse has no readall(); read() returns the whole body.
    done(urllib.request.urlopen(url).read())


def ui_thread_resolver(url, done):
    sublime.set_timeout(lambda: blocking_resolver(url, done))


def worker_thread_resolver(url, done):
    sublime.set_timeout_async(lambda: blocking_resolver(url, done))


def resolve_urls(minihtml, resolver, done_callback):
    """
    Given minihtml containing <img> tags with a src attribute that points to an image located on the internet,
    download those images and replace the src attribute with embedded base64-encoded image data.

    The first argument is minihtml as returned by the md2html function.

    The second argument is a callable that shall take two arguments:
    - The first argument is a URL to be downloaded.
    - The second argument is a callable that shall take one argument:
      - An object of type `bytes`: the raw image data, i.e. the result of downloading the image.

    The third argument is a callable that shall take one argument:
    - A string that is the final minihtml containing embedded base64-encoded images, ready to be presented to ST.

    This function is non-blocking.
    """
    images = image_parser(minihtml)
    if images:
        return _ImageResolver(minihtml, resolver, done_callback, images)
    else:
        done_callback(minihtml)
        return None
```
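For what it's worth, calling it from a plugin could look roughly like this sketch (the popup call is just illustrative, not part of this PR):

```python
import sublime


def preview(view, minihtml):
    # Resolve external images on the worker thread, then hand the final
    # minihtml (with embedded base64 data) to whatever wants to display it.
    resolve_urls(
        minihtml,
        worker_thread_resolver,
        lambda final_html: view.show_popup(final_html, max_width=800),
    )
```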
I'm not necessarily against using an HTML parser, but if we do use one I'd like to limit it to a single parse. We don't have to use one at all, though 🙂.
If there's more than one `img` tag that points to the same URL, only one such image is fetched.
I've also added basic exception handling.
Good call on not downloading the same image.
Note that there's no caching going on. That is entirely up to the caller of `resolve_urls` (by implementing some sort of smarter `resolver` callable).
I don't think it would hurt us to keep a simple dictionary in the resolver with the image address as the key and the image data as the value.
With that said, it doesn't hurt us not to, either 🤷.
It may be a "better" experience if we did add the ability.
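A minimal sketch of such a memoizing resolver, wrapping the existing `worker_thread_resolver` (the names and structure are just a suggestion):

```python
_image_cache = {}  # url -> raw image bytes


def caching_resolver(url, done):
    """Serve repeated requests for the same URL from memory."""
    data = _image_cache.get(url)
    if data is not None:
        done(data)
        return

    def store_and_done(data):
        _image_cache[url] = data
        done(data)

    worker_thread_resolver(url, store_and_done)
```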
Caching does help a lot for basic UX. Perhaps a last remaining question is whether we should protect against absurdly large images?
This is an area I have less experience in. By which I mean: how do you prevent downloading really large images if you don't know how big they are before downloading? There must be a way to negotiate such things, since canceling downloads is clearly possible in many applications; I've just never had to deal with this situation.
We can do it by reading the Content-Length header before reading the response body; I think that's sufficient.
Sounds good to me. We can set an arbitrary limit. We could allow it to be user configurable, but I'm also fine with a hard, reasonably permissive size. I'm open to suggestions.
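A sketch of what that could look like, assuming an arbitrary 5 MiB cap (the exact limit, and what to do when it is exceeded, are open questions):

```python
import urllib.request

MAX_IMAGE_SIZE = 5 * 1024 * 1024  # arbitrary; could be made user configurable


def size_limited_resolver(url, done):
    """Skip images whose reported size exceeds the cap.

    Servers may omit Content-Length entirely, in which case this sketch just
    downloads the image as before.
    """
    with urllib.request.urlopen(url) as response:
        length = response.headers.get('Content-Length')
        if length is not None and int(length) > MAX_IMAGE_SIZE:
            return  # too big; silently drop it in this sketch
        done(response.read())
```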
@rwols Is this pull considered done?
Yes.
It may need docs though.
@gir-bot lgtm
Sounds good. Thanks for taking the time to push this along. Probably would have taken much longer if we were waiting for me 🙂.
I'll have to write up some docs on this. I'll probably mark it experimental in the beginning just to see how well it performs out in the wild, but I think this will be a really good addition.
@rwols FYI, I am going to rename `resolve_urls` to `resolve_images`, as that is a more accurate representation of what this function was added for and what it does.
Replace them with the base64-encoded image instead. The actual downloading of the images is left to the caller of the `resolve_urls` function. Let me know if I'm going in the right direction.
So far, you can run this in the ST console and inspect the resulting HTML with the images embedded as base64 data.
I've written basic resolver functions that don't spawn threads:

- `blocking_resolver`: literally blocks
- `ui_thread_resolver`: just calls `sublime.set_timeout` and then calls into `blocking_resolver`
- `worker_thread_resolver`: just calls `sublime.set_timeout_async` and then calls into `blocking_resolver`