This is a good start, I think.
I feel like we could get rid of the multiple HTML parsers and use a simple regex to find images and return their URL and position in the buffer. Then download all the images and enumerate the buffer backwards, replacing each src with the base64-encoded data. That way we really only have to parse the HTML once.
We just get back a list of images with start and end points in the buffer.
Well, maybe not a simple regex, but it's not too hard; I've done it in other projects.
For reference, I actually have a base64 plugin for Python Markdown that replaces local image src attributes with embedded base64: https://github.com/facelessuser/pymdown-extensions/blob/master/pymdownx/b64.py
It uses minimal effort to find images and target their source, and something like it could be leveraged to easily find the src attributes and return them with start and end points. I can probably code it up and post it here.
@rwols Here you go; it returns the src along with its start and end points:
```python
import re

IMAGES = ('.png', '.jpg', '.gif')
EXTERNAL = ('http://', 'https://')

RE_TAG_HTML = re.compile(
    r'''(?xus)
    (?:
        (?P<comments>(\r?\n?\s*)<!--[\s\S]*?-->(\s*)(?=\r?\n)|<!--[\s\S]*?-->)|
        (?P<open><(?P<tag>img))
        (?P<attr>(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*)
        (?P<close>\s*(?:\/?)>)
    )
    '''
)

RE_TAG_LINK_ATTR = re.compile(
    r'''(?xus)
    (?P<attr>
        (?:
            (?P<name>\s+src\s*=\s*)
            (?P<path>"[^"]*"|'[^']*')
        )
    )
    '''
)


def image_parser(text):
    """Retrieve image source attributes with external png, jpg, and gif sources."""

    images = []
    for m in RE_TAG_HTML.finditer(text):
        if m.group('comments'):
            # Skip anything matched inside an HTML comment.
            continue
        start = m.start('attr')
        m2 = RE_TAG_LINK_ATTR.search(m.group('attr'))
        if m2:
            # Strip the surrounding quotes from the src value.
            src = m2.group('path')[1:-1]
            if src[:8].lower().startswith(EXTERNAL) and src[-4:].lower().endswith(IMAGES):
                # Offsets of the URL (without quotes) in the original buffer.
                s = start + m2.start('path') + 1
                e = start + m2.end('path') - 1
                images.append((src, s, e))
    return images


if __name__ == "__main__":
    html = '<p><img src="http://www.fmwconcepts.com/misc_tests/FFT_tests/lena_roundtrip/lena.jpg"/></p>'
    print(image_parser(html))
```
Output:
[('http://www.fmwconcepts.com/misc_tests/FFT_tests/lena_roundtrip/lena.jpg', 13, 84)]
No HTML parsers needed. Then you can just slice at those points and rebuild your new string with the base64 data.
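Something like this rough sketch would do the splicing; `fetch_image_bytes` here is a hypothetical helper that downloads a URL and returns its raw bytes:

```python
import base64
import os

MIME_MAP = {'.png': 'image/png', '.jpg': 'image/jpeg', '.gif': 'image/gif'}


def embed_images(minihtml, images, fetch_image_bytes):
    """Splice base64 data URIs into the buffer at the spans found by image_parser.

    `images` is the (src, start, end) list returned by image_parser, and
    `fetch_image_bytes` is a hypothetical callable that returns the bytes for a
    URL. Working backwards keeps the earlier offsets valid while we splice.
    """
    for src, start, end in reversed(images):
        mime = MIME_MAP.get(os.path.splitext(src)[1].lower(), 'image/png')
        data = base64.b64encode(fetch_image_bytes(src)).decode('ascii')
        minihtml = minihtml[:start] + 'data:{};base64,{}'.format(mime, data) + minihtml[end:]
    return minihtml
```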
Forgot to filter out comments. Updated to do so.
Ended up factoring out all the HTML parsers. Since we know the positions, we just build up the new buffer with that knowledge. That removes a lot of the extra classes and keeps us from parsing the HTML multiple times.
```python
import base64
import functools
import re
import urllib.parse

import sublime

from . import frontmatter  # the plugin's local frontmatter helper module

IMAGES = ('.png', '.jpg', '.gif')
EXTERNAL = ('http://', 'https://')

RE_TAG_HTML = re.compile(
    r'''(?xus)
    (?:
        (?P<avoid>
            <\s*(?P<script_name>script|style)[^>]*>.*?</\s*(?P=script_name)\s*> |
            (?:(\r?\n?\s*)<!--[\s\S]*?-->(\s*)(?=\r?\n)|<!--[\s\S]*?-->)
        )|
        (?P<open><\s*(?P<tag>img))
        (?P<attr>(?:\s+[\w\-:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'))?)*)
        (?P<close>\s*(?:\/?)>)
    )
    '''
)

RE_TAG_LINK_ATTR = re.compile(
    r'''(?xus)
    (?P<attr>
        (?:
            (?P<name>\s+src\s*=\s*)
            (?P<path>"[^"]*"|'[^']*')
        )
    )
    '''
)


def image_parser(text):
    """Retrieve image source attributes with external png, jpg, and gif sources."""

    images = []
    for m in RE_TAG_HTML.finditer(text):
        if m.group('avoid'):
            # Skip script/style blocks and HTML comments.
            continue
        start = m.start('attr')
        m2 = RE_TAG_LINK_ATTR.search(m.group('attr'))
        if m2:
            src = m2.group('path')[1:-1]
            if src[:8].lower().startswith(EXTERNAL) and src[-4:].lower().endswith(IMAGES):
                s = start + m2.start('path') + 1
                e = start + m2.end('path') - 1
                images.append((src, s, e))
    return images


class _ImageResolver:
    def __init__(self, minihtml, resolver, done_callback, images_to_resolve):
        self.minihtml = minihtml
        # Start with everything up to the first src value.
        self.finalhtml = [minihtml[:images_to_resolve[0][1]]]
        self.done_callback = done_callback
        self.images_to_resolve = images_to_resolve
        self.resolved = 0
        total_images = len(images_to_resolve) - 1
        for index, value in enumerate(images_to_resolve):
            url, start, end = value
            next_index = index + 1
            if next_index > total_images:
                next_start = len(minihtml)
            else:
                next_start = images_to_resolve[next_index][1]
            # Each resolved image appends its data URI plus the slice of the
            # original minihtml between this src and the next one.
            resolver(url, functools.partial(self.on_image_resolved, url, end, next_start))

    def on_image_resolved(self, url, start, end, data):
        path = urllib.parse.urlparse(url).path.lower()
        if path.endswith(".jpg") or path.endswith(".jpeg"):
            mime = "image/jpeg"
        elif path.endswith(".png"):
            mime = "image/png"
        elif path.endswith(".gif"):
            mime = "image/gif"
        else:
            # last effort
            mime = "image/png"
        self.finalhtml.append("data:{};base64,{}".format(mime, base64.b64encode(data).decode("ascii")))
        self.finalhtml.append(self.minihtml[start:end])
        self.resolved += 1
        if self.resolved == len(self.images_to_resolve):
            self.finalize()

    def finalize(self):
        self.done_callback(''.join(self.finalhtml))


def format_frontmatter(values):
    """Format values as frontmatter."""
    return frontmatter.dump_frontmatter(values)


def blocking_resolver(url, done):
    import urllib.request
    # HTTPResponse has no readall(); read() returns the whole body.
    done(urllib.request.urlopen(url).read())


def ui_thread_resolver(url, done):
    sublime.set_timeout(lambda: blocking_resolver(url, done))


def worker_thread_resolver(url, done):
    sublime.set_timeout_async(lambda: blocking_resolver(url, done))


def resolve_urls(minihtml, resolver, done_callback):
    """
    Given minihtml containing <img> tags with a src attribute that points to an image located on the internet,
    download those images and replace the src attribute with embedded base64-encoded image data.

    The first argument is minihtml as returned by the md2html function.

    The second argument is a callable that shall take two arguments:
    - The first argument is a URL to be downloaded.
    - The second argument is a callable that shall take one argument:
      - An object of type `bytes`: the raw image data, i.e. the result of downloading the image.

    The third argument is a callable that shall take one argument:
    - A string that is the final minihtml containing embedded base64-encoded images, ready to be presented to ST.

    This function is non-blocking.
    """
    images = image_parser(minihtml)
    if images:
        return _ImageResolver(minihtml, resolver, done_callback, images)
    else:
        done_callback(minihtml)
        return None
```
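For what it's worth, calling it from a plugin could look roughly like this sketch (the popup call is just illustrative, not part of this PR):

```python
import sublime


def preview(view, minihtml):
    # Resolve external images on the worker thread, then hand the final
    # minihtml (with embedded base64 data) to whatever wants to display it.
    resolve_urls(
        minihtml,
        worker_thread_resolver,
        lambda final_html: view.show_popup(final_html, max_width=800),
    )
```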
I'm not necessarily against using an HTML parser, but if we do use one I'd like to limit it to a single parse. We don't have to use one at all, though 🙂.
If there's more than one `img` tag that points to the same URL, only one such image is fetched.
I've also added basic exception handling.
Good call on not downloading the same image.
Note that there's no caching going on. That is entirely up to the caller of `resolve_urls` (by implementing some sort of smarter `resolver` callable).
I don't think it would hurt us to keep a simple dictionary in the resolver with the image address as the key and the image data as the value.
With that said, it doesn't hurt us not to, either 🤷.
It may be a "better" experience if we did add the ability.
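A minimal sketch of such a memoizing resolver, wrapping the existing `worker_thread_resolver` (the names and structure are just a suggestion):

```python
_image_cache = {}  # url -> raw image bytes


def caching_resolver(url, done):
    """Serve repeated requests for the same URL from memory."""
    data = _image_cache.get(url)
    if data is not None:
        done(data)
        return

    def store_and_done(data):
        _image_cache[url] = data
        done(data)

    worker_thread_resolver(url, store_and_done)
```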
Caching does help a lot for basic UX. Perhaps a last remaining question is whether we should protect against absurdly large images?
This is an area I have less experience in. By which I mean: how do you prevent downloading really large images if you don't know how big they are before downloading? There must be a way to negotiate such things, since canceling downloads is clearly possible in many applications; I've just never had to deal with this situation.
We can do it by reading the Content-Length header before reading the response body; I think that's sufficient.
Sounds good to me. We can set an arbitrary limit. We could allow it to be user configurable, but I'm also fine with a hard, reasonably permissive size. I'm open to suggestions.
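A sketch of what that could look like, assuming an arbitrary 5 MiB cap (the exact limit, and what to do when it is exceeded, are open questions):

```python
import urllib.request

MAX_IMAGE_SIZE = 5 * 1024 * 1024  # arbitrary; could be made user configurable


def size_limited_resolver(url, done):
    """Skip images whose reported size exceeds the cap.

    Servers may omit Content-Length entirely, in which case this sketch just
    downloads the image as before.
    """
    with urllib.request.urlopen(url) as response:
        length = response.headers.get('Content-Length')
        if length is not None and int(length) > MAX_IMAGE_SIZE:
            return  # too big; silently drop it in this sketch
        done(response.read())
```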
@rwols Is this pull considered done?
Yes.
It may need docs though.
@gir-bot lgtm
Sounds good. Thanks for taking the time to push this along. Probably would have taken much longer if we were waiting for me 🙂.
I'll have to write up some docs on this. I'll probably mark it experimental in the beginning just to see how well it performs out in the wild, but I think this will be a really good addition.
@rwols FYI, I am going to rename `resolve_urls` to `resolve_images`, as that is a more accurate representation of what this function was added for and what it does.
Replace them with the base64-encoded image instead. The actual downloading of the images is left to the caller of the `resolve_urls` function. Let me know if I'm going in the right direction.
So far, you can run this in the ST console and inspect the resulting HTML with the images embedded as base64 data.
I've written basic resolver functions that don't spawn threads:

- `blocking_resolver`: literally blocks
- `ui_thread_resolver`: just calls `sublime.set_timeout` and then calls into `blocking_resolver`
- `worker_thread_resolver`: just calls `sublime.set_timeout_async` and then calls into `blocking_resolver`