alphapapa / org-web-tools

View, capture, and archive Web pages in Org-mode
GNU General Public License v3.0

Is it possible to support images? #51

Open dmitrym0 opened 2 years ago

dmitrym0 commented 2 years ago

Howdy @alphapapa, thanks for another amazing package!

I would love to download images as well. For instance, this article works just fine with eww-readable, but a couple of images are critical to understanding the context.

Looking at the org-web-tools code, it appears that images are not fetched at all and therefore cannot be displayed. Pandoc support may be another pitfall.

Am I on the right track, or are there other issues for supporting images that I'm not seeing?

alphapapa commented 2 years ago

Hi,

Thanks for the kind words. I'm glad it's useful to you.

Which command are you using? If you use org-web-tools-archive, you can have a gzip archive of a page, and wget can download the images if you choose.

The archive.is support doesn't seem to work anymore, and it was never officially supported, so I don't know if it would be possible to fix that. But for many Web pages (i.e. ones that don't require JavaScript to render), wget does a fine job of archiving the page and its content.
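For reference, a rough sketch of the kind of wget invocation that mirrors a single page plus its requisites and packs it into a gzip archive; the URL and directory name are placeholders, and these are not necessarily the exact flags org-web-tools-archive passes to wget:

#!/usr/bin/env bash
# Sketch: mirror one page plus its requisites, then pack it into a .tar.gz.
# URL and directory name are placeholders.
url='https://example.com/article'
dir='page-archive'

# -p fetches page requisites (images, CSS), -k rewrites links to the local
# copies, -E adds .html extensions, -H allows requisites from other hosts.
wget -E -H -k -p -nd -P "$dir" "$url"

tar -czf "${dir}.tar.gz" "$dir"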

dmitrym0 commented 2 years ago

I'm using org-web-tools-insert-web-page-as-entry.

You've given me an idea though. If I can use wget to download a reasonable facsimile of the web page, I can then convert it to an epub and use nov.el to read it.

It's hard to beat the convenience of org-web-tools though for inserting the content of an article at point!

alphapapa commented 2 years ago

I see. Yes, theoretically images could be downloaded to a directory, and they could be inserted into the Org content. Maybe a better way to handle that would be to have Wget download them and make the archive, then extract the archive and use Pandoc to convert it to Org content.
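Something along these lines, as a rough sketch of that idea; the URL and file names are placeholders, and the wget flags are one plausible choice rather than what the package would necessarily use:

#!/usr/bin/env bash
# Sketch: mirror the page and its images with wget, then hand the HTML
# to pandoc. URL and file names are placeholders.
url='https://example.com/article'
mkdir -p page && cd page || exit

# Download the page plus its requisites (images, CSS) into this directory,
# rewriting links so they point at the local copies.
wget -E -H -k -p -nd "$url"

# Convert the main HTML file to Org; image links come out as relative
# file links pointing at the files wget just downloaded.
html=$(ls -t -- *.html | head -n 1)
pandoc -f html -t org -o article.org "$html"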

dmitrym0 commented 2 years ago

I may have discovered an interesting alternative to eww-readable: Readability.js, Mozilla's reader-mode implementation. It understands inline images as well.

There's a web instance, readability-bot, that accepts a URL and renders the contents with Readability.js.
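In case it helps, here is one way to build that request URL for an arbitrary page; the endpoint just takes the target URL percent-encoded in the url parameter. The target URL is a placeholder, and using python3 for the encoding is an arbitrary choice:

#!/usr/bin/env bash
# Sketch: build a readability-bot request URL for an arbitrary article.
target='https://www.example.com/some/article'
encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$target")
echo "https://readability-bot.vercel.app/api/readability?url=${encoded}"

The resulting URL can then be fed to wget as in the workflow below.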

For a test workflow:

  1. Invoke readability bot with the URL you're interested in (in a temp directory):
wget -nd -H -erobots=off --convert-links -E -k -p \
  'https://readability-bot.vercel.app/api/readability?url=https%3A%2F%2Fwww.nytimes.com%2Fwirecutter%2Freviews%2Fbest-robot-vacuum'

This generates a nice, flat mirror of the page in readable mode. Note that I'm using index.html below to refer to the root HTML file I'm interested in, but I haven't found how to convince wget to output to that name yet (see the renaming sketch at the end of this comment).

  2. Invoke pandoc to generate an org file:
pandoc -f html -o output.org index.html

The org file now includes the images as well, though it's not quite clear how to manage all the assets just yet.

Alternatively, invoke pandoc to generate an epub:

pandoc -f html -t epub3 -o output.epub index.html

This has the benefit of being a fully self-contained document that can be read with nov.el and annotated with org-noter.

I had no idea pandoc supported EPUB generation. What an amazing piece of software.

Anyway, this is getting too long. Any thoughts on the best way to manage images when converting to org?
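For what it's worth, one way to work around the index.html naming and keep the assets together, as a rough sketch: since wget's -O would funnel every downloaded file into a single output, renaming the root document after the fact seems simpler. The file names below are guesses.

#!/usr/bin/env bash
# Sketch: run inside the temp directory used for the readability-bot wget
# call above. Rename the root HTML file to index.html, then convert it,
# leaving the image assets in place next to the .org file.

# With this flat, single-page mirror, the only .html file should be the
# readability-bot response itself (an assumption; adjust if more turn up).
root=$(ls -- *.html | head -n 1)
mv -- "$root" index.html

# Image links in the Org output stay relative, so the downloaded assets
# keep working as long as output.org lives next to them.
pandoc -f html -t org -o output.org index.html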

johnhamelink commented 1 year ago

Hi folks,

I wanted to build an archive of a blog I often refer to (gnuplotting.org) in an org-mode format.

I ended up writing this bash script:

#!/usr/bin/env bash

# This script retrieves the sitemap.xml file from the gnuplotting.org website,
# extracts a list of post names from it, and downloads the necessary data to display
# the site locally. It then converts the downloaded HTML content to org-mode format,
# allowing offline reading of the gnuplotting.org blog.

fqdn="www.gnuplotting.org"
url="http://${fqdn}"
sitemap="${url}/wp-sitemap-posts-post-1.xml"
post_slug_regex="<loc>http\:\/\/www\.gnuplotting\.org\/\K.+?(?=\/</loc>)"
content_selector="div#main"

useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36."

# Retrieve the sitemap.xml file, extract the post names from it into
# an array.
mapfile -t posts <<< "$(wget -qO- ${sitemap} | grep -Po "${post_slug_regex}")"

# For each post, crawl all the content we need to display it locally.
for post in "${posts[@]}"; do
    echo "Downloading ${post}"
    wget --quiet --convert-links --page-requisites --continue \
         --domains ${fqdn} --no-parent --level 5 \
         --user-agent="${useragent}" \
         -e robots=off --random-wait --restrict-file-names=unix \
         --max-redirect=0 --trust-server-names \
         "${url}/${post}/"
done

cd "${fdqn}"

echo "Building org-mode posts..."
for post in "${posts[@]}"; do
    #  Extract the main content from each post with htmlq, then
    #  convert to org with pandoc.
    cat "${post}/index.html" | htmlq -B "${content_selector}" | pandoc -f html -t org -o "${post}/${post}.org"
done

In my case, the website I was scraping was nice and simple; for more complex pages, rdrview might be useful to further cut down on cruft.
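For example, something along these lines, assuming rdrview's -H flag (print the extracted article as HTML) and -u (base URL for resolving relative links) behave as I remember; post.html and post.org are placeholder names:

#!/usr/bin/env bash
# Sketch: reduce a cluttered page to its readable article with rdrview,
# then convert the result to Org. The rdrview flags are assumptions.
url='http://www.example.org/some-post/'

rdrview -H -u "$url" post.html | pandoc -f html -t org -o post.org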

Once the files were in org-mode, I merged them all together with an index.org file, using #+INCLUDE: and org-org-export-to-org.
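As a sketch of how that index could be generated from the same posts array as the script above; the title and slugs below are placeholders so the snippet runs on its own:

#!/usr/bin/env bash
# Sketch: write an index.org that pulls every converted post in via
# #+INCLUDE:, matching the ${post}/${post}.org layout from the script above.
posts=(plotting-functions plotting-data)   # placeholder slugs

{
    echo "#+TITLE: Gnuplotting archive"
    echo
    for post in "${posts[@]}"; do
        echo "#+INCLUDE: \"${post}/${post}.org\""
    done
} > index.org

# Opening index.org in Emacs and running org-org-export-to-org expands the
# includes into one self-contained Org document.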

Lastly, I used macros to clean up the output, such as unfilling all paragraphs of text, positioning captions correctly below figures, and renaming the source language from "prettyprint" to "gnuplot".