Convert only heading matching link anchors

josephmturner commented 9 months ago

I'm looking into adding support for transcluding content over HTTP with org-transclusion.el, and it would be nice to be able to transclude lone headings. Perhaps when given "https://ushin.org/needs-list.html#care", org-web-tools-read-url-as-org could insert:

* [[#care][CARE]]

- Empathy
- Contribution
- Community
- Emotional safety
- Understanding
- Disclosure
- Compassion

instead of the whole page.

Is this in scope for org-web-tools?

Alternative ideas? AFAIK, Org-mode doesn't support search options for HTML links.

alphapapa commented 9 months ago

This is an interesting idea, but I'm not sure how universally it could be implemented. An anchor points to a tag in a position on a page; it doesn't point to a delimited piece of content. Some kind of heuristic would have to decide what content in the HTML to include, and that heuristic might not produce the expected results on various pages.

I think that something closer to what you're wanting would be better provided by filtering or selecting the HTML content with an XPath or CSS selector; there are some packages for that on MELPA. Then the filtered HTML could be passed through Pandoc and inserted as Org.

josephmturner commented 9 months ago

You're right that there's no universal solution. I tried:

(let ((output (plz 'get "https://ushin.org/needs-list.html#care"
                :as (lambda ()
                      (org-web-tools--html-to-org-with-pandoc (buffer-string) "#care")))))
  (with-current-buffer (get-buffer-create "output")
    (org-mode)
    (insert output)))

which sadly yields:

** [[#care][CARE]]

... so close! Changing the CSS selector to the unexpected selector #outline-container-care does the right thing.

(let ((output (plz 'get "https://ushin.org/needs-list.html#care"
                :as (lambda ()
                      (org-web-tools--html-to-org-with-pandoc (buffer-string) "#outline-container-care")))))
  (with-current-buffer (get-buffer-create "output")
    (org-mode)
    (insert output)))

gives

** [[#care][CARE]]

- Empathy
- Contribution
- Community
- Emotional safety
- Understanding
- Disclosure
- Compassion

But I only know to use #outline-container-care because I manually inspected the HTML.
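The manual inspection could be partly automated for pages exported by Org, which, like the page above, wrap each section in a div whose id is "outline-container-" followed by the heading's id. A minimal sketch, assuming the page has already been parsed into a DOM; the function name `my/org-export-container-or-target` is hypothetical and not part of org-web-tools:

```elisp
(require 'dom)
(require 'url-parse)

(defun my/org-export-container-or-target (dom url)
  "Return the element of DOM that best matches the fragment of URL.
Prefer an Org-export style \"outline-container-ID\" wrapper and fall
back to the element whose id is the fragment itself.  Return nil
when URL has no fragment or nothing matches."
  (let* ((target (url-target (url-generic-parse-url url)))
         ;; `dom-by-id' takes a regexp, so anchor and quote the id.
         (exact (lambda (id)
                  (car (dom-by-id dom (format "\\`%s\\'"
                                              (regexp-quote id)))))))
    (when target
      (or (funcall exact (concat "outline-container-" target))
          (funcall exact target)))))
```

On pages that do not follow the Org export convention, this simply falls back to the anchored element, so it does not remove the need for a heuristic.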

I guess this is another reason to use Org mode. :wink:

josephmturner commented 9 months ago

What do you think of this heuristic? If the anchor points to an h1/2/3/4/5/6, also include subsequent sibling elements up until we reach another heading of the same level or higher. For example, if the anchor points to an h3, we'll include subsequent content up until an h1/2/3, while h4/5/6 would be included. If the anchor points to an element besides h1/2/3/4/5/6, use it. If the anchor is invalid, use the whole DOM.

Try uncommenting various links. Each HTML file has a different structure, but IMO this function handles each cleanly.

(defun org-web-tools-read-url-as-org* ()
  (interactive)
  (pcase-let* ((link
                "https://jmp.chat/faq#jabber"
                ;; "https://scripter.co/looping-through-org-mode-headings/#org-map-entries-references"
                ;; "https://ushin.org/needs-list.html#care"
                ;; "https://ushin.org/needs-list.html#nonexistent-target"
                )
               (buf "output")
               ((cl-struct url filename target) (url-generic-parse-url link))
               (output (plz 'get link
                         :as (lambda ()
                               (let* ((dom (libxml-parse-html-region (point-min) (point-max)))
                                      (id-content
                                       (org-web-tools--target-content dom target))
                                      (id-content-as-html-string (with-temp-buffer
                                                                   (dom-print id-content)
                                                                   (buffer-string))))
                                 (org-web-tools--html-to-org-with-pandoc id-content-as-html-string))))))
    (with-current-buffer (get-buffer-create buf)
      (erase-buffer)
      (org-mode)
      (insert output))))

(defun org-web-tools--target-content (dom target)
  "Return DOM element(s) that correspond to the TARGET.
Since anchors may refer to headings but not the text following
the heading, this function may not return the expected element."
  (let ((id-element (and target  ; No fragment: fall through to the whole DOM.
                         (car (dom-by-id dom (format "\\`%s\\'"
                                                     (regexp-quote target)))))))
    (pcase (car id-element)
      ((and (or 'h1 'h2 'h3 'h4 'h5 'h6)
            heading-start)
       ;; HACK: If the HTML element matches a heading, then
       ;; include it and all subsequent elements inside parent
       ;; element until next heading of same level or higher.  See
       ;; <https://github.com/alphapapa/org-web-tools/issues/72>
       (let* ((siblings (dom-children (dom-parent dom id-element)))
              (heading-position (cl-position id-element siblings))
              (next-heading-position
               (cl-position
                nil siblings
                :start (1+ heading-position)
                :test (lambda (a b)
                        (and (not (stringp b))
                             (pcase (car b)
                               ((and (or 'h1 'h2 'h3 'h4 'h5 'h6)
                                     heading-end)
                                ;; Stop at a heading of the same or a higher
                                ;; level, e.g. h3 stops at h1/h2/h3 but not h4.
                                (not (string< (symbol-name heading-start)
                                              (symbol-name heading-end))))))))))
         (append '(div ())  ; Wrap in div so all elements are rendered
                 (cl-subseq siblings heading-position next-heading-position))))
      ('nil ; Invalid target: Return whole dom.
       dom)
      (_ ; Valid non-heading target: Return it.
       id-element))))
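The heading-level test above compares tag names as strings; a numeric comparison may be easier to read. The following is a minimal offline sketch of the same heuristic, run on a hand-written sibling list so it can be checked without fetching a page. The helper names `my/heading-level` and `my/section-after` are hypothetical and not part of org-web-tools.

```elisp
(require 'cl-lib)

(defun my/heading-level (node)
  "Return 1-6 if NODE is an h1-h6 element, else nil."
  (and (consp node)
       (cdr (assq (car node)
                  '((h1 . 1) (h2 . 2) (h3 . 3) (h4 . 4) (h5 . 5) (h6 . 6))))))

(defun my/section-after (siblings heading)
  "Return HEADING and the SIBLINGS that follow it.
Stop before the next heading of the same or a higher level."
  (let* ((start (cl-position heading siblings :test #'eq))
         (level (my/heading-level heading))
         (end (cl-position-if
               (lambda (node)
                 (let ((other (my/heading-level node)))
                   (and other (<= other level))))
               siblings :start (1+ start))))
    ;; A nil END means "until the last sibling".
    (cl-subseq siblings start end)))

;; Hand-written siblings standing in for a parsed page.
(let* ((h3 '(h3 ((id . "care")) "CARE"))
       (siblings (list '(h2 nil "Needs") h3
                       '(ul nil (li nil "Empathy"))
                       '(h4 nil "Detail") '(p nil "More")
                       '(h2 nil "Next") '(p nil "Unrelated"))))
  (mapcar #'car (my/section-after siblings h3)))
;; => (h3 ul h4 p)
```

The h4 and its paragraph stay inside the section, while the following h2 ends it, which matches the heuristic as described.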

If you think the heuristic is worth including in org-web-tools, I'll submit a PR that integrates this logic more cleanly into the existing codebase.

alphapapa commented 8 months ago

I like the way that code looks, but I'm not sure how suitable it would be for adding to this library. It seems to be for a very specific purpose. And it seems like few sites on the Web use H1-6 elements to organize their contents anymore, which would seem to limit its usefulness.

Of course, I'd be glad to feature it in documentation somewhere, so anyone that needs it could borrow the code.

Also, is there any way that XPath or CSS selectors could be used to accomplish the same thing? e.g. https://developer.mozilla.org/en-US/docs/Web/CSS/Next-sibling_combinator

josephmturner commented 8 months ago

While testing the above function on some of the Mozilla docs webpages, I realized that H1-6 are not the only HTML elements that need special-casing (<dd> should include the immediate next sibling, for example).
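One way the <dd> case might look, as a rough sketch only: return the anchored element together with its immediate next element sibling, skipping whitespace text nodes. The function name `my/element-with-next-sibling` is hypothetical, and which elements deserve this treatment is an open question.

```elisp
(require 'cl-lib)
(require 'dom)

(defun my/element-with-next-sibling (dom element)
  "Return a div wrapping ELEMENT and its next element sibling in DOM.
Text nodes (strings) between elements are skipped.  If ELEMENT has
no following element sibling, the div contains ELEMENT alone."
  (let* ((siblings (cl-remove-if #'stringp
                                 (dom-children (dom-parent dom element))))
         (next (cadr (memq element siblings))))
    (append '(div ())  ; Wrap in div so both elements are rendered
            (list element)
            (and next (list next)))))

;; Hand-written DOM standing in for a parsed page.
(let* ((dd '(dd ((id . "x")) "First definition"))
       (dom (list 'dl nil
                  '(dt nil "Term") "\n"
                  dd "\n"
                  '(dd nil "Second definition"))))
  (my/element-with-next-sibling dom dd))
;; => (div nil (dd ((id . "x")) "First definition") (dd nil "Second definition"))
```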

Since the code needs to know what element the link fragment points to before it can decide what heuristic to use, I don't know how an XPath or CSS selector alone would suffice.

I'm also unsure where this code should live. For now, I'll make the code handle some more specific HTML elements, and then I'll put it in a new file org-transclusion-http.el.

Thanks :)

EDIT: Another wrinkle is that esxml-query doesn't implement (in)direct sibling combinators (see esxml--find-nodes). Regardless, I personally find it easier to reason about code that deals with the dom rather than CSS selectors.
