Serialization to HTML & .docx?

photocyte commented 1 year ago

Mentioned during office hours on 2023-05-02.
Related to, but ultimately a simpler lift / subset of: https://github.com/Bioprotocols/labop/issues/195
Related to: https://github.com/Bioprotocols/labop/issues/158

I'd like feedback from folks on whether this proposal is a worthwhile goal.

Introduction

In short: I found a way to use Pandoc to convert HTML carrying specific markup into a .docx in which comments overlay particular words.

See the attached HTML file for an example source HTML. This is the appropriate pandoc command:

pandoc example_original.html --reference-doc='example_ref.docx' -t docx -o example.docx

(Ignore --reference-doc='example_ref.docx'; it only applies fonts and paper size from an existing template, e.g. 8.5"x11" vs. A4.)
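For anyone who wants to script the conversion, here is a minimal Python sketch that wraps the same pandoc invocation. The function names `build_pandoc_command` and `convert` are hypothetical; only the pandoc flags themselves come from the command above.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pandoc_command(
    src_html: str, out_docx: str, reference_doc: Optional[str] = None
) -> List[str]:
    """Build the argument list for the HTML -> .docx conversion."""
    cmd = ["pandoc", src_html, "-t", "docx", "-o", out_docx]
    if reference_doc is not None:
        # Optional: only borrows fonts / paper size from a template .docx.
        cmd.insert(2, f"--reference-doc={reference_doc}")
    return cmd


def convert(src_html: str, out_docx: str, reference_doc: Optional[str] = None) -> None:
    """Run pandoc; raises if pandoc is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(src_html, out_docx, reference_doc), check=True)
```

Calling `convert("example_original.html", "example.docx", "example_ref.docx")` would run the command shown above, provided pandoc is installed.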

So, the idea is that LabOP could "serialize" to an HTML+RDFa format, then to an HTML+docx-convertible-comments format (as exemplified by example_original.html), and finally to .docx, where the comments carry the context needed to back-translate the .docx into LabOP RDF. (This potential for back-translation is why I am calling it a serialization rather than a specialization.)
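As a rough illustration of the HTML+docx-convertible-comments step, the sketch below wraps a word in a pair of spans. It assumes the markup follows the `comment-start` / `comment-end` span convention that Pandoc documents for comments; the attached example may differ in detail. The helper name `annotate`, the default author, and the example IRI are all hypothetical.

```python
from html import escape


def annotate(text: str, comment: str, comment_id: int, author: str = "LabOP") -> str:
    """Wrap `text` in spans that (by assumption) Pandoc turns into a .docx comment.

    `comment` is where machine-readable context, such as an ontology IRI,
    would be stashed for later back-translation.
    """
    cid = escape(str(comment_id), quote=True)
    start = (
        f'<span class="comment-start" id="{cid}" '
        f'author="{escape(author, quote=True)}">{escape(comment)}</span>'
    )
    end = f'<span class="comment-end" id="{cid}"></span>'
    return f"{start}{escape(text)}{end}"


# Example with a made-up IRI:
snippet = annotate("water", "https://example.org/water", 0)
```

Each comment needs a unique id so that the start and end spans pair up.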

example_original.html.zip

Ordered lists can be conserved between HTML and .docx

A key point: Pandoc converts an HTML ordered list into an analogous ordered list object in .docx. This brings some user conveniences, such as consistent formatting and automatic renumbering when steps are added or removed.

In contrast, Markdown has competing interpretations of how to handle ordered lists. Standard Markdown does not handle "lettered" ordered lists (i.e. a, b, c...); Pandoc supports them only through its fancy_lists extension. See: https://pandoc.org/MANUAL.html#extension-fancy_lists

If you try to make multi-level ordered lists in Markdown, you will find that the major Markdown renderers interpret them in mutually incompatible ways.
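To make the HTML side concrete, here is a small Python sketch that renders nested protocol steps as nested `<ol>` elements, with numbers at the top level and letters one level down. The `render_steps` helper and the step data are hypothetical, and whether Pandoc carries the `type` numbering style through to the .docx is an assumption I have not verified here.

```python
from html import escape
from typing import List, Tuple, Union

# A step is either plain text, or (text, list of sub-steps).
Step = Union[str, Tuple[str, List["Step"]]]

LIST_TYPES = ["1", "a", "i"]  # numbering style per nesting depth


def render_steps(steps: List[Step], depth: int = 0) -> str:
    """Render steps as an HTML ordered list, recursing into sub-steps."""
    list_type = LIST_TYPES[depth % len(LIST_TYPES)]
    items = []
    for step in steps:
        if isinstance(step, str):
            items.append(f"<li>{escape(step)}</li>")
        else:
            text, substeps = step
            nested = render_steps(substeps, depth + 1)
            items.append(f"<li>{escape(text)}{nested}</li>")
    return f'<ol type="{list_type}">' + "".join(items) + "</ol>"


html_list = render_steps(["Thaw cells", ("Prepare buffer", ["Add salt", "Add water"])])
```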

HTML is a nice format in and of itself for viewing protocols

In example_original.html, I show the use of the <details> and <summary> tags, to make little dropdown menus for unfolding the text. It's a nice feature that is "free" in HTML5. No dynamic javascripty stuff required.

One can also add HTML5 checkboxes (purely cosmetic, not tied to any form logic) to turn the protocol into a simple checklist for the user.
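A minimal sketch of what generating such a step could look like, combining a checkbox with a `<details>` / `<summary>` dropdown. The `checklist_step` helper and its layout are hypothetical and not taken from example_original.html.

```python
from html import escape


def checklist_step(title: str, detail: str) -> str:
    """Render one protocol step as a checkbox plus a collapsible detail block."""
    return (
        f'<label><input type="checkbox"> {escape(title)}</label>'
        "<details>"
        "<summary>Details</summary>"
        f"<p>{escape(detail)}</p>"
        "</details>"
    )


step_html = checklist_step("Centrifuge sample", "5 min at 4 C")
```

The checkbox is kept outside the `<details>` element so that ticking it does not interfere with expanding or collapsing the text.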

Both HTML and .docx can embed files

Another key point is that both HTML and .docx offer ways to embed files:

In HTML, you can use <a> tags whose href is a data URI containing the base64-encoded file. (shown in the image above)
In .docx, you can use an OLEObject to embed arbitrary files. (I don't know exactly how this works; this knowledge comes from poking around in the .docx XML.) Notably, such embedded files can also be downloaded and viewed when the .docx is opened in Google Docs.

The idea is that a LabOP protocol serialized to .docx could package the necessary context (the ontologies used, the original RDF).
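On the HTML side, the data URI approach can be sketched in a few lines of Python. The helper name `embed_as_data_uri`, the file name, and the MIME type in the example are hypothetical.

```python
import base64
from html import escape


def embed_as_data_uri(
    data: bytes, filename: str, mime_type: str = "application/octet-stream"
) -> str:
    """Return an <a> tag whose href is a base64 data URI holding `data`."""
    payload = base64.b64encode(data).decode("ascii")
    return (
        f'<a download="{escape(filename, quote=True)}" '
        f'href="data:{mime_type};base64,{payload}">{escape(filename)}</a>'
    )


# Example: embed a (tiny, made-up) Turtle file holding the original RDF.
link = embed_as_data_uri(b"hi", "protocol.ttl", "text/turtle")
```

The `download` attribute suggests a file name to the browser when the link is clicked.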

I don't believe Pandoc would convert a data URI embedded in HTML into an embedded file in .docx. So, presumably, carrying a file embedded via data URI in the HTML+RDFa or HTML+docx-convertible-comments through to the .docx would require some custom .docx XML editing.
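Since a .docx is a zip archive, one way to check whether a file actually ended up embedded is to list the archive's contents. This sketch assumes that embedded objects sit under `word/embeddings/`, which is where Word conventionally places them; the function name `list_embedded_parts` is hypothetical.

```python
import io
import zipfile
from typing import List


def list_embedded_parts(docx_bytes: bytes) -> List[str]:
    """List archive members under word/embeddings/ (assumed location)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/embeddings/") and not name.endswith("/")
        )
```

An empty result after conversion would be consistent with the guess above that Pandoc does not carry data URIs over as embedded files.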

Microsoft Word has user conveniences for preserving & duplicating the comments

A last key point is that when text is copy-pasted in Microsoft Word, its comments follow it. This presumably gives a totally oblivious user an easy way to make (*minor) edits to a LabOP protocol serialized to .docx and still have it back-translate into LabOP RDF.

*=Making the user's editing of the .docx, and the back-translation of the LabOP protocol into its RDF representation, robust is a much larger question and is not being proposed. The "MVP" solution is simply to have the back-translator throw an error if the user edits the LabOP .docx to the point where it cannot be trivially interpreted back into LabOP RDF.
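A minimal sketch of that "throw an error" behavior, assuming the back-translator starts by reading the comments part of the .docx (`word/comments.xml`) and refuses to continue if it is missing or a comment is empty. The names `BackTranslationError` and `read_comments` are hypothetical; a real back-translator would go on to interpret the comment text as RDF.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class BackTranslationError(ValueError):
    """Raised when the .docx can no longer be trivially back-translated."""


def read_comments(docx_bytes: bytes) -> Dict[str, str]:
    """Map comment id -> comment text, failing loudly on anything unexpected."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        if "word/comments.xml" not in archive.namelist():
            raise BackTranslationError("no comments part found in .docx")
        root = ET.fromstring(archive.read("word/comments.xml"))

    comments: Dict[str, str] = {}
    for comment in root.iter(f"{{{W_NS}}}comment"):
        comment_id = comment.get(f"{{{W_NS}}}id")
        # Concatenate all text runs inside the comment.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        if comment_id is None or not text.strip():
            raise BackTranslationError("comment is missing its id or its text")
        comments[comment_id] = text
    return comments
```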

Markdown is a lift for users, both to view and to edit

I think .docx is preferable to Markdown, as someone can apply their own formatting (bolding, changing fonts and font sizes, printing it out exactly as they want) without needing to understand Markdown at all. (Understanding Markdown is a lift for most laboratory users.)

Markdown is meant for authoring for the web, whereas .docx is meant for authoring for paper. Unfortunately, in the laboratory science world, paper lab notebooks and post-it notes of protocols still reign supreme, so I believe LabOP needs to give the paper workflow first-class support.

photocyte commented 1 year ago

This was mentioned in the 2023-05-02 office hours. My notes:

"Tammi "IntentParser" for SD2 in Google Docs. In brief, this was an add-on for Google Docs that would look up potentially relevant ontology terms from human-written, non-ontology-linked text. It also stashed the ontology/machine-readable representation in .docx-style comments."

danbryce commented 1 year ago

Here is a link to the article describing the Intent Parser: https://pubs.acs.org/doi/abs/10.1021/acssynbio.1c00285


