executablebooks / MyST-NB

Parse and execute ipynb files in Sphinx
https://myst-nb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Develop text representation of IPYNB format #14

Closed mmcky closed 4 years ago

mmcky commented 5 years ago

Develop a fully defined specification between the machine readable IPYNB and a text based representation. The emphasis will be on using one of the existing representations as much as possible (i.e. Rmarkdown).

mmcky commented 5 years ago

Resources:

https://nbformat.readthedocs.io/en/latest/
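
For reference, the nbformat package exposes the machine-readable side of the mapping directly. A minimal sketch of reading and building notebooks with it (the file names here are placeholders):

```python
import nbformat

# Read a notebook in the current (v4) schema.
nb = nbformat.read("example.ipynb", as_version=4)

# Each cell carries a type, source text, and arbitrary metadata;
# code cells additionally carry outputs and an execution count.
for cell in nb.cells:
    print(cell.cell_type, dict(cell.metadata))
    if cell.cell_type == "code":
        print("  outputs:", len(cell.outputs))

# Building a notebook programmatically works the same way in reverse,
# which is what a text-to-ipynb converter would do.
new_nb = nbformat.v4.new_notebook(
    cells=[
        nbformat.v4.new_markdown_cell("# A heading"),
        nbformat.v4.new_code_cell("print('hello')", metadata={"tags": ["hide-input"]}),
    ]
)
nbformat.write(new_nb, "generated.ipynb")
```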

jlperla commented 5 years ago

One format worth putting some serious analysis into is the markdown variations in https://github.com/JunoLab/Weave.jl and https://github.com/mpastell/Pweave

In particular, Weave.jl has a lot of tried and tested variations. If you look at it, a few comments:

None of this is Julia-specific, so the parser and code could be ported to Python (or just the specification used as inspiration).

One benefit of this approach is the existing Weave-based editing tooling: https://github.com/JunoLab/language-weave and https://marketplace.visualstudio.com/items?itemName=jameselderfield.language-weave

Another format to consider for inspiration is Documenter.jl, described at https://juliadocs.github.io/Documenter.jl/stable/, which has many gems

jlperla commented 5 years ago

When looking at this and the architecture of Weave (and others), the keys are:

For what it is worth, the only thing I have strong feelings on is that Jupyter should be one of many outputs and not an intermediate format. I can give you more of my hunches here at some point, but consider that Weave, Pweave, RMarkdown, and others all execute code outside of nbconvert.

jlperla commented 5 years ago

Finally, consider moving to minted for the LaTeX output in any templates; it does a beautiful job of typesetting code and relies on Pygments. Here is an example of the setup: https://github.com/baggepinnen/configs/blob/master/scripts/install_pygments_julialexer.bash

choldgraf commented 4 years ago

One thing to note is that IMO we are talking about two separate things here:

  1. What enriched flavor of markdown to use in order to embed more complex information in a Jupyter notebook structure (e.g. RMarkdown)
  2. What markdown structure we wish to use to denote "cells" as well as metadata about cells (such as the cell type, tags, etc)
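
To make point (2) concrete, here is a rough Python sketch of what a cell-denoting layer could look like. The `{code-cell}` fence marker and the splitting logic are purely illustrative assumptions, not a proposed syntax:

```python
import re

# Hypothetical cell marker: a fenced block opened with "{code-cell}" plus a language name.
FENCE = "`" * 3
CELL_RE = re.compile(
    rf"^{FENCE}\{{code-cell\}}[ \t]*(?P<lang>\S*)\n(?P<body>.*?)^{FENCE}\s*$",
    re.MULTILINE | re.DOTALL,
)

def split_cells(text):
    """Split a text document into (cell_type, source) pairs."""
    cells, last = [], 0
    for match in CELL_RE.finditer(text):
        md = text[last:match.start()].strip()
        if md:
            cells.append(("markdown", md))
        cells.append(("code", match.group("body").rstrip()))
        last = match.end()
    tail = text[last:].strip()
    if tail:
        cells.append(("markdown", tail))
    return cells

doc = f"""# A tiny example

Some prose before the code.

{FENCE}{{code-cell}} python
print("hello")
{FENCE}

More prose after the code.
"""

for cell_type, source in split_cells(doc):
    print(cell_type, "|", source.splitlines()[0])
```

Cell-level metadata (tags, cell type, etc.) would then hang off whatever options syntax the marker allows, which is exactly the design choice under discussion here.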

@jlperla I'd love to hear your thoughts on why notebooks shouldn't be an intermediate format. It's an approach that I've taken with Jupyter Book, and think it has worked quite nicely. I'll take a look at the links you've shared and think a bit about the other projects you've mentioned.

jstac commented 4 years ago

I think @mmcky has a similar idea with our Jupinx system: Jupyter notebooks are one of many outputs and not an intermediate form. Execution of code stored in text source takes place at a higher level, and the results are injected into the ipynbs. I'll leave it to @mmcky to correct me or elaborate.

jstac commented 4 years ago

From the end-user perspective, here are a few comments:

First, we want to support user A, who is not a particularly strong programmer and likes WYSIWYG but has important things to say. For example, he/she knows a bit of pandas or R and has 10 Jupyter notebooks on NLP, say. He/she wants to use our tools to turn them into a book without hassle or fuss. To my mind, this should be as simple as

  1. putting them in a common directory
  2. entering their file names into a plain text file associated with a table of contents --- all they have to do is list the file names in the order they want them to appear.
  3. pointing a binary at this directory and running it.

(Personally I would like our project to also help them with web-based publishing of static content as the fourth step. User A will definitely appreciate this.)

From there on, they should still have the option to maintain their "book" by editing directly in the notebooks.

At the same time, we have user B (people like me) who is handling many lectures with a distributed team and cannot function without a text based representation of the underlying notebooks. So we need something like RMarkdown, possibly with extra bells and whistles.

To support users A and B from within one system in a clean way, we need a slick two-way mapping between text source files and ipynb files (minus their outputs, of course).

Now

So question 1 is, is it actually possible to set up a one-to-one mapping between a variant of markdown and ipynb files that feature eq numbers, fig numbers, cites, cross-refs, etc?

Surely yes, right?

If yes, then

jlperla commented 4 years ago

One thing to note is that IMO we are talking about two separate things here

Yes, but the two are related because they need to have the same feature set.

Step 1: create the markup specification which represents all of the variations required for the target outputs.

In the process of creating the QuantEcon lectures, they have come across a large number of these, and there are plenty more on the horizon. Keep in mind that almost all of these came out of necessity in real-world typesetting, rather than being an elaborate set of hypothetical features. I think that Jupyter Book does a lot less in terms of features, so you may not have run into these things.

Now, take all of that stuff and say you create the markup specification to represent all of these things... then

Step 2: come up with the intermediate format which represents the superset of these things at parse time.

The issue with using Jupyter as the intermediate format is that it only represents a subset of the features. In fact, it is close to the smallest subset of features! Jupyter doesn't even have equation references, for example! Pandoc has all sorts of automatic conversion utilities, but the goal of pandoc is very different in that it is trying to get a "good enough" conversion for anything rather than a high-quality output.

Of course, since jupyter is JSON you could theoretically shove a bunch of stuff around the jupyter specification, but at that point you are designing a new intermediate file format that is only superficially connected to jupyter.

Step 3: converting from an internal format to the various outputs.

@mmcky can tell me if I am wrong, but the way that Jupinx does it right now, it effectively does use Jupyter as an intermediate format.

@mmcky may disagree with me entirely, but I think anytime there is preprocessing and post-processing of this sort, it suggests a bottleneck in the underlying intermediate form... and that bottleneck is the Jupyter format (which is required for nbconvert execution)! Why not just generate the outputs directly from the intermediate form (as others do) and have variations on Jupyter notebooks generated as just one type of output?

Take all of this with a grain of salt, of course; this is just me putting on my old software-engineer hat. Using Jupyter as an intermediate format in the generation process just doesn't smell right. But I have been wrong many times before.

jlperla commented 4 years ago

To support users A and B from within one system in a clean way, we need a slick two-way mapping between text source files and ipynb files (minus their outputs, of course).

You could have a one-directional jupyter -> markdown converter no problem, of course.

Are you sure bidirectional is feasible? Again, it comes down to Jupyter supporting only a limited subset of the functionality, which means a bijection isn't possible. And as soon as there isn't a bijection, people change one side, start editing the ipynb, and are then confused when things are lost going backwards.

For simple usability, RST is too scary for casual users, but markdown with embedded code blocks is very clean... and with proper tooling in Atom/vscode, it can even be easy to edit wysiwyg style (e.g. try the Weave extension for Atom, or Hydrogen for executing cells).

As an example, see https://github.com/JunoLab/Weave.jl/blob/master/examples/FIR_design.jmd for the obvious markdown based representation for Weave. Basically "standard" markdown with a YAML header where the enhancements are only needed when you get fancy.

I am not sure that a format along these lines is scarier to edit than Jupyter directly. It might even be easier to work with, and helps discipline people in using a format that works with git. I believe that the R approaches are similar in getting people to write in markdown directly.

jstac commented 4 years ago

Thanks for all your thoughts @jlperla.

One small comment is that we might not actually want to support all features of all output formats, if we think the cost is too high. There might be a few cases where, say, LaTeX can do XYZ but we guide the user away from that way of doing things because it's going to bloat the text specification or complicate the internals --- and there are reasonable alternatives.

I can't think of particular examples now, but let's leave ourselves open to the possibility of imposing some restrictions if we think the trade-off is worthwhile.

jlperla commented 4 years ago

One small comment is that we might not actually want to support all features of all output formats,

Yes, completely. Which is another good reason to have a flexible intermediate representation where things can be ignored (e.g. you don't need to have unit tests for all formats, and various sorts of html formatting blocks would be ignored in generating jupyter notebooks, for example).

But I think it also raises an issue of whether there are current features that are not worth it for any output formats because they would necessitate getting them working for all formats. The one that jumps to my mind is the ability to have links to subsections in different existing documents. It is nice, and Sphinx makes it work well, but I think we could live without it.

jstac commented 4 years ago

Ah, no links to subsections across files! I don't know if I can live without those :-)

We have to be careful or @mmcky will jump in and tell us to forget about markdown and just use rst :-)

I mean, bookdown has these kinds of cross-references, doesn't it?

jstac commented 4 years ago

Regarding @jlperla's question about feasibility of the bi-directional mapping between souped-up md files and ipynb files, can anyone come up with a counterexample?

The question isn't actually well-posed, since we haven't specified what the souped up markdown spec is. But I guess this kind of thing is a problem: We use some sort of markdown syntax in the source file that maps to numbered equations in the notebooks. The numbered equations are achieved by injecting html into the md cells in the notebooks. But can the html be mapped back to the original syntax in the source file?

So what do we do? Impose restrictions on the kinds of ipynb files we can handle? Or change the way that Jupyter Notebooks parses the md cells to avoid injecting html in the first place?

jstac commented 4 years ago

Regarding your intermediate representation comments above, @jlperla, I think @mmcky perceives the same problems and wants to make changes to avoid them, with some open heart surgery on jupinx over the Australian summer.

jlperla commented 4 years ago

But can the html be mapped back to the original syntax in the source file?

I am sure you could do it, in theory, with enough suffering. But that is incredibly fragile because Jupyter is an interactive output format rather than an intermediate markup. And what about features not supported in Jupyter? Those get lost going in one direction... In my mind, all of the escaping madness (e.g. $ on various platforms) is a good enough reason not to try to go back and forth bi-directionally.

Is it worth it just to edit in Jupyter? Try using Hydrogen+Atom with even the existing RST, or Weave.jl+Atom with shift-enter and the Julia integration to execute cells, and see if you like it. I strongly prefer it to working in Jupyter for writing permanent material. Jupyter is a great interactive platform and mechanism for distributing interactive material - but it is confusing as an editor for writing material that belongs in a git repository.

And just to be clear, I am not suggesting we use Weave - just that Weave.jl and Pweave provide an example of an alternative editing platform and execution environment that works pretty well for writing single-notebook lectures and has a flexible Mustache-based template setup.

mmcky commented 4 years ago

Hey @jstac @jlperla I only have my mobile on me, so I'll chime in on this thread once I get home later today.

mmcky commented 4 years ago

There are a lot of things to comment on here. I plan to put together some diagrams with proposed linkages between these different layers, as I think that is a good way to represent a lot of this discussion. I will add some comments in the order they appeared above:

minted for LaTeX (@jlperla)

I agree -- it makes for much easier-to-read LaTeX files (as an output), which I think is a good goal. It should certainly be an option (if not the default choice).

What markdown structure we wish to use to denote "cells" @choldgraf

This is a key design choice. I see some approaches use a comment-style structure to denote information to the interpreter about structure (within the markdown documents). I don't love that choice - but I think this will form a key part of the initial design discussion and effort around the format choice. I found the hybrid approach used by ipypublish really interesting, as it essentially leverages the best of both worlds by adopting an internal node-type approach as opposed to an entire-document-type approach.

notebooks as an intermediate format

This is indeed how Jupinx currently works. It uses Sphinx to convert from rst to ipynb, and then the builder can execute, convert, and generally copy files etc. to make websites and pdf files from the generated ipynb (executed or not). The notebook format does allow cell- and document-level metadata, which has been useful to support passthrough information for theme-supported elements such as code-collapse cells. These metadata tags are used by the relevant html or pdf converter. The main issue I have found is that we produce 4 different sets of notebooks:

  1. notebooks that support all the features for html generation
  2. download notebooks that support remote images and figures
  3. coverage notebooks that allow for code-execution testing and report generation (and include additional test code blocks)
  4. pdf notebooks for rendering pdf pages

This makes compilation slow as there is a lot of redundant code execution. So it basically works by

RST -> SPHINX -> IPYNB SETS -> EXECUTION -> CONVERSION -> HTML, PDF, ...
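
For context, the EXECUTION step here goes through nbconvert; per-notebook execution with nbconvert's ExecutePreprocessor looks roughly like this (file names are placeholders):

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load a generated notebook and execute it in place with a fresh kernel.
nb = nbformat.read("lecture.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})

# Write the executed notebook back out; conversion to HTML/PDF happens afterwards.
nbformat.write(nb, "lecture.executed.ipynb")
```

Because each of the notebook sets above is executed separately, the same code runs several times, which is where the redundancy comes from.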

Our next iteration of Jupinx will explore whether we can move code execution up a level into Sphinx as a transform (one of the four main Sphinx execution stages) and build an execution engine using dask at the code-block level. If we can move execution up a level, we should be able to re-use outputs for the different targets (html, pdf) and leverage direct translation using the HTML and PDF writers directly from the Sphinx doctree. The thinking is we would do this via a Jupyter kernel and pass code blocks through ZMQ etc. I think this would be a good design approach.

RST -> SPHINX + EXECUTION -> IPYNB, HTML, PDF ...
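
A rough sketch of the code-block-level execution idea, driving a kernel directly with jupyter_client (the kernel name and code blocks are placeholders, and the dask scheduling layer is omitted):

```python
from jupyter_client.manager import start_new_kernel

# Start one kernel and stream each code block through it over ZMQ,
# collecting the outputs so they can be re-used by every target (HTML, PDF, ipynb).
km, kc = start_new_kernel(kernel_name="python3")

code_blocks = ["x = 1 + 1", "print(x)"]  # would come from the Sphinx doctree
outputs = []

try:
    for code in code_blocks:
        block_outputs = []
        kc.execute_interactive(code, output_hook=block_outputs.append)
        outputs.append(block_outputs)
finally:
    kc.stop_channels()
    km.shutdown_kernel()

# Each entry in `outputs` is a list of raw IOPub messages (stream, execute_result, ...).
print(outputs[1])
```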

The alternative would be to build a tool that executes collections of Jupyter notebooks, which could be designed for execution testing and error reporting but could also have cached execution. But it might be more difficult to track common inputs when the notebooks are already written by Sphinx, so the cache would act on each pipeline instead (which would be a bit less efficient) but could be a useful tool for notebook collections.
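
A sketch of the caching idea: key each notebook's execution on a hash of its code cells and skip re-execution when nothing has changed. The cache layout and paths here are made up purely for illustration:

```python
import hashlib
import json
from pathlib import Path

import nbformat

CACHE_DIR = Path(".execution_cache")

def code_hash(nb):
    """Hash only the code-cell sources, so markdown edits don't invalidate the cache."""
    sources = [cell.source for cell in nb.cells if cell.cell_type == "code"]
    return hashlib.sha256(json.dumps(sources).encode("utf-8")).hexdigest()

def needs_execution(path):
    nb = nbformat.read(path, as_version=4)
    stamp = CACHE_DIR / (Path(path).stem + ".sha256")
    if stamp.exists() and stamp.read_text() == code_hash(nb):
        return False  # cached outputs are still valid
    return True

def record_execution(path):
    nb = nbformat.read(path, as_version=4)
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / (Path(path).stem + ".sha256")).write_text(code_hash(nb))
```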

feasibility of the bi-directional mapping (Text Representation, IPYNB) (@jstac, @jlperla)

I think this is very important and feasible. It won't work with basic markdown, but it should with an extended version of markdown (hopefully without creating our own). Bi-directionality will allow for much easier editing and circular flow between formats. It also allows for text-based and notebook-based inputs. I think the way forward here is to start building a one-to-one specification table of all required features, with an emphasis on adopting as much of an existing markup as possible (i.e. RMarkdown or the ipypublish mixed approach).

jlperla commented 4 years ago

Bi-directionality will allow for much easier editing and circular flow between formats. It also allows for text-based and notebook-based inputs.

I will push back (for the last time!) on this being architecturally feasible (without significant, painful investment in something fragile), but also on bi-directional being a good goal in general. There are other ways to support entry-level users.

The technical issue is the fragility of the round trip

Consider using advanced features of the markup (e.g. labeling of equations), generating the notebook, and then editing the notebook - moving a cell around, editing an equation, etc. Even if the notebook only lets you do a simple subset of features, you need to deal with the round trip of the full model, including every feature now and in the future! It will radically change the design - for little benefit in my mind.

What if they accidentally delete or modify a hidden JSON cell that includes crucial metadata? Or if it is in generated HTML for displaying some feature in Jupyter that is cleaner in HTML/PDF. What if they copy/paste a cell and it doesn't bring around hidden JSON? What if they edit the generated HTML you are using for notebook layouts? What if they move a footnote anchor? etc. I do not see how that wouldn't be incredibly fragile at best, and suck up a huge portion of your engineering, testing, and support resources at worst.

Not to mention, you then need to write a specification for how the underlying markdown/etc. is represented in Jupyter (even for things that don't make sense in Jupyter!) or else a round trip wouldn't be possible. That takes time to specify, test, and debug, and you get no features out of it. Dealing with all of the escaping differences alone could be a pain - especially if it is multi-lingual.

But more importantly, what is the benefit?

For users who just want to generate a few HTML or PDF files from a notebook, that export functionality is already there in JupyterLab out of the box. That is not a use case.

For the slightly more advanced use cases, RMarkdown/knitr/Weave/etc. have proven it is well within a low-tech user's capacity to edit a simple .md file for generating content.

What is the alternative?

Consider the following (without any roundtrip ipynb!):

I think the way forward here is to start building a one-to-one specification table of all required features, with an emphasis on adopting as much of an existing markup as possible (i.e. RMarkdown or the ipypublish mixed approach).

Completely agree. I will post a strawman approach later. But one approach is to follow the Julia specifications as much as possible, which are already about as consistent with R/knitr as possible. In particular

These seem to be about as consistent with RMarkdown as possible (especially in the code-chunking options), but there is a lot missing in RMarkdown for multi-page references etc.

This is my last unsolicited comment

Unless you ask me for direct advice, this is my last comment on this topic. Happy to discuss the details on slack or review docs if you wish.

But my suggestion is to specify the features you need, and the markdown variations that you wish to use. If you start thinking about ipynb as an intermediate format, or bidirectional ipynb as being an essential feature, it will distort your design choices very easily.