executablebooks / MyST-NB

Parse and execute ipynb files in Sphinx
https://myst-nb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Develop text representation of IPYNB format #14

Closed mmcky closed 4 years ago

mmcky commented 5 years ago

Develop a fully defined specification between the machine readable IPYNB and a text based representation. The emphasis will be on using one of the existing representations as much as possible (i.e. Rmarkdown).

mmcky commented 5 years ago

Resources:

https://nbformat.readthedocs.io/en/latest/
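
For reference, the nbformat package exposes the machine-readable side of the mapping directly. A minimal sketch of reading and building notebooks with it (the file names here are placeholders):

```python
import nbformat

# Read a notebook in the current (v4) schema.
nb = nbformat.read("example.ipynb", as_version=4)

# Each cell carries a type, source text, and arbitrary metadata;
# code cells additionally carry outputs and an execution count.
for cell in nb.cells:
    print(cell.cell_type, dict(cell.metadata))
    if cell.cell_type == "code":
        print("  outputs:", len(cell.outputs))

# Building a notebook programmatically works the same way in reverse,
# which is what a text-to-ipynb converter would do.
new_nb = nbformat.v4.new_notebook(
    cells=[
        nbformat.v4.new_markdown_cell("# A heading"),
        nbformat.v4.new_code_cell("print('hello')", metadata={"tags": ["hide-input"]}),
    ]
)
nbformat.write(new_nb, "generated.ipynb")
```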

jlperla commented 5 years ago

One format worth putting some serious analysis into is the markdown variations in https://github.com/JunoLab/Weave.jl and https://github.com/mpastell/Pweave

In particular, Weave.jl has a lot of tried and tested variations. If you look at it, a few comments:

None of this is Julia-specific, so the parser and code could be ported to Python (or just the specification used as inspiration).

One benefit of this approach is the existing Weave-based editing tooling: https://github.com/JunoLab/language-weave and https://marketplace.visualstudio.com/items?itemName=jameselderfield.language-weave

Another format to consider for inspiration is Documenter.jl, described at https://juliadocs.github.io/Documenter.jl/stable/, which has many gems

jlperla commented 5 years ago

When looking at this and the architecture of Weave (and others), the keys are:

For what it is worth, the only thing I have strong feelings on is that Jupyter should be one of many outputs and not an intermediate format. I can give you more of my hunches here at some point, but consider that Weave, Pweave, RMarkdown, and others all execute code outside of nbconvert.

jlperla commented 5 years ago

Finally, consider moving to minted for the LaTeX output in any templates; it does a beautiful job of typesetting code and relies on Pygments. Here is an example of the setup: https://github.com/baggepinnen/configs/blob/master/scripts/install_pygments_julialexer.bash

choldgraf commented 4 years ago

One thing to note is that IMO we are talking about two separate things here:

  1. What enriched flavor of markdown to use in order to embed more complex information in a Jupyter notebook structure (e.g. RMarkdown)
  2. What markdown structure we wish to use to denote "cells" as well as metadata about cells (such as the cell type, tags, etc)
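
To make point (2) concrete, here is a rough Python sketch of what a cell-denoting layer could look like. The `{code-cell}` fence marker and the splitting logic are purely illustrative assumptions, not a proposed syntax:

```python
import re

# Hypothetical cell marker: a fenced block opened with "{code-cell}" plus a language name.
FENCE = "`" * 3
CELL_RE = re.compile(
    rf"^{FENCE}\{{code-cell\}}[ \t]*(?P<lang>\S*)\n(?P<body>.*?)^{FENCE}\s*$",
    re.MULTILINE | re.DOTALL,
)

def split_cells(text):
    """Split a text document into (cell_type, source) pairs."""
    cells, last = [], 0
    for match in CELL_RE.finditer(text):
        md = text[last:match.start()].strip()
        if md:
            cells.append(("markdown", md))
        cells.append(("code", match.group("body").rstrip()))
        last = match.end()
    tail = text[last:].strip()
    if tail:
        cells.append(("markdown", tail))
    return cells

doc = f"""# A tiny example

Some prose before the code.

{FENCE}{{code-cell}} python
print("hello")
{FENCE}

More prose after the code.
"""

for cell_type, source in split_cells(doc):
    print(cell_type, "|", source.splitlines()[0])
```

Cell-level metadata (tags, cell type, etc.) would then hang off whatever options syntax the marker allows, which is exactly the design choice under discussion here.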

@jlperla I'd love to hear your thoughts on why notebooks shouldn't be an intermediate format. It's an approach that I've taken with Jupyter Book, and think it has worked quite nicely. I'll take a look at the links you've shared and think a bit about the other projects you've mentioned.

jstac commented 4 years ago

I think @mmcky has a similar idea with our Jupinx system: Jupyter notebooks are one of many outputs and not an intermediate form. Execution of code stored in text source takes place at a higher level, and the results are injected into the ipynbs. I'll leave it to @mmcky to correct me or elaborate.

jstac commented 4 years ago

From the end-user perspective, here are a few comments:

First, we want to support user A, who is not a particularly strong programmer and likes WYSIWYG but has important things to say. For example, he/she knows a bit of pandas or R and has 10 Jupyter notebooks on NLP, say. He/she wants to use our tools to turn them into a book without hassle or fuss. To my mind, this should be as simple as

  1. putting them in a common directory
  2. entering their file names into a plain text file associated with a table of contents --- all they have to do is list the file names in the order they want them to appear.
  3. pointing a binary at this directory and running it.

(Personally I would like our project to also help them with web-based publishing of static content as the fourth step. User A will definitely appreciate this.)

From there on, they should still have the option to maintain their "book" by editing directly in the notebooks.

At the same time, we have user B (people like me) who is handling many lectures with a distributed team and cannot function without a text based representation of the underlying notebooks. So we need something like RMarkdown, possibly with extra bells and whistles.

To support users A and B from within one system in a clean way, we need a slick two-way mapping between text source files and ipynb files (minus their outputs, of course).

Now

So question 1 is, is it actually possible to set up a one-to-one mapping between a variant of markdown and ipynb files that feature eq numbers, fig numbers, cites, cross-refs, etc?

Surely yes, right?

If yes, then

jlperla commented 4 years ago

One thing to note is that IMO we are talking about two separate things here

Yes, but the two are related because they need to have the same feature set.

Step 1: create the markup specification which represents all of the variations required for the target outputs.

In the process of creating the QuantEcon lectures, they have come across a large number of these, and there are plenty more on the horizon. Keep in mind that almost all of these came out of necessity in real-world typesetting, rather than being an elaborate set of hypothetical features. I think that Jupyter Book does a lot less in terms of features, so you may not have run into these things.

Now, take all of that stuff and say you create the markup specification to represent all of these things... then

Step 2: come up with the intermediate format which represents the superset of these things at parse time.

The issue with using Jupyter as the intermediate format is that it only represents a subset of the features. In fact, it is close to the smallest subset of features! Jupyter doesn't even have equation references, for example! Pandoc has all sorts of automatic conversion utilities, but the goal of pandoc is very different in that it is trying to get a "good enough" conversion for anything rather than a high-quality output.

Of course, since jupyter is JSON you could theoretically shove a bunch of stuff around the jupyter specification, but at that point you are designing a new intermediate file format that is only superficially connected to jupyter.

Step 3: converting from an internal format to the various outputs.

@mmcky can tell me if I am wrong, but the way that Jupinx does it right now, it effectively does use Jupyter as an intermediate format.

@mmcky may disagree with me entirely, but I think anytime there is preprocessing and post-processing of this sort, it suggests a bottleneck in the underlying intermediate form... and that bottleneck is the Jupyter format (which is required for nbconvert execution)! Why not just generate the outputs directly from the intermediate form (as others do) and have variations on Jupyter notebooks generated as just one type of output?

Take all of this with a grain of salt, of course; this is just me putting on my old software-engineer hat. Using Jupyter as an intermediate format in the generation process just doesn't smell right. But I have been wrong many times before.

jlperla commented 4 years ago

To support users A and B from within one system in a clean way, we need a slick two-way mapping between text source files and ipynb files (minus their outputs, of course).

You could have a one-directional jupyter -> markdown converter no problem, of course.

Are you sure bidirectional is feasible? Again, it comes down to Jupyter supporting only a limited subset of the functionality, which means a bijection isn't possible. And as soon as there isn't a bijection, people change one side, start editing the ipynb, and are then confused when things are lost going backwards.

For simple usability, RST is too scary for casual users, but markdown with embedded code blocks is very clean... and with proper tooling in Atom/vscode, it can even be easy to edit wysiwyg style (e.g. try the Weave extension for Atom, or Hydrogen for executing cells).

As an example, see https://github.com/JunoLab/Weave.jl/blob/master/examples/FIR_design.jmd for the obvious markdown based representation for Weave. Basically "standard" markdown with a YAML header where the enhancements are only needed when you get fancy.

I am not sure that a format along these lines is scarier to edit than Jupyter directly. It might even be easier to work with, and helps discipline people in using a format that works with git. I believe that the R approaches are similar in getting people to write in markdown directly.

jstac commented 4 years ago

Thanks for all your thoughts @jlperla.

One small comment is that we might not actually want to support all features of all output formats, if we think the cost is too high. There might be a few cases where, say, LaTeX can do XYZ but we guide the user away from that way of doing things because it's going to bloat the text specification or complicate the internals --- and there are reasonable alternatives.

I can't think of particular examples now, but let's leave ourselves open to the possibility of imposing some restrictions if we think the trade-off is worthwhile.

jlperla commented 4 years ago

One small comment is that we might not actually want to support all features of all output formats,

Yes, completely. Which is another good reason to have a flexible intermediate representation where things can be ignored (e.g. you don't need to have unit tests for all formats, and various sorts of html formatting blocks would be ignored in generating jupyter notebooks, for example).

But I think it also raises an issue of whether there are current features that are not worth it for any output formats because they would necessitate getting them working for all formats. The one that jumps to my mind is the ability to have links to subsections in different existing documents. It is nice, and Sphinx makes it work well, but I think we could live without it.

jstac commented 4 years ago

Ah, no links to subsections across files! I don't know if I can live without those :-)

We have to be careful or @mmcky will jump in and tell us to forget about markdown and just use rst :-)

I mean, bookdown has these kinds of cross-references, doesn't it?

jstac commented 4 years ago

Regarding @jlperla's question about feasibility of the bi-directional mapping between souped-up md files and ipynb files, can anyone come up with a counterexample?

The question isn't actually well-posed, since we haven't specified what the souped up markdown spec is. But I guess this kind of thing is a problem: We use some sort of markdown syntax in the source file that maps to numbered equations in the notebooks. The numbered equations are achieved by injecting html into the md cells in the notebooks. But can the html be mapped back to the original syntax in the source file?

So what do we do? Impose restrictions on the kinds of ipynb files we can handle? Or change the way that Jupyter Notebooks parses the md cells to avoid injecting html in the first place?

jstac commented 4 years ago

Regarding your intermediate representation comments above, @jlperla, I think @mmcky perceives the same problems and wants to make changes to avoid them, with some open heart surgery on jupinx over the Australian summer.

jlperla commented 4 years ago

But can the html be mapped back to the original syntax in the source file?

I am sure you could do it, in theory, with enough suffering. But that is incredibly fragile because Jupyter is an interactive output format rather than an intermediate markup. And what about features not supported in Jupyter? Those get lost going in one direction... In my mind, all of the escaping madness (e.g. $ on various platforms) is a good enough reason not to try to go back and forth bi-directionally.

Is it worth it just to edit in Jupyter? Try using Hydrogen+Atom with even the existing RST, or Weave.jl+Atom with shift-enter and the Julia integration to execute cells, and see if you like it. I strongly prefer it to working in Jupyter for writing permanent material. Jupyter is a great interactive platform and mechanism for distributing interactive material - but it is confusing as an editor for writing material that belongs in a git repository.

And just to be clear, I am not suggesting we use Weave - just that Weave.jl and Pweave provide an example of an alternative editing platform and execution environment that works pretty well for writing single-notebook lectures and has a flexible Mustache-based template setup.

mmcky commented 4 years ago

Hey @jstac @jlperla I only have my mobile on me, so I'll chime in on this thread once I get home later today.

mmcky commented 4 years ago

There are a lot of things to comment on here. I plan to put together some diagrams with proposed linkages between these different layers, as I think that is a good way to represent a lot of this discussion. I will add some comments in the order they appeared above:

minted for LaTeX (@jlperla)

I agree -- it makes for much easier-to-read LaTeX files (as an output), which I think is a good goal. It should certainly be an option (if not the default choice).

What markdown structure we wish to use to denote "cells" @choldgraf

This is a key design choice. I see some approaches use a comment-style structure to denote information to the interpreter about structure (within the markdown documents). I don't love that choice - but I think this will form a key part of the initial design discussion and effort around the format choice. I found the hybrid approach used by ipypublish really interesting, as it essentially leverages the best of both worlds by adopting an internal node-type approach as opposed to an entire-document-type approach.

notebooks as an intermediate format

This is indeed how Jupinx currently works. It uses Sphinx to convert from rst to ipynb, and then the builder can execute, convert, and generally copy files etc. to make websites and pdf files from the generated ipynb (executed or not). The notebook format does allow cell- and document-level metadata, which has been useful to support passthrough information for theme-supported elements such as code-collapse cells. These metadata tags are used by the relevant html or pdf converter. The main issue I have found is that we produce 4 different sets of notebooks:

  1. notebooks that support all the features for html generation
  2. download notebooks that support remote images and figures
  3. coverage notebooks that allow for code-execution testing and report generation (and include additional test code blocks)
  4. pdf notebooks for rendering pdf pages

This makes compilation slow as there is a lot of redundant code execution. So it basically works by

RST -> SPHINX -> IPYNB SETS -> EXECUTION -> CONVERSION -> HTML, PDF, ...
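
For context, the EXECUTION step here goes through nbconvert; per-notebook execution with nbconvert's ExecutePreprocessor looks roughly like this (file names are placeholders):

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Load a generated notebook and execute it in place with a fresh kernel.
nb = nbformat.read("lecture.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})

# Write the executed notebook back out; conversion to HTML/PDF happens afterwards.
nbformat.write(nb, "lecture.executed.ipynb")
```

Because each of the notebook sets above is executed separately, the same code runs several times, which is where the redundancy comes from.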

Our next iteration of Jupinx will explore whether we can move code execution up a level into Sphinx as a transform (one of the four main Sphinx execution stages) and build an execution engine using dask at the code-block level. If we can move execution up a level, we should be able to re-use outputs for the different targets (html, pdf) and leverage direct translation using the HTML and PDF writers directly from the Sphinx doctree. The thinking is we would do this via a Jupyter kernel and pass code blocks through ZMQ etc. I think this would be a good design approach.

RST -> SPHINX + EXECUTION -> IPYNB, HTML, PDF ...
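
A rough sketch of the code-block-level execution idea, driving a kernel directly with jupyter_client (the kernel name and code blocks are placeholders, and the dask scheduling layer is omitted):

```python
from jupyter_client.manager import start_new_kernel

# Start one kernel and stream each code block through it over ZMQ,
# collecting the outputs so they can be re-used by every target (HTML, PDF, ipynb).
km, kc = start_new_kernel(kernel_name="python3")

code_blocks = ["x = 1 + 1", "print(x)"]  # would come from the Sphinx doctree
outputs = []

try:
    for code in code_blocks:
        block_outputs = []
        kc.execute_interactive(code, output_hook=block_outputs.append)
        outputs.append(block_outputs)
finally:
    kc.stop_channels()
    km.shutdown_kernel()

# Each entry in `outputs` is a list of raw IOPub messages (stream, execute_result, ...).
print(outputs[1])
```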

The alternative would be to build a tool that executes collections of Jupyter notebooks, which could be designed for execution testing and error reporting but could also have cached execution. But it might be more difficult to track common inputs when the notebooks are already written by Sphinx, so the cache would act on each pipeline instead (which would be a bit less efficient) but could be a useful tool for notebook collections.
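
A sketch of the caching idea: key each notebook's execution on a hash of its code cells and skip re-execution when nothing has changed. The cache layout and paths here are made up purely for illustration:

```python
import hashlib
import json
from pathlib import Path

import nbformat

CACHE_DIR = Path(".execution_cache")

def code_hash(nb):
    """Hash only the code-cell sources, so markdown edits don't invalidate the cache."""
    sources = [cell.source for cell in nb.cells if cell.cell_type == "code"]
    return hashlib.sha256(json.dumps(sources).encode("utf-8")).hexdigest()

def needs_execution(path):
    nb = nbformat.read(path, as_version=4)
    stamp = CACHE_DIR / (Path(path).stem + ".sha256")
    if stamp.exists() and stamp.read_text() == code_hash(nb):
        return False  # cached outputs are still valid
    return True

def record_execution(path):
    nb = nbformat.read(path, as_version=4)
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / (Path(path).stem + ".sha256")).write_text(code_hash(nb))
```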

feasibility of the bi-directional mapping (Text Representation, IPYNB) (@jstac, @jlperla)

I think this is very important and feasible. It won't work with basic markdown, but it should with an extended version of markdown (hopefully without creating our own). Bi-directionality will allow for much easier editing and circular flow between formats. It also allows for text-based and notebook-based inputs. I think the way forward here is to start building a one-to-one specification table of all required features, with an emphasis on adopting as much of an existing markup as possible (i.e. RMarkdown or the ipypublish mixed approach).

jlperla commented 4 years ago

Bi-directionality will allow for much easier editing and circular flow between formats. It also allows for text-based and notebook-based inputs.

I will push back (for the last time!) on this being architecturally feasible (without significant, painful investment in something fragile), but also on bi-directional being a good goal in general. There are other ways to support entry-level users.

The technical issue is the fragility of the round trip

Consider using advanced features of the markup (e.g. labeling of equations), generating the notebook, and then editing the notebook - moving a cell around, editing an equation, etc. Even if the notebook only lets you do a simple subset of features, you need to deal with the round trip of the full model, including every feature now and in the future! It will radically change the design - for little benefit in my mind.

What if they accidentally delete or modify a hidden JSON cell that includes crucial metadata? Or if it is in generated HTML for displaying some feature in Jupyter that is cleaner in HTML/PDF. What if they copy/paste a cell and it doesn't bring around hidden JSON? What if they edit the generated HTML you are using for notebook layouts? What if they move a footnote anchor? etc. I do not see how that wouldn't be incredibly fragile at best, and suck up a huge portion of your engineering, testing, and support resources at worst.

Not to mention, you then need to write a specification for how the underlying markdown/etc. is represented in Jupyter (even for things that don't make sense in Jupyter!) or else a round trip wouldn't be possible. That takes time to specify, test, and debug, and you get no features out of it. Dealing with all of the escaping differences alone could be a pain - especially if it is multi-lingual.

But more importantly, what is the benefit?

For users who just want to generate a few HTML or PDF files from a notebook, that export functionality is already there in JupyterLab out of the box. That is not a use case.

For the slightly more advanced use cases, RMarkdown/knitr/Weave/etc. have proven it is well within a low-tech user's capacity to edit a simple .md file for generating content.

What is the alternative?

Consider the following (without any roundtrip ipynb!):

I think the way forward here is to start building a one-to-one specification table of all required features, with an emphasis on adopting as much of an existing markup as possible (i.e. RMarkdown or the ipypublish mixed approach).

Completely agree. I will post a strawman approach later. But one approach is to follow the Julia specifications as much as possible, which are already about as consistent with R/knitr as possible. In particular

These seem to be about as consistent with RMarkdown as possible (especially in the code-chunking options), but there is a lot missing in RMarkdown for multi-page references etc.

This is my last unsolicited comment

Unless you ask me for direct advice, this is my last comment on this topic. Happy to discuss the details on slack or review docs if you wish.

But my suggestion is to specify the features you need, and the markdown variations that you wish to use. If you start thinking about ipynb as an intermediate format, or bidirectional ipynb as being an essential feature, it will distort your design choices very easily.