tonyfast opened 1 year ago
i'm worried about merging this one when it does land. i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.
this parser is super strict at the notebook and cell levels. after that the parser is free to parse, but it will only pick up the source parts at the moment.
i'd like to harden it up with more tests. some questions i have are below.
> i'm worried about merging this one when it does land.
If not architecturally impossible, a common pattern when making something stricter would be to build, test, and ship both parsers: initially make the strict one opt-in, then the default, and then, maybe, deprecate the old one.

The parsers could be registered with `entry_points`, which would also allow a downstream to ship its own parser, prebuilt or otherwise. Indeed, having `lark` in the loop would be very interesting for the interactive case.

In addition to clear docs on how that works in e.g. `README.md` (and probably a section in `CHANGELOG.md`), the API might then look like:
```python
with Notebook():
    import a_somehow_bad_notebook
    # UserWarning:
    # a_somehow_bad_notebook failed strict parsing on line {line}, column {column}
    #
    # {lark error}
    #
    # The loose parser was used, but {features} will not be available.
    # Silence this warning with:
    #
    #     with Notebook(parser='loose'):
    #
    # see https://github.com/deathbeds/importnb/issues/{number}
```
Then offer:
with Notebook(parser="loose"):
import a_somehow_bad_notebook
And allow opting in to hard fails:
with Notebook(parser="strict"):
import a_somehow_bad_notebook
# ImportnbParseError: a_somehow_bad_notebook.ipynb failed to parse on line {line}, column {line}
# {lark error}
Where `ImportnbParseError` is a subclass of `ImportError` (or something).
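A minimal sketch of that exception, assuming the message fields used above; subclassing `ImportError` keeps existing `try`/`except ImportError` handling intact:

```python
class ImportnbParseError(ImportError):
    """Raised when strict parsing of a notebook fails."""

    def __init__(self, name, line, column, cause):
        super().__init__(
            f"{name} failed to parse on line {line}, column {column}\n\n{cause}"
        )
```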
> i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.
`hypothesis` is always a win.
> does introducing hypothesis change the tune when considering parallelism in the tests?
Yes, `hypothesis`-based suites can benefit from parallelism with `pytest-xdist`, if run together with other suites.

In CI, I'd probably treat it as a separate set of tests, either in a different folder or with markers, which wouldn't even run if the fast tests fail, and likely just as part of the fastest job (e.g. only on `linux/3.12`), and also enable a `--cov-fail-under` threshold.

But the above is irrelevant until the existing tests all run cleanly, ideally without doing anything to disk, `sys.path` tricks, etc.
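For concreteness, a hedged sketch of what a "hypothesis schema" suite could look like, using the `hypothesis-jsonschema` package; the `slow` marker name and the `parse` helper are illustrative assumptions, not importnb API:

```python
import json

import pytest
from hypothesis import given
from hypothesis_jsonschema import from_schema

# a tiny stand-in schema; a real suite would load the full nbformat v4 schema
CELL_SCHEMA = {
    "type": "object",
    "properties": {
        "cell_type": {"const": "markdown"},
        "id": {"type": "string"},
        "metadata": {"type": "object"},
        "source": {"type": "string"},
    },
    "required": ["cell_type", "metadata", "source"],
    "additionalProperties": False,
}


@pytest.mark.slow  # the fast job can deselect these with `pytest -m "not slow"`
@given(from_schema(CELL_SCHEMA))
def test_strict_parser_accepts_schema_valid_cells(cell):
    # `parse` is a hypothetical handle on the strict parser under test
    parse(json.dumps({"cells": [cell]}))  # should not raise
```

Under `pytest-xdist`, the slow set could then run in its own CI job, e.g. `pytest -m slow -n auto`.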
yea i love this idea. i think i can manage it in this PR. we'll have to build two different standalone parsers to ship for a few months then sunset the more permissive one.
ultimately, i think this should work out cause both the notebook and cell levels define `additionalProperties: False`. as long as all the properties are captured in the grammar we should be good. there is a case where folks synthetically generate notebook-like structure or introduce future keys like `$schema`, `@graph`, or `@context`. in these cases, we'd need a permissive flag turned on if and only if this synthetic data is stored on disk. in theory, the stricter grammar could host both a strict and permissive version. in the permissive version we allow for additionalProperties by supporting generic keys.
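For example, illustrative synthetic data of the kind meant here (the keys and URL are invented for the example):

```python
# with additionalProperties: False, the strict grammar must reject the
# "$schema" and "@context" keys, so reading this from disk would need the
# permissive flag
synthetic = {
    "$schema": "https://example.com/notebook-extension.json",
    "@context": {"@vocab": "https://schema.org/"},
    "cells": [],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
```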
> two different standalone parsers to ship
Yeah, this doesn't seem like a huge hit for avoiding breaking things while getting new features.
> for a few months then sunset the more permissive one.
Right, as long as there is clear messaging from the as-installed software that this is going to happen, and a timeline/trigger for when (e.g. the first release that supports `3.13`). The ad-hoc CalVer scheme can make it hard to plan for these things in any way other than "after this day"... but... can't go back now without a package rename or creating huge problems downstream.
> generate notebook like structure
Sure, as long as we're not losing features for the 99% use case of `nbformat`-compliant structures. And further: by formalizing and shipping the `.g` files, it would be possible to make extensible grammars in interesting ways.

So if some use case requires going wild, it could subclass the machinery (even interactively), by overloading the right level of the grammar/decoder/parser/transformer/whatever (these need some `.mmd` to explain them, perhaps) and passing that instead of an `entry_point` name, without having to mess about with `entry_points` or the standalone tooling generator.
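As a hedged sketch of "passing that instead"; the `parser=` keyword accepting an object and the `PermissiveSource` class are assumptions for illustration, not importnb's current API:

```python
from importnb import Notebook

class PermissiveSource:
    """Stand-in for an overloaded level of the grammar/decoder/transformer stack."""

    def parse(self, text):
        # tolerate generic keys the strict grammar would reject
        ...

with Notebook(parser=PermissiveSource()):
    import a_somehow_bad_notebook
```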
If the anticipated needs are very predictable, a more constrained API, e.g. `Notebook(additional_properties={"$schema": "object"})`, could provide for a few simple cases, provided `importnb[lark]` is installed.

A more formalized system could package and register its parser by normal means, and that path should be kept if this is added.
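A rough sketch of that "normal means" path, assuming a hypothetical `importnb.parsers` entry point group:

```python
# a downstream package would register in its pyproject.toml, e.g.:
#
#   [project.entry-points."importnb.parsers"]
#   permissive = "my_package.parsers:PermissiveParser"

from importlib.metadata import entry_points  # group= selection needs Python 3.10+

parsers = {ep.name: ep.load() for ep in entry_points(group="importnb.parsers")}
Parser = parsers["permissive"]
```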
> introduce future keys
Changes like that happen slowly, but they're not worth losing features over. We could add an `nbformat` canary build excursion that installs from upstream's `main`.
this pull request modifies the json parser to use explicit properties from the notebook format to constrain the parser.
the top-level notebook keys and cell keys are known ahead of time because of the schema that notebooks conform to. these changes hard code the notebook and cell properties into the parser.
this change will NOT work for arbitrary data formats, only those conforming to subsets of the specified schema; otherwise the parser raises errors.
this change streamlines the transformation logic.
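For a sense of what "known ahead of time" means, the nbformat v4 schema fixes the top-level and cell keys to small closed sets, roughly (illustrative of the hard-coding approach, not this PR's actual tables):

```python
# additionalProperties: False forbids anything outside these sets
NOTEBOOK_KEYS = {"cells", "metadata", "nbformat", "nbformat_minor"}
CODE_CELL_KEYS = {"cell_type", "execution_count", "id", "metadata", "outputs", "source"}
MARKDOWN_CELL_KEYS = {"attachments", "cell_type", "id", "metadata", "source"}
```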