executablebooks / MyST-NB

Parse and execute ipynb files in Sphinx
https://myst-nb.readthedocs.io
BSD 3-Clause "New" or "Revised" License
200 stars 80 forks source link

With NumPy 2.0, `text/plain` + float formatting results in `SphinxWarning` #610

Closed bryanwweber closed 4 days ago

bryanwweber commented 2 weeks ago

Describe the bug

context When I do glue a NumPy float into a page with the {glue:text}`variable:<float format>`, I get a SphinxWarning:

sphinx.errors.SphinxWarning: <filename>:257:Failed to format text/plain data: could not convert string to float: 'np.float64(0.9643970836113095)' [mystnb.glue]

expectation I expected no warning to occur.

problem This is a problem for people doing formatting floating point NumPy objects because if they turn warnings to errors, their pages won't build.

The cause is described here: https://numpy.org/devdocs/release/2.0.0-notes.html#representation-of-numpy-scalars-changed

Basically, the repr of NumPy types changed to include the NumPy type. As such, the IPython display formatter outputs the literal 'numpy.float64(1.2345)' which obviously can't be formatted as a float. I'm not sure if this should be filed here or with IPython... I landed on here because it affects the glue functionality in particular and because it seems like the display formatter for NumPy types in IPython is probably doing the right thing.

The immediate workaround is as suggested in the NumPy release notes, setting np.set_printoptions(legacy="1.25") fixes the warning.

Reproduce the bug

  1. Install NumPy 2.0
  2. Glue and print a variable:
    ```{code-cell}
    from myst_nb import glue
    import numpy as np
    a = np.float64(1.234)
    glue("var-a", a)

    {glue:text}:var-a:.2F

  3. Build a book/the page and see the warning/error if -W is on.

List your environment

❯ jupyter-book --version
Jupyter Book      : 1.0.0
External ToC      : 1.0.1
MyST-Parser       : 2.0.0
MyST-NB           : 1.1.0
Sphinx Book Theme : 1.1.3
Jupyter-Cache     : 1.0.0
NbClient          : 0.10.0

and most importantly, NumPy >=2.0

bryanwweber commented 2 weeks ago

Also want to recognize that as of this comment, NumPy 2.0 was released approximately 5 hours ago, so I'm way ahead of the curve. Happy to help contribute a fix as needed!

flying-sheep commented 4 days ago

Also want to recognize that as of this comment, NumPy 2.0 was released approximately 5 hours ago, so I'm way ahead of the curve. Happy to help contribute a fix as needed!

numpy has pre-releases, so it’s very possible to catch and fix issues with it before the official release :stuck_out_tongue:

agoose77 commented 4 days ago

I don't think this is something that we should try to solve. glue(name, object) is assuming that we can use eval for the text/plain formatting of floats. This is true for built-in floats, but no-longer true (without importing NumPy into the execution context) for NumPy floats.

I don't think we should special-case NumPy here, instead the user should either set the print options, or store the glue object as a Python float. Glue should be given something that's serialisable.

I'm not 100% firm on this personally; I can see merit in a less cumbersome UX. But, with my numeric-hat on; NumPy floats are not Python floats, and this distinction matters.

bryanwweber commented 3 days ago

But, with my numeric-hat on; NumPy floats are not Python floats, and this distinction matters.

Agreed in all the contexts where we care what types are being used! But, in this specific case of text formatting for prose, I'd argue the distinction doesn't matter; as a user, I simply want something that I'm using as a number to show up as a specific format.

I don't think we should special-case NumPy here, instead the user should either set the print options, or store the glue object as a Python float... I can see merit in a less cumbersome UX.

I understand that special-casing NumPy here is quite fraught. Where does it stop, should we also do this for SciPy objects? etc. I personally think NumPy floats are common enough that having a special case will be a huge benefit to avoid questions - and especially inscrutable errors/warnings! That said, although I think a fix is warranted, I don't whether it belongs here or in IPython or even in NumPy. Ideally, NumPy objects would know how to format themselves; on the other hand, glue is using the text/plain mimetype here, which should also be displayed in the REPL where clearly indicating numpy.float is important. Adding a new mimetype might be an option, but then that would have to go into IPython and support added here in MyST-NB, and it would be even more confusing that the current situation. Another option would be to avoid serializing to text before actually pasting, but that's a total reversal of how the gluing system works now where all the mimetypes available for an object are stored as metadata on a cell.

The only reasonable solution I can see in the short term is to do something like (this is pseudo-code):

def glue(...):
    try:
        import numpy
    except ImportError:
        numpy = None
    if numpy is not None and isinstance(variable, numpy.float):
        variable = float(variable)
    mimebundle, metadata = IPython.core.formatters.format_display_data(variable)
    ...

but then that takes away options from end users as far as how they want to do the conversion.

So, after all that, I've convinced myself that the status quo is the best option, with the exception that I think this behavior should be documented (I had to find it in the NumPy release notes!). I'll write up a PR for a doc improvement, probably over the weekend or early next week if no one beats me to it :smile:

agoose77 commented 3 days ago

Yes, I went back and forth on this too. Ultimately, doing any kind of float conversion is "magic" and changes what is actually "stored". We should do a docs pass, I think.

flying-sheep commented 3 days ago

glue:text seems to be fundamentally flawed if ends up taking a round-trip through repr before applying formatting specifiers.

If you want to serialize things for later, you need to use a serialization format, not repr, which is documented to “often” be able to work as one, but not reliably:

Return a string containing a printable representation of an object. For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval();

The only standard Python serialization format is pickle, which would mean that you’d have to have the same packages installed in the environment that unpickles the glued values as there were when pickling them.

Alternatively you could make the serializer/deserializer configurable, and provide a default that can handle basic Python types and throws an error when it can’t handle something (e.g. json.dumps/loads).

agoose77 commented 3 days ago

@flying-sheep it would seem to me that the intention of glue has been muddied. At its core, it provides a way to name MIME-bundles. The glue:text functionality is an addition which implies interpretation of the contents, i.e. glue:text assumes the content is a number if you pass in the proper format specifiers. To do that properly, one should perform the MIME-bundle generation at document-rendering time, rather than at execution time. But that would introduce peculiar bugs; the execution environment would no longer have any control over the representation of the object, etc., and moreover, it would presume a specific kernel type.

For now, I think the best path is what we have. Long-term, things like the {eval} role are a better way to do this - shifting the specific formatting work to the kernel.

bryanwweber commented 2 days ago

I was just looking at the formatting code:

https://github.com/executablebooks/MyST-NB/blob/6ce30cd41fa82543e0f315ac8bbee82669b0cc82/myst_nb/core/variables.py#L202-L218

I wonder if it makes sense to check whether the string numpy or np exists in the text and extract the float from it? Something like:

def format_plain_text(text: str, fmt_spec: str) -> str:
    ...
    elif type_char in ("e", "E", "f", "F", "g", "G", "n", "%"):
        import re
        is_np = re.match(text, r"(np)|(numpy)\.float\d+\((\d+\.\d*)\))
        if is_np is not None:
             value = is_np.group(1)
        value = float(value)
    return format(value, fmt_spec)

Other than the problem with special-casing NumPy, the only problems I see here are numpy not imported as np or numpy and that the precision in the repr is limited. This would at least resolve most of the most common cases, I expect...