Floating point issue with NaN values

danilomendesdias commented 4 years ago

Hi, there. Thanks for providing this nice library!

I found myself in a problem here that is related to the following lines:

from strictyaml import Map, Float, load

schema = Map(
    {
        "number_one": Float(),
        "not_a_number": Float(),
    }
)

yaml_string = (
    "number_one: 1.0\n"
    "not_a_number: .nan"
)

load(yaml_string=yaml_string, schema=schema)

and I got this:

strictyaml.exceptions.YAMLValidationError: when expecting a float
found arbitrary text
  in "<unicode string>", line 2, column 1:
    not_a_number: '.nan'
    ^ (line: 2)

According to https://yaml.org/spec/1.2/spec.html#id2804092, .nan, .inf or -.inf are part of the canonical form of the floating point representation.

(In fact, I would like to do that using dirty_load, as I need a readable representation for my matrices (I'm using Seq(Seq(Float))), but that's another story, as dirty_load seems to be just another way of calling generic_load with allow_flow_style.)

Am I missing something here or is there a workaround for it?

Thank you

crdoconnor commented 4 years ago

Hi! Thanks for using the library. No you're not missing something here. This is a valid use case I hadn't considered.

It is possible to workaround using a custom validator, for example:

from strictyaml import Map, load

class Float(ScalarValidator):
    def validate_scalar(self, chunk):
        return float(chunk.contents)

schema = Map(
    {
        "number_one": Float(),
        "not_a_number": Float(),
    }
)

yaml_string = (
    "number_one: 1.0\n"
    "not_a_number: nan"  # doesn't have a dot at the beginning... wasn't sure if this was intentional?
)

load(yaml_string=yaml_string, schema=schema)

I might change the float library to accept "inf" and "nan", but I'd like to verify first that this is some sort of standard.

danilomendesdias commented 4 years ago

Very nice! In terms of floating point representation, nan and inf follow the IEEE 754 standard if that is what you are looking for. On the yaml side, I don't know much more than what it was shown in the previous link I sent.

For now, I'll use your suggestion but if I can help with that by sending a PR, just let me know.

Thank you again for answering so quickly.

crdoconnor commented 4 years ago

IEEE 754 sounds good. I seem to have a hard time trouble finding a grammar, though. I'm a bit wary that python and YAML 1.2 might be doing their own thing (e.g. python parses "InfINITy" for some bizarre reason), and I'd rather use something which is a bit more standard.

I'd love it if you contributed a pull request! Thank you

danilomendesdias commented 4 years ago

I only noticed now that you mentioned the possibility of my .nan being a typo, but in fact it was intentional. I think the yaml specification is a little confusing, but it is there: https://yaml.org/spec/1.2/spec.html.

crdoconnor commented 4 years ago

Your PR basically looks perfect (thanks for accommodating my custom testing framework. it's appreciated!), but I'm still somewhat uncomfortable about this ".inf" thing. I'm sort of ok with accepting ".inf" as an input purely for compatibility with YAML, but I'd rather the canonical representation was something that was more of a widely used standard and not a quirk of YAML 1.2 (i.e. serialization outputs something else and the parser will accept both that and "nan", for example).

I checked the python json parser and it seems it accepts "NaN", "-Infinity" and "Infinity" although I'm suspicious that might be a python "add on" and not actually part of the specification.

crdoconnor commented 4 years ago

It seems that the IEEE 754 standard doesn't really have an opinion on text representation of inf and nan (which is why everybody seems to do it differently).

I'm leaning towards thinking that it should accept ".inf" (for YAML 1.2), "inf" (for numpy/python), "Infinity" in the parser and should spit out "inf". What do you think? (assuming you have an opinion at all :))

danilomendesdias commented 4 years ago

My [personal] opinion is: .nan instead of just nan is horrible. But before strictyaml, I was using pyaml and I they accept it and I also found this specification. That's why I started to think that's the "actual standard". Supporting both .nan and nan seems to be a good strategy (I saw some people complaining in pyaml repo about nan being parsed as string).

I think the sensitive decision is more related to the output. I could suggest a way of making that configurable, but sometimes it is just better to wait and see if someone else is really interested in the output specification instead of implementing each possible workflow.

[I used nan as example, but it also applies to infinity representations]

crdoconnor commented 4 years ago

Yes, I agree it's disgusting :) I'm almost inclined to say that it shouldn't even accept ".nan", but I know if we do that then people who have YAML 1.2 documents they want to parse will complain.

I don't think it ought to be configurable on the output. If somebody really cares, IMHO they should write their own serializer/deserializer. I think on balance I'm happy with "nan" and "inf" and "-inf" as an output. There's no real standard but this seems to be what numpy/python view as legitimate.

danilomendesdias commented 4 years ago

Yes, if we lock the inputs to nan, it will be just a matter of time to have someone opening an issue about that here.

Just to summarize, can I update the code to follow this?

NaN

acceptable inputs: .nan, .NaN, .NAN, nan, NaN, NAN
ouput: nan

Positive infinity

acceptable inputs: .inf, .Inf, .INF, inf, Inf, INF, +.inf, +.Inf, +.INF, +inf, +Inf, +INF
ouput: inf

Negative infinity

acceptable inputs: -.inf, -.Inf, -.INF, -inf, -Inf, -INF
output: -inf

crdoconnor commented 4 years ago

Yes, please do. I'll merge and release a new version on pypi once you're done.

Thank you for doing this.

danilomendesdias commented 4 years ago

Ok, nice! I think I can finish that tomorrow, ok?

Thank you for the discussion.

crdoconnor commented 4 years ago

Yep no rush.

On Mon, 10 Aug 2020, 17:03 Danilo Silva, notifications@github.com wrote:

Ok, nice! I think I can finish that tomorrow, ok?

Thank you for the discussion.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/crdoconnor/strictyaml/issues/121#issuecomment-671443585, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABOJKNKYKM73PWXCPKCF7GTSAAK5HANCNFSM4PXZPZ6Q .

shoogle commented 4 years ago

I would say accept only INF and NaN (think this form is least likely to be mistaken for a standard string). StrictYAML aims to be fully roundtripable, which means there should only be one valid representation. It is a restricted subset of YAML, so no need to support every variation. You could check for other variations and print an error saying to use INF or NaN.

danilomendesdias commented 4 years ago

@shoogle I understand your concern with being strict, but this is about following existing standards, right? Is there any standard that mentions INF and NaN as the right ones? It sounds a little arbitrary to me. And don't get me wrong, I really would like an existing one :(

Yes, .nan is weird, but people are using that because these confusing YAML specifications. It is not their fault and that's totally different from having badly formatted files. Besides that, if you can accept 1.000, 1., +1E+0, 1.e000 etc, why could you not accept different things for nan and inf?

shoogle commented 4 years ago

@danilomendesdias, StrictYAML is a new standard, hence it can make up its own rules. People who use .nan are not using StrictYAML; they are using YAML.

However, you make a valid point about 1.000, 1., +1E+0, 1.e000, etc. Maybe we should not allow those either! :wink:

Seriously though, I suppose some variation is unavoidable, at least for floating point numbers. I guess it's OK to allow common variants as long as the output is always consistent. My vote is to make the output INF and NaN. Sure this is arbitrary, but no more so than any other option.

Remember though that the allowed input has to be specified explicitly for the sake of people writing parsers in other languages. I think it would be simpler to allow the following regex:

INF: +?.?[Ii][Nn][Ff] (optional +, followed by optional ., followed by mandatory inf which is case-insensitive.)
NaN: +?.?[Nn][Aa][Nn] (optional ., followed by mandatory nan which is case-insensitive.)

crdoconnor commented 4 years ago

StrictYAML aims to be fully roundtripable, which means there should only be one valid representation.

Actually, come to think of it, it should still roundtrip ok. Put .inf in and change something unrelated in a different place in the document, then you should get .inf out. However, if you're starting afresh on a new document serializing a floating point number you just came up with that's infinity, it'll pick inf.

crdoconnor commented 4 years ago

My vote is to make the output INF and NaN

Is there a reason you want these in particular to be the default versions? (i.e. as_document({"x": float(nan)}).as_yaml() == "x : NaN").

shoogle commented 4 years ago

@crdoconnor, if somebody sees the document on its own (i.e. without the schema) then I think that INF and NaN are more obviously special values, whereas inf and nan might be mistaken for ordinary strings. After all, "nan" is actually a word (it is slang for "grandmother" British English, and an alternative spelling of "naan bread" in American English).

grahamgower commented 3 years ago

The yaml spec says .inf, -.inf and .nan. As bizarre as these are, it seems like a very bad idea to output anything that doesn't conform to the spec. inf is interpreted as a string by other yaml parsers, so it's fairly clear this is likely to cause real problems.

crdoconnor / strictyaml

Floating point issue with NaN values #121