Open danilomendesdias opened 4 years ago
Hi! Thanks for using the library. No you're not missing something here. This is a valid use case I hadn't considered.
It is possible to workaround using a custom validator, for example:
from strictyaml import Map, load
class Float(ScalarValidator):
def validate_scalar(self, chunk):
return float(chunk.contents)
schema = Map(
{
"number_one": Float(),
"not_a_number": Float(),
}
)
yaml_string = (
"number_one: 1.0\n"
"not_a_number: nan" # doesn't have a dot at the beginning... wasn't sure if this was intentional?
)
load(yaml_string=yaml_string, schema=schema)
I might change the float library to accept "inf" and "nan", but I'd like to verify first that this is some sort of standard.
Very nice! In terms of floating point representation, nan
and inf
follow the IEEE 754 standard if that is what you are looking for. On the yaml side, I don't know much more than what it was shown in the previous link I sent.
For now, I'll use your suggestion but if I can help with that by sending a PR, just let me know.
Thank you again for answering so quickly.
IEEE 754 sounds good. I seem to have a hard time trouble finding a grammar, though. I'm a bit wary that python and YAML 1.2 might be doing their own thing (e.g. python parses "InfINITy" for some bizarre reason), and I'd rather use something which is a bit more standard.
I'd love it if you contributed a pull request! Thank you
I only noticed now that you mentioned the possibility of my .nan
being a typo, but in fact it was intentional. I think the yaml specification is a little confusing, but it is there: https://yaml.org/spec/1.2/spec.html.
Your PR basically looks perfect (thanks for accommodating my custom testing framework. it's appreciated!), but I'm still somewhat uncomfortable about this ".inf" thing. I'm sort of ok with accepting ".inf" as an input purely for compatibility with YAML, but I'd rather the canonical representation was something that was more of a widely used standard and not a quirk of YAML 1.2 (i.e. serialization outputs something else and the parser will accept both that and "nan", for example).
I checked the python json parser and it seems it accepts "NaN", "-Infinity" and "Infinity" although I'm suspicious that might be a python "add on" and not actually part of the specification.
It seems that the IEEE 754 standard doesn't really have an opinion on text representation of inf and nan (which is why everybody seems to do it differently).
I'm leaning towards thinking that it should accept ".inf" (for YAML 1.2), "inf" (for numpy/python), "Infinity" in the parser and should spit out "inf". What do you think? (assuming you have an opinion at all :))
My [personal] opinion is: .nan
instead of just nan
is horrible. But before strictyaml, I was using pyaml and I they accept it and I also found this specification. That's why I started to think that's the "actual standard". Supporting both .nan
and nan
seems to be a good strategy (I saw some people complaining in pyaml repo about nan
being parsed as string).
I think the sensitive decision is more related to the output. I could suggest a way of making that configurable, but sometimes it is just better to wait and see if someone else is really interested in the output specification instead of implementing each possible workflow.
[I used nan
as example, but it also applies to infinity representations]
Yes, I agree it's disgusting :) I'm almost inclined to say that it shouldn't even accept ".nan", but I know if we do that then people who have YAML 1.2 documents they want to parse will complain.
I don't think it ought to be configurable on the output. If somebody really cares, IMHO they should write their own serializer/deserializer. I think on balance I'm happy with "nan" and "inf" and "-inf" as an output. There's no real standard but this seems to be what numpy/python view as legitimate.
Yes, if we lock the inputs to nan
, it will be just a matter of time to have someone opening an issue about that here.
Just to summarize, can I update the code to follow this?
NaN
.nan
, .NaN
, .NAN
, nan
, NaN
, NAN
nan
Positive infinity
.inf
, .Inf
, .INF
, inf
, Inf
, INF
, +.inf
, +.Inf
, +.INF
, +inf
, +Inf
, +INF
inf
Negative infinity
-.inf
, -.Inf
, -.INF
, -inf
, -Inf
, -INF
-inf
Yes, please do. I'll merge and release a new version on pypi once you're done.
Thank you for doing this.
Ok, nice! I think I can finish that tomorrow, ok?
Thank you for the discussion.
Yep no rush.
On Mon, 10 Aug 2020, 17:03 Danilo Silva, notifications@github.com wrote:
Ok, nice! I think I can finish that tomorrow, ok?
Thank you for the discussion.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/crdoconnor/strictyaml/issues/121#issuecomment-671443585, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABOJKNKYKM73PWXCPKCF7GTSAAK5HANCNFSM4PXZPZ6Q .
I would say accept only INF
and NaN
(think this form is least likely to be mistaken for a standard string). StrictYAML aims to be fully roundtripable, which means there should only be one valid representation. It is a restricted subset of YAML, so no need to support every variation. You could check for other variations and print an error saying to use INF
or NaN
.
@shoogle I understand your concern with being strict, but this is about following existing standards, right?
Is there any standard that mentions INF
and NaN
as the right ones? It sounds a little arbitrary to me. And don't get me wrong, I really would like an existing one :(
Yes, .nan
is weird, but people are using that because these confusing YAML
specifications. It is not their fault and that's totally different from having badly formatted files. Besides that, if you can accept 1.000
, 1.
, +1E+0
, 1.e000
etc, why could you not accept different things for nan
and inf
?
@danilomendesdias, StrictYAML is a new standard, hence it can make up its own rules. People who use .nan
are not using StrictYAML; they are using YAML.
However, you make a valid point about 1.000
, 1.
, +1E+0
, 1.e000
, etc. Maybe we should not allow those either! :wink:
Seriously though, I suppose some variation is unavoidable, at least for floating point numbers. I guess it's OK to allow common variants as long as the output is always consistent. My vote is to make the output INF
and NaN
. Sure this is arbitrary, but no more so than any other option.
Remember though that the allowed input has to be specified explicitly for the sake of people writing parsers in other languages. I think it would be simpler to allow the following regex:
+?.?[Ii][Nn][Ff]
(optional +
, followed by optional .
, followed by mandatory inf
which is case-insensitive.)+?.?[Nn][Aa][Nn]
(optional .
, followed by mandatory nan
which is case-insensitive.)StrictYAML aims to be fully roundtripable, which means there should only be one valid representation.
Actually, come to think of it, it should still roundtrip ok. Put .inf in and change something unrelated in a different place in the document, then you should get .inf out. However, if you're starting afresh on a new document serializing a floating point number you just came up with that's infinity, it'll pick inf.
My vote is to make the output INF and NaN
Is there a reason you want these in particular to be the default versions? (i.e. as_document({"x": float(nan)}).as_yaml() == "x : NaN").
@crdoconnor, if somebody sees the document on its own (i.e. without the schema) then I think that INF
and NaN
are more obviously special values, whereas inf
and nan
might be mistaken for ordinary strings. After all, "nan" is actually a word (it is slang for "grandmother" British English, and an alternative spelling of "naan bread" in American English).
The yaml spec says .inf
, -.inf
and .nan
. As bizarre as these are, it seems like a very bad idea to output anything that doesn't conform to the spec. inf
is interpreted as a string by other yaml parsers, so it's fairly clear this is likely to cause real problems.
Hi, there. Thanks for providing this nice library!
I found myself in a problem here that is related to the following lines:
and I got this:
According to https://yaml.org/spec/1.2/spec.html#id2804092,
.nan
,.inf
or-.inf
are part of the canonical form of the floating point representation.(In fact, I would like to do that using
dirty_load
, as I need a readable representation for my matrices (I'm usingSeq(Seq(Float))
), but that's another story, asdirty_load
seems to be just another way of callinggeneric_load
withallow_flow_style
.)Am I missing something here or is there a workaround for it?
Thank you