Case-insensitive parsing and validation.

robertodr commented 5 years ago

As pointed out by @stigrj it would be good to not force case-sensitivity upon the users. I see some options:

Do it before we parse. This would mean the programmer using the library has to normalize the case of the input file and the validation template has to adhere to this choice. That is, if normalization is uppercase (lowercase), then validation template use uppercase (lowercase).
As point 1, but we add an option to the api.lex (and expose it in the CLI). Something like --case upper (--case lower)
We let the grammar take care of case normalization, by adding a parse action to the various tokens. In case we change the parsing library (from pyparsing to lark, for example) this will not carry through.
We do case normalization at the validation level. I think this is the most invasive option of all.

stigrj commented 5 years ago

:+1:

bast commented 5 years ago

I agree with 2.

robertodr commented 5 years ago

Noting this here, as it might be helpful: https://docs.python.org/3/library/stdtypes.html#str.casefold

robertodr commented 5 years ago

Does it make sense to you that the case of quoted strings should be preserved? The way Getkw works, you can have quoted and unquoted strings. The latter can only look like valid C user tokens (first character can be an underscore or an alphabetic chars, followed by more underscores and alphanumeric chars) whereas the former can be almost anything (unicode too, I think) Most importantly, quoted string can contain slashes, so that this is the format one should use to specify file paths. I'd expect file paths to always be intended to be case-sensitive, hence my question.

bast commented 5 years ago

I agree that changing case of paths is dangerous. Also until now I did not realize that we wanted to possibly normalize everything - somehow I imagined that we only normalize keys but not values.

robertodr commented 5 years ago

Yeah, it's thorny. Are all of these acceptable?

Functional = B3LYP
functional = b3lyp
FunCtioNAL = b3LYP

bast commented 5 years ago

Every time I saw case insensitivity as user (e.g. Fortran, old OS X file system (?)), I thought: "wow what a terrible idea!" - but I guess we can offer the possibility.

bast commented 5 years ago

How will this work:

datapath1 = /home/user/doNotNormalizeMe
datapath2 = "/home/user/doNotNormalizeMe"

robertodr commented 5 years ago

The current grammar will not be able to parse the former.

bast commented 5 years ago

How about we only offer ignore-casing of keys? But we preserve values?

stigrj commented 5 years ago

How about this

  a_bool = True
  another_bool = false

Is it defined in the grammar what to accept?

robertodr commented 5 years ago

@stigrj All of these are valid for booleans as caseless literals already.

truthy = ["TRUE", "ON", "YES", "Y"]
"""List[str]: List of true-like values."""
falsey = ["FALSE", "OFF", "NO", "N"]
"""List[str]: List of false-like values."""

@bast I think I would still like b3lyp to be as valid as B3LYP though. Regarding floating point numbers, only caseless E is recognized now. D can be arranged too, but I'd rather not encourage using D...

bast commented 5 years ago

Good point about the booleans, this is a frequent papercut. Also scientific notation floats should allow ignore-case "D" and "E".

bast commented 5 years ago

Yes, b3lyp example is convincing. Also I would like to type cc-pvdz. Some half baked ideas:

we allow to ignore ignore case in the template for some sensitive nodes (sounds not elegant)
we add "path" to the set of allowed types and we do not touch "paths"

stigrj commented 5 years ago

Yes, I think the most realistic use case is to allow both

functional = b3lyp
functional = B3LYP

The keywords can always be set to follow a convention, like in MRChem:

sections: capitalized camel-case
keywords: lowercase with underscores

It's not that easy with values

bast commented 5 years ago

About the "NO" interpreted as falsey - we had a nasty bug in a web thingy where we were parsing country codes for Nordic projects and NO (Norway) was interpreted as false :-)

robertodr commented 5 years ago

I think paths are the only case where we would never want to modify the case of the string. Are there any other examples? I feel adding a Path and List[Path] type could be the elegant and easy way out of this:

everything case-sensitive by default (except booleans and floats in scientific notation)
case-sensitive with case normalization selected by the user. Paths must be given as quoted strings, types must be declared accordingly. Case normalization is only done on values that are strings At what level I still have to figure out. Probably needs two passes: one before going into the lexer proper, we read the file and normalize skipping all singly and doubly quoted strings, and another when fixing defaults, to make sure all strings are normalized.

@bast I was bitten by the same two days ago (in the randomized testing of lists of unquoted strings) Should we only allow true and false? I was trying to reproduce the original grammar: https://github.com/dev-cafe/libgetkw/blob/master/Python/getkw.py#L757

robertodr commented 5 years ago

Bonus of having the Path type. Predicates to check that a file already exists are trivial: from pathlib import Path; Path(value).exists()

stigrj commented 5 years ago

How about adding actions to the keywords similar to predicates, so that you can get normalized values where you want them, like functionals and basis sets

keyword:
  - name: functional
    type: str
    predicates:
      - 'len(value) < 20'
    actions:
      - 'value = value.lower()'

A Path type still sounds like a good idea!

robertodr commented 5 years ago

So much power... I like the idea, but I don't think it scales well (I need to do it for all keywords) I need to think about it more.

bast commented 5 years ago

After some more thinking I still like the idea of a Path type. Strings and paths are really distinct types.

As to "NO": I would not miss yes/no. I am still unsure whether I would miss on/off.

bast commented 5 years ago

Another thing that I would not like to case-normalize are checksums/hashes.

robertodr commented 5 years ago

Let's split the str type into two: case-sensitive and case-insensitive. The latter would have the case normalization action on its value "embedded". The Path type is a valuable addition regardless of this discussion. I'll open a new issue (and possibly have a PR ready later this evening)

bast commented 5 years ago

I like the suggestion of having two str types. This makes everything less surprising.

robertodr commented 5 years ago

Name suggestions for the types? istr and sstr?

bast commented 5 years ago

How about str and str_ignorecase?

robertodr commented 5 years ago

I have written some preliminary design docs. Let me know if you agree. See also my comment on #55 as to why, contrary to my initial enthusiam, we do not need/want a path type.

Case-sensitivity is a sensitive issue. It is desirable to have the following two examples be equally valid:

$ functional = b3lyp
$ Functional = B3LYP

We can ensure case-insensitive comparisions by preliminarly normalizing the case of the input. The programmer would have the choice of normalizing to uppercase or lowercase. Hence, the above examples would both decay to either:

$ functional = b3lyp

when normalizing to lowercase, or:

$ FUNCTIONAL = B3LYP

when normalizing to uppercase. Note how normalization happens on the left-hand (section and keyword labels) and the right-hand side (values). This means that the validation specifications in the template.yml file need to conform to whatever the chosen normalization is, both for keyword/section names and possible string defaults. Note that for booleans case normalization is not problematic: their type will be coerced upon reading in, be it through tokenization of the input file or loading of the template.yml.

There are however cases where normalizing the case is not desirable. Two notable examples are filesystem paths and checksums. For these cases, the case from input needs to be respected. Once again, note that the case of default values in the template.yml is completely up to the programmer. No normalization will ever be performed by the library.

We offer two string types:

Case sensitive: str. No case normalization will be performed for values of this type.
Case insensitive: str_ignorecase. For values of this type, the case will be normalized - either upper or lower - according to the request of the programmer. If no case normalization is requested, this string type decays to str.

stigrj commented 5 years ago

Do we really need/want insensitive keywords and sections? Wouldn't it be more sensible to only normalize the str_ignorecase values and keep everything else, instead of normalizing everything and keep the sensitive str? It feels a bit strange that if you want to accept case insensitive strings from the user you have to declare it as str_ignorecase and at the same time make all section and keyword names insensitive.

How about three types:

str: case sensitive
str_lower: value will be normalized to lower
str_upper: value will be normalized to upper

bast commented 5 years ago

Thanks for convincing arguments about path. Also I agree that we should perhaps only allow ignorecase/lower/upper on the values but not the keys. The reason is that later the code wants to fish out the values using well defined keys and perhaps we don't want another degree of freedom there. I like the suggestion with str, str_lower, str_upper - this way the code knows what they get and do not need to normalize inside again just to be sure.

bast commented 5 years ago

Discussion with Stig and Roberto:

keys (in other words keywords and section names) are normalized to lowercase and the dictionary is then accessed with lowercase keys
values: we postpone this decision until we decide about actions later since actions would allow lower() or upper()

dev-cafe / parselglossy

Case-insensitive parsing and validation. #51