ProtixIT / dataclass-binder

Python library to bind TOML data to dataclasses in a type-safe way
MIT License
13 stars 2 forks source link

Add support for Enums #29 #44

Open atomicptr opened 9 months ago

atomicptr commented 9 months ago

I like to use enums in Python so this seemed important to me, I added a few ways to use enums to the tests.

The only thing missing would be being able to use enums as keys for dicts that seemed a lot more complicated so I didn't implement this. Although right now I think only strings and ints can be used for that either way.

I was waiting in line for a coffee for like 30m so I hacked this together on my phone, I hope the tests actually pass 😅

mthuurne commented 9 months ago

Thanks for your contribution!

I'll look into getting CI to run on pull requests as well. You're the first, so this part of the infrastructure hadn't been tested before.

It's good to see elaborate test cases as well. One thing that seems to be missing from the tests though is checking the error handling. For example, does the library produce a useful error message if the TOML value isn't a string, or if it is a string but it doesn't exist as a member of the enum?

Should the mapping from TOML string to Python enum member be done case-sensitive or case-insensitive? From your test case, it seems you expect case-insensitive matching, but I don't see anything in the code that implements that.

mthuurne commented 9 months ago

Should the mapping from TOML string to Python enum member be done case-sensitive or case-insensitive? From your test case, it seems you expect case-insensitive matching, but I don't see anything in the code that implements that.

Actually, it seems you're trying to match enum members by their value ("red", 1) rather than by their name (RED, ONE).

I tend to think of the names as the interface of the enum and the value as something that is mainly used for debugging. In that perspective, it would be better to match on the names instead of values. But I don't know whether everyone uses enums in Python like that.

When enum.auto() is used, the value does not have any meaning, so we should be matching those on name. However, I don't think there is any way to determine whether auto() was used through introspection of the Enum subtype. Therefore I think we have to support mapping by name in any case.

Then the question remains whether we also want to support mapping by value. I'm leaning towards "no", to keep things simple. But if there is a compelling use case that can't be handled with mapping by name, I can be persuaded otherwise.

atomicptr commented 9 months ago

Thanks for the review! I'll look into the requested changes later or tomorrow!

Then the question remains whether we also want to support mapping by value. I'm leaning towards "no", to keep things simple. But if there is a compelling use case that can't be handled with mapping by name, I can be persuaded otherwise.

Mapping by name you mean to always expose the key as a string inside the TOML?

One use case I often face and the primary reason why I wanted enum support in the first place is a toml where you have a pre defined set of values (could probably also be solved with typing.Literal) aka

# this only accepts pre defined values 
# I am only consuming this (or maybe also updating) but this is defined by a different app
state = 3
# same concept
mode = "install"

In this case having dataclass binder use values like

state = "OFF"
mode = "INSTALL"

Is not helping me at least. The structure/values of this toml are pre defined I merely don't want to handle strings or ints on my end.

If I am both consumer and producer of the TOML I am also not sure if using the enum key names is useful because if I'd want that I could just name them this way in my enum values IMHO

mthuurne commented 9 months ago

Then the question remains whether we also want to support mapping by value. I'm leaning towards "no", to keep things simple. But if there is a compelling use case that can't be handled with mapping by name, I can be persuaded otherwise.

Mapping by name you mean to always expose the key as a string inside the TOML?

Yes. Let's modify the color enum from the unit test to make the name and value more distinct:

class Color(Enum):
    RED = "#f00"
    GREEN = "#0f0"
    BLUE = "#00f"

In this case, matching by name would mean mapping the TOML string "RED" (or also "red" if we map case-insensitive) to Color.RED. Matching by value would mean mapping the TOML string "#f00" to Color.RED.

One use case I often face and the primary reason why I wanted enum support in the first place is a toml where you have a pre defined set of values (could probably also be solved with typing.Literal) aka

Adding support for typing.Literal could be useful; I've filed that under #45.

# this only accepts pre defined values 
# I am only consuming this (or maybe also updating) but this is defined by a different app
state = 3
# same concept
mode = "install"

In this case having dataclass binder use values like

state = "OFF"
mode = "INSTALL"

Is not helping me at least. The structure/values of this toml are pre defined I merely don't want to handle strings or ints on my end.

Ah, this is a use case I hadn't considered: if the TOML format is fixed and uses integers or strings that are not valid Python identifiers, then indeed matching by value is useful.

Maybe we could use IntEnum, IntFlag and StrEnum to force matching by value? That would be similar to how those enumeration subtypes operate within Python itself.

For example with this code:

from dataclasses import dataclass
from enum import Enum, IntEnum, auto

class Verbosity(Enum):
    QUIET = auto()
    NORMAL = auto()
    DETAILED = auto()

class IntVerbosity(IntEnum):
    QUIET = 0
    NORMAL = 1
    DETAILED = 2

@dataclass
class Settings:
    verbose_name: Verbosity
    verbose_value: IntVerbosity

The corresponding TOML would look like this:

verbose-name = 'quiet'
verbose-value = 2

Note that StrEnum was introduced in Python 3.11, so it won't be available in 3.10. However, if we support case-insensitive name matching, that would probably be sufficient in the majority of cases. It is the integer case that benefits most from value matching and IntEnum and IntFlag have been around since the enum module got added to the standard library.

If I am both consumer and producer of the TOML I am also not sure if using the enum key names is useful because if I'd want that I could just name them this way in my enum values IMHO

There are scenarios in which name matching is useful:

Since both name matching and value matching are useful, the question becomes how to support both while keeping the behavior of dataclass-binder predictable to the user. Would using IntEnum, IntFlag and StrEnum to select value matching be a workable solution for you?

atomicptr commented 9 months ago

That would indeed work for me and I agree having the ability to use keys seems very useful! I'm just not sure yet as to what is more intuitive for the user because I don't think using the specialized sub classes to differentiate between keys/values is something I'd expect, they also have some other implications in their usage that might be unwanted. The values of auto are also highly predictable (counting upwards starting with 1 or in case of string enums just the key in lower case), probably wouldn't want to use this when you actively care about the values (consuming foreign toml).

I think we should decide for one of these two cases to be the default (key vs value) and have the other available through either a different base class or a decorator making this choice deliberate.

For example if values where default (names are just examples)

class Color(Enum, BindEnumKey):
    RED = "#f00"
    #.... 

# or 

@bind_enum_key(case_sensisitive=True) 
class Color(Enum):
    RED = "#f00"
    #.... 

I do like being able to configure the behavior via decorators.

We could also force users to annotate all enums and reject them if they haven't defined this behavior explicitly

@bind_enum(as="key", case_sensisitive=True) 
class Color(Enum):
    RED = "#f00"
    #.... 

Downside is that this is special behavior for dataclass binder, although some way of annotating this seems sensible 🤔

mthuurne commented 9 months ago

That would indeed work for me and I agree having the ability to use keys seems very useful! I'm just not sure yet as to what is more intuitive for the user because I don't think using the specialized sub classes to differentiate between keys/values is something I'd expect, they also have some other implications in their usage that might be unwanted.

With IntEnum/IntFlag/StrEnum, the user communicates that the values are important. So I think matching by value when these types are used is likely to produce the behavior the user is looking for.

The values of auto are also highly predictable (counting upwards starting with 1 or in case of string enums just the key in lower case), probably wouldn't want to use this when you actively care about the values (consuming foreign toml).

Indeed: while the values are predictable, by using auto() the user expresses that they don't really care about the values and that the values might change in future versions of the software. So in those situations matching by name would make more sense.

I think we should decide for one of these two cases to be the default (key vs value) and have the other available through either a different base class or a decorator making this choice deliberate. ... Downside is that this is special behavior for dataclass binder, although some way of annotating this seems sensible 🤔

So far I've managed to avoid introducing any new syntax for dataclasses in dataclass-binder. I'm not sure this is feasible forever, but I'd like to keep it to a minimum:

In some cases the behavior comes naturally, such as dataclass fields without defaults being mandatory in TOML. However, in the case of selecting name versus value matching for enumerations, I don't think a very intuitive solution exists. I hope that "IntEnum/IntFlag/StrEnum for value matching" is at least intuitive enough that people will have an easier time remembering it after reading the docs.

I'm not certain that using conventions rather than explicit annotations is the right way to go, but it is consistent with the way the library has been designed so far, so I'd like to give it a try.

atomicptr commented 9 months ago

So far I've managed to avoid introducing any new syntax for dataclasses in dataclass-binder. I'm not sure this is feasible forever, but I'd like to keep it to a minimum:

* to reduce the amount of new syntax that people who write new dataclasses have to learn

* to increase the chance of existing dataclasses working with `dataclass-binder` with no or minimal changes

* to make the dataclasses easier to read by people who don't necessarily know all the details of the binding process

* to have the option to support other dataclass-like solutions in the future (such as `attrs`)

In some cases the behavior comes naturally, such as dataclass fields without defaults being mandatory in TOML. However, in the case of selecting name versus value matching for enumerations, I don't think a very intuitive solution exists. I hope that "IntEnum/IntFlag/StrEnum for value matching" is at least intuitive enough that people will have an easier time remembering it after reading the docs.

I'm not certain that using conventions rather than explicit annotations is the right way to go, but it is consistent with the way the library has been designed so far, so I'd like to give it a try.

I actually very much appreciate this.

I pushed the changes, StrEnum, IntEnum and IntFlag all inherit ReprEnum so this seemed like a good candidate

Edit: Also added some more tests which test the case insensitivity working and some error cases

atomicptr commented 9 months ago

@mthuurne Any more changes you would like to see?

mthuurne commented 9 months ago

Sorry for the delay.

I pushed the changes, StrEnum, IntEnum and IntFlag all inherit ReprEnum so this seemed like a good candidate

ReprEnum was introduced in 3.11, while dataclass-binder supports 3.10 as well, so we need an alternative code path for 3.10. I created one in commit 5c32aa35079b53a938e541ff127039552d4e489c, maybe you could cherry-pick that?

I couldn't get CI to work on this PR, but it should now be enabled for future PRs. So I ran the tests locally and got one failing test case and two lines missing test coverage. I'll add comments on the specific lines.

atomicptr commented 8 months ago

Sorry for the delay, I was a bit busy.

Edit: Fixed all the open changes!

atomicptr commented 7 months ago

Hey! Thanks for the review, I have added the requested changes :)

wyleung commented 4 months ago

@mthuurne Is there a timeline when this PR can be merged? As far I can see, there will be no blockers curently?