Closed ice-tong closed 1 year ago
Hey @ice-tong! Thank you for the kind words. :)
You're totally right that `_types_of_iterable` can run into performance issues when run on nested structures or iterables with many elements. This is currently, and unfortunately, a limitation of the implementation.
When a list with many elements is passed to a function, one must inspect all elements of the list to determine the right element type. As far as I can tell, there is no cheaper way to determine this.
One way of approaching this issue is, for large lists, say `n >= 1000`, to only inspect a subset of the elements. For example, the first 100, last 100, and 100 randomly chosen elements sounds like a reasonable approach. Of course, the type might then not be right anymore.
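That sampling heuristic could look something like the following minimal sketch (hypothetical helper names; thresholds as suggested above; the inferred type is then only *probably* correct):

```python
import random

def sample_elements(xs, threshold=1000, k=100):
    """Hypothetical helper: for lists longer than `threshold`, return only
    the first `k`, last `k`, and `k` randomly chosen middle elements."""
    if len(xs) <= threshold:
        return list(xs)
    middle = random.sample(xs[k:-k], k)
    return xs[:k] + xs[-k:] + middle

def types_of_sample(xs):
    # Infer the set of element types from the sampled subset only, so the
    # cost is bounded by roughly 3 * k rather than len(xs).
    return {type(x) for x in sample_elements(xs)}
```

For a multi-million-element homogeneous list this inspects only 300 elements, at the cost of possibly missing a rare element type hiding in the unsampled middle.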
Do you have any thoughts on this?
Hi @wesselb, I agree that there is no cheaper way to accurately get the type of a Python object. I also don't have any good ideas for this right now. 🤣
Let's leave this issue open, because it is certainly a performance issue that at some point will need addressing!
Hi @wesselb, I want to share some possible tricks to optimize the speed of `type_of`.

In the following big-list test, we can use a singleton `Type`, since the many instance creations in `type_of` can be expensive. Caching the hash of Plum types also brings better speed.
```python
import cProfile
import time
from contextlib import contextmanager

from plum import type_of
from plum.type import TypeMeta
from plum.type import Type as plum_Type
from plum.parametric import Tuple as plum_Tuple

big_list = [(1, 2) for _ in range(10000 * 100)]

start_t = time.time()
print(type_of(big_list))
base_t = time.time() - start_t
print(f'Big list cost: {base_t} s')


@contextmanager
def singleton(meta_cls):
    # Cache one Type instance per (class, args) key to avoid repeated
    # instance creation inside type_of.
    origin = meta_cls.__call__
    meta_cls._instances = {}

    def __call__(cls, *args, **kwargs):
        assert not kwargs
        key = (cls, args)
        if key not in cls._instances:
            cls._instances[key] = origin(cls, *args, **kwargs)
        return cls._instances[key]

    meta_cls.__call__ = __call__
    yield
    meta_cls.__call__ = origin


with singleton(TypeMeta):
    start_t = time.time()
    print(type_of(big_list))
    t_1 = time.time() - start_t
    print(f'Big list with singleton type cost: {t_1} s')
    print(f'speed up: {(base_t - t_1) / base_t}')


@contextmanager
def hash_cache(hashable_ty):
    # Memoise __hash__ so repeated hashing of the same type is cheap.
    hash_core = hashable_ty.__hash__
    hashable_ty._hash = None

    def __hash__(self):
        if self._hash is None:
            self._hash = hash_core(self)
        return self._hash

    hashable_ty.__hash__ = __hash__
    yield
    hashable_ty.__hash__ = hash_core


with hash_cache(plum_Tuple), hash_cache(plum_Type), singleton(TypeMeta):
    start_t = time.time()
    print(type_of(big_list))
    t_2 = time.time() - start_t
    print(f'Big list with singleton type and hash cache cost: {t_2} s')
    print(f'speed up: {(base_t - t_2) / base_t}')

cProfile.run('type_of(big_list)', filename='plum_type_of.prof')
```
We get a 40+% speed-up:
```
List[Tuple[builtins.int, builtins.int]]
Big list cost: 5.383634090423584 s
List[Tuple[builtins.int, builtins.int]]
Big list with singleton type cost: 3.3826003074645996 s
speed up: 0.371688296297556
List[Tuple[builtins.int, builtins.int]]
Big list with singleton type and hash cache cost: 2.8426778316497803 s
speed up: 0.47197789004525037
```
The fact that `type_of` uses multiple dispatch also causes a performance issue. Using the following `type_of` implementation gives a 15% speed-up:
```python
def type_of(obj):
    # List, Tuple, Dict, ptype, and _types_of_iterable are Plum internals.
    if type(obj) is list:
        return List(_types_of_iterable(obj))
    elif type(obj) is tuple:
        return Tuple(*(type_of(x) for x in obj))
    elif type(obj) is dict:
        return Dict(_types_of_iterable(obj.keys()),
                    _types_of_iterable(obj.values()))
    else:
        return ptype(type(obj))
```
Hey @ice-tong,

Those are some clever optimisations: halving the runtime of `type_of` is a big improvement! Caching objects and hash values seems like a good way to reduce unnecessary overhead. Some uses of Plum load types dynamically, which means that the hash of a type might suddenly change, but other than that I don't foresee any issues.
> The fact that `type_of` uses multiple dispatch also causes a performance issue. Using the following `type_of` implementation gives a 15% speed-up.
Currently, multiple dispatch is used to make `type_of` extendable. This is an example of where that could be useful. If we were to remove multiple dispatch from `type_of`, then that would not be possible anymore. There hence seems to be a trade-off between speed and functionality here.
@wesselb I'm late to this game, but there's a pretty large body of work around describing this issue, which maybe plum can take advantage of? Two things: 1. runtime type-checks of deeply nested generics, and 2. using this for dispatch.

There's `dry-python/classes`, where @sobolevn has a dispatch-esque ad hoc polymorphism setup. It uses `__instancecheck__` dunders to patch into the `isinstance()` system, such that generics are allowed (both for general dispatch and dedicated `Supports` typeclasses). Loads of boilerplate to do this, but it looks like the `phantom-types` library can be used to clean up the typing, dev-side. I'm probably over-simplifying here, but all this is to say, it was @sobolevn that led me to find `beartype`, which has a pretty phenomenal way of doing `O(1)` nested container type-checking at runtime, in a fully PEP-compliant way.
> One way of approaching this issue is, for large lists, say `n >= 1000`, to only inspect a subset of the elements. For example, the first 100, last 100, and 100 randomly chosen elements sounds like a reasonable approach. Of course, the type might then not be right anymore.
It's effectively a robust version of doing this, just with @leycec's incredible obsession for speed verification (and puns :sweat_smile:).
`pydantic-core` (written in Rust) will provide arbitrary validation without `BaseModel`, pushing pydantic more toward being a true runtime type-checker (with specific love for JSON Schema). SO... can we take advantage of this runtime-checker deluge for some DRY in the dispatch community? As I understand your source code (very briefly, I'll add), you effectively created your own runtime type-checker with special techniques for handling concrete generics, something that @leycec has spent a lot of time generalizing. I've noticed that most of the dispatching solutions basically need a runtime type-checker, but end up pretty much writing their own each time, no? I actually spent a bit of time working on the theoretical side of this problem (a beartype-based dispatcher I lovingly wanted to call bearpatch). Since we really only get `True` vs. `False` out of the `isinstance`-like `is_bearable()` function, I was working on a dispatcher that used a trie to compile a sort of ad-hoc generic type hierarchy that could determine a dispatch for covariant generic subtypes in, what, `O(log n)`? Ish? But this was a nightmare and mistake to undertake alone, and so I found this thread while considering forking `plum` to replace the type-checking guts with a frankenbeartype monstrosity...
So, I realize this is a massive can of worms I'm opening in terms of the horrifying way Python treats runtime types. (E.g., see this glorious manifesto of a rollercoaster that will dig into the hidden bowels of the mechanics and politics of simply trying to implement a beartype-able `@overload` system, which in fact directly mentions NOT doing dispatching, since "other libraries can do this"... and yet "other libraries" all suffer from the same need to check nested containers! :sweat_smile:)

BUT I guess I'm just trying to start a broader conversation here about how one might go about unifying the community around a decent approach to dispatch on deeply nested containers that doesn't require reinventing the wheel every two seconds/libraries? :smile:
Hi @tbsexton, I had the same idea as you: using `beartype` as the runtime type checker to implement multiple dispatch. But I found that `beartype` is not at all fast in some cases, and the type checking is not accurate.
```
In [1]: from typing import List, Tuple

In [2]: from beartype.abby import is_bearable

In [3]: import torch

In [4]: import numpy as np

In [5]: typing_hints = List[Tuple[torch.Tensor, torch.Tensor]]

In [6]: data = [(np.array(1), np.array(2)) for _ in range(10000*100)]

In [7]: %time is_bearable(data, typing_hints)
CPU times: user 23.6 s, sys: 41.2 ms, total: 23.6 s
Wall time: 23.6 s
Out[7]: False
```
So I dropped this idea. 🤣

As long as we can optimize the time consumption of `type_of` as much as possible, I believe that the execution speed of plum can eventually reach an acceptable level. ^_^
shots fired
First up, thanks a metric ton to @tbsexton! But what else would you expect from NIST? Only the best. That's what. Thankfully, ludicrously fast runtime type-checking is in our wheelhouse. Let's see if we can't shed a light into the darkness that is this issue.
Second up, let's re-profile @ice-tong's horrifying worst-case example that makes my filmy eyes bleed, for posterity. Gotta set that record straight if we can. To do so, we'll begin with a simpler example that elucidates possibly interesting things:
```
$ ipython3.10
[ins] In [1]: import beartype
[ins] In [2]: beartype.__version__
Out[2]: '0.11.0' # <-- good. this is good.
[ins] In [3]: from beartype.abby import is_bearable
[ins] In [4]: from beartype.typing import List, Tuple # <-- avoid PEP 585 deprecation warnings, yo.
[ins] In [5]: typing_hints = List[Tuple[str, str]] # <-- so, strings instead of PyTorch tensors.
# So far, so good. First, a plus-sized list of 2-tuples satisfying this hint.
[ins] In [6]: data_good = [(str(i), str(i+1)) for i in range(10000*100)]
[ins] In [7]: %time is_bearable(data_good, typing_hints)
CPU times: user 2.17 ms, sys: 958 µs, total: 3.13 ms
Wall time: 3.12 ms
Out[7]: True
# Goodness continues. @beartype yeets. Second, let's try that again. Just 'cause.
[ins] In [8]: %time is_bearable(data_good, typing_hints)
CPU times: user 34 µs, sys: 3 µs, total: 37 µs
Wall time: 39.1 µs
Out[8]: True
# Goodness amplifies! @beartype is two orders of magnitude faster the second time
# you invoke it than the first. What gives? Simple. The first time you call is_bearable(),
# @beartype secretly creates and caches a synthetic runtime type-checker specific to
# the exact type hint you passed (i.e., "typing_hints" above). Every subsequent time
# you call is_bearable() with the same type hint, @beartype internally reuses the
# previously cached synthetic runtime type-checker for that type hint.
#
# In complexity terms, is_bearable() exhibits amortized worst-case O(1) time complexity
# with negligible constants. This is good and exactly as advertised.
#
# So far, still good. Next, a plus-sized list of 2-tuples *VIOLATING* this hint.
[ins] In [9]: data_bad = [(i, i+1) for i in range(10000*100)] # <-- so, integers instead of strings.
[ins] In [9]: %time is_bearable(data_bad, typing_hints)
CPU times: user 1.03 s, sys: 12.1 ms, total: 1.05 s
Wall time: 1.04 s
Out[9]: False
# *THAT'S TERRIBAD.* Okay, sure. It's not quite the 23.6 seconds terribad that
# @ice-tong exhibited above. But... that's still not great. This should be tens of
# orders of magnitudes faster than it currently is. What gives, @beartype!?!?
#
# The real-world, mostly. In a desperate last-ditch effort to push beartype 0.10.0
# out the door, I implemented is_bearable() in the crudest way possible. I wasn't
# convinced anyone was actually interested in performing their own runtime
# type-checking at the statement level. So, I just shipped the "beartype.abby"
# subpackage out the door as fast as possible. We know how this story ends.
# It would eventually become clear that everyone, in fact, is interested in
# performing their own runtime type-checking at the statement level. Then I
# began facepalming myself repeatedly.
#
# Because laziness prevailed, is_bearable() currently uses the Easier to Ask for
# Permission than Forgiveness (EAFP) approach to detect type hint violations.
# Basically, is_bearable() catches exceptions. Exception handling is fast in
# Python if no exception is raised. As soon as an exception is raised, however,
# exception handling is slooooow. That's what we're seeing above.
#
# This is compounded by the fact that the "data_bad" object is a stupidly large
# list of 2-tuples. So, is_bearable() is not only catching an exception there;
# is_bearable() is catching an exception that has embedded inside of itself a
# string representation of a stupidly large list of 2-tuples. It takes a considerable
# amount of wall-clock time to construct that string representation. In fact,
# almost *ALL* of the 1.03 seconds consumed by that last call to is_bearable()
# was spent constructing a string representation that no one ever actually sees.
# More facepalming.
```
I'll stop there. This embarrassment must end! It's now evident that there is tangible interest in statement-level runtime type-checking. So, I'll absolutely divert resources to optimizing `is_bearable()` for the next beartype `0.12.0` release cycle.
That's right. @leycec always cackles madly at the end of every GitHub post, and this is no exception. `plum` doesn't want to call `is_bearable()` or any other existing API of any other existing third-party package, because `plum` is doing something fundamentally novel that I have never seen anything else do before.
That's fascinating and ultimately the reason we're all here. So, here are the two things I'm seeing:

`plum.type_of()` heuristically reverse-engineers the type hint which an arbitrary object would satisfy were that object to be annotated by that hint. Let's take a moment and appreciate how excitingly insane and non-trivial that problem domain actually is. Because it is. It's exciting. But it's also insane.

The core issue is that `type_of()` will fail in the general case, because there are a countably infinite number of edge cases that grow monotonically with each new CPython minor version and publication of new PEP standards. No. Seriously. To paraphrase childhood nostalgia: "It's dangerous to go alone."
Preach, old man!
There are many intersecting issues here. The most significant is that PEP-compliant type hints were never designed, intended, or implemented to be used at runtime. Type hints explicitly prohibit introspection by defining metaclasses whose `__instancecheck__()` and `__subclasscheck__()` dunder methods prohibit trivial introspection via `isinstance()` and `issubclass()`. So how do you even detect what "sort of thing" a given type hint even is? You kinda don't. Numerous PEP standards even publicly admit that "Runtime is not a concern." If you've ever attempted to introspect a type hint at runtime (and you have; oh, you have), you know this hard truth of which I speak. It's like a parasitic slug attached to the inner lining of your intestinal wall. You know it's there, but you'd rather not publicly talk about it to anyone.
The runtime API of type hints varies across Python versions and PEP standards. Most type hints initially popularized by PEP 484 have since been deprecated by PEP 585. Critically, this includes core type hint factories like `typing.List[...]` and `typing.Tuple[...]`. Moreover, arbitrary user-defined classes can subclass arbitrarily complex type hints. We refer to these as generics. Critically, this includes both PEP 484- and 585-compliant generics, whose runtime APIs are completely different "under the hood" despite the superficial similarities in their declarative syntax, e.g.:
```python
from typing import List, Tuple

# This is a PEP 484-compliant generic.
class ListOfStrs484(List[Tuple[str]]): pass

# This is a PEP 585-compliant generic.
class ListOfStrs585(list[tuple[str]]): pass
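To see how different those runtime APIs actually are, here is a small introspection sketch (assuming Python 3.9+): both flavours record their parametrised bases in `__orig_bases__`, but the base objects have different types, and only `typing.get_origin()` papers over the difference.

```python
from typing import List, Tuple, get_origin

class ListOfStrs484(List[Tuple[str]]):  # PEP 484-compliant generic
    pass

class ListOfStrs585(list[tuple[str]]):  # PEP 585-compliant generic
    pass

# Both flavours set __orig_bases__ on the subclass...
base_484 = ListOfStrs484.__orig_bases__[0]
base_585 = ListOfStrs585.__orig_bases__[0]

# ...but one base is a typing alias and the other a builtin
# types.GenericAlias, so naive type() checks treat them differently.
assert type(base_484) is not type(base_585)

# typing.get_origin() normalises both to the builtin origin class.
assert get_origin(base_484) is list
assert get_origin(base_585) is list
```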
So, now users can create instances of both `ListOfStrs484` and `ListOfStrs585` and then pass those instances to `plum.type_of()`, which then needs to transparently support both PEP 484- and 585-compliant generics. And that's only the start, really. There are typed named tuples, typed dictionaries, typed literals, typed protocols, and all manner of hideous monstrosities you do not want to try detecting by hand, let alone handling. Then throw `typing_extensions` into the mix as well, which produces objects that have a completely different runtime API from those produced by `typing`.
Postponement raises its ugly head, too. How do you resolve PEP 563-postponed type hints under the `from __future__ import annotations` pragma? You kinda don't. I mean, @beartype does, but we're certifiably nutso. Don't do what we do. We set bad examples. Except you kinda have to do what @beartype does, or you can't fully resolve postponed type hints. I see that `plum` sorta half-heartedly tries to resolve postponed type hints by trying to pretend that they're just forward references in drag, but they're not. `'Union[MyClass, int]'` is a valid postponed type hint but not a valid forward reference. Right? To fully resolve the former, `plum` would need to perform deep call stack introspection coupled with potentially dangerous dynamic evaluation via `eval()`. Even pydantic, with its immense girth and corporate sponsorship, just gave up and politely requested that PEP 563 be rolled back from Python 3.10. Now everyone except @beartype pretends that `from __future__ import annotations` no longer exists. But... it still exists.
Ambiguity raises its ugly head, too. How do you differentiate between a fixed-length tuple (e.g., an object satisfying `typing.Tuple[int, int]`) and a variadic tuple (e.g., an object satisfying `tuple[int, ...]`)? You kinda don't. How do you differentiate between a string (e.g., an object satisfying `str`) and a string literal (e.g., an object satisfying `typing.LiteralString`)? You kinda don't, not without the abstract syntax tree (AST) inspection typically performed by a static type checker, anyway. @beartype will do that in a future version, incidentally.
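The fixed-versus-variadic ambiguity can be made concrete with a stdlib-only sketch (the `matches_*` helpers below are hypothetical, not any library's API): the very same runtime object satisfies both hints, so nothing on the object itself says which annotation was intended.

```python
from typing import Tuple, get_args

def matches_fixed(obj, hint):
    # Fixed-length check: lengths and per-position types must match.
    args = get_args(hint)
    return (isinstance(obj, tuple) and len(obj) == len(args)
            and all(isinstance(x, t) for x, t in zip(obj, args)))

def matches_variadic(obj, hint):
    # Variadic check: every item must match the single element type.
    elem_type = get_args(hint)[0]
    return isinstance(obj, tuple) and all(isinstance(x, elem_type) for x in obj)

pair = (1, 2)
# The same object satisfies both hints; a reverse mapping like type_of()
# must simply pick one (the stricter fixed-length form).
assert matches_fixed(pair, Tuple[int, int])
assert matches_variadic(pair, Tuple[int, ...])
```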
Sorta. That's not far off, really. Type hints don't suck, per se. They're just sufficiently non-trivial to support at runtime that it's generally not worth anyone's time to actually do so, unless they're @beartype or `typeguard` or another full-blown runtime type-checker, in which case we have no choice but to grit our teeth, hit multiple consecutive shots of home-distilled whisky, clamp down on a leather strap, and brace for incoming pain.
What I'm saying is... `plum` really wants to defer as much of its type hint handling to third-part(y|ies) as possible. Let `plum` obsessively focus on what `plum` excels at, which is multiple dispatch. Let somebody else handle the actual type hints, because ain't nobody got 108 consecutive lifetimes to throw at that shade of nonsense.
@beartype actually does all of the above internally already: full-blown PEP 563 resolution of postponed type hints, full-blown detection of arbitrary type hints, full-blown iteration of generic superclass trees. You name it and we probably have a private internal API handling it. That's the rub, though. Private. We sorta nebulously suspected that somebody else might want to introspect type hints, but never quite got around to finalizing a public API.
The closest we've come is our upcoming Decidedly Object-Oriented Runtime-checking (DOOR) API, shipping with beartype `0.11.0` in a week or two:
```python
# This is DOOR. It's a Pythonic API providing an object-oriented interface
# to low-level type hints that basically have no interface whatsoever.
>>> from beartype.door import TypeHint
>>> usable_type_hint = TypeHint(int | str | None)
>>> print(usable_type_hint)
TypeHint(int | str | None)

# DOOR hints can be iterated Pythonically.
>>> for child_hint in usable_type_hint: print(child_hint)
TypeHint(<class 'int'>)
TypeHint(<class 'str'>)
TypeHint(<class 'NoneType'>)

# DOOR hints support equality Pythonically.
>>> from typing import Union
>>> usable_type_hint == TypeHint(Union[int, str, None])
True  # <-- this is madness.

# DOOR hints support rich comparisons Pythonically.
>>> usable_type_hint <= TypeHint(int | str | bool | None)
True  # <-- madness continues.

# DOOR hints are self-caching.
>>> TypeHint(int | str | bool | None) is TypeHint(int | str | bool | None)
True  # <-- blowing minds over here.
```
Basically, `beartype.door` is what `typing` should have been but isn't (and never will be). If a Pythonic object-oriented API for introspecting type hints would be of interest to `plum`, just let us know what you'd like us to stuff in there. We'll make the magic happen on your behalf.
Thanks again to @tbsexton for the gracious ping! Furious fist-bumps to @wesselb for maintaining all of this mysterious magic and @ice-tong for their justifiable interest in both macro- and micro-optimization. Literally can't wait to see where you take `plum` next, @wesselb. :fist_right: :fist_left:
@leycec thanks for dropping by! I hesitated to so rudely summon you from "hibernation" (read: implementing whatever beautiful insanity v0.12 is turning into), but I won't lie, I'm glad I did! :sweat_smile:

You mentioned:
```
# The real-world, mostly. In a desperate last-ditch effort to push beartype 0.10.0
# out the door, I implemented is_bearable() in the crudest way possible. I wasn't
# convinced anyone was actually interested in performing their own runtime
# type-checking at the statement level. So, I just shipped the "beartype.abby"
# subpackage out the door as fast as possible. We know how this story ends.
```
I actually had started writing out a reply to @ice-tong somewhat along those lines, since I had noticed the `is_bearable` interface source code had a big `# FIXME: Optimize us up, please.`.
> What I'm saying is... plum really wants to defer as much of its type hint handling to third-part(y|ies) as possible. Let plum obsessively focus on what plum excels at, which is multiple dispatch. Let somebody else handle the actual type hints, because ain't nobody got 108 consecutive lifetimes to throw at that shade of nonsense.
This was exactly what I was hoping to hear, though doing this smoothly, and without disrupting @wesselb's workflow as much as possible, would be excellent. I know the `dry-python/classes` docs mention beartype for doing generic delegates, but I admit I can't really figure out how the two are meant to interop, when the only overlap I can see (so far) is that both beartype and `classes` are built on `isinstance`'s, at some deep, arcane level. This puts us back where we started...
Needless to say, DOOR is everything I was ever scared I needed, so I'd be happy to beta test for some ideas I've been floating around. Is there an issue/PR I could migrate the discussion to, to keep this on-topic?
> no choice but to grit our teeth, hit multiple consecutive shots of home-distilled whisky, clamp down on a leather strap, and brace for incoming pain.
I feel like we need our own ~~masochist~~ types-and-functional-programming version of PyCon to ~~commiserate~~ collaborate on, for all of us weirdos that like this sort of thing. Enough (Dask, Xarray, Jax, Toolz, BiDict, Pandera, Pydantic, and PyViz) folks have done things like this to warrant our own (heretical?) community gatherings. I know I've wanted, say, higher-kinded types an untold number of times to handle this sort of thing, which took a small army to describe to others, y'know, why (another of my favorite issues to despairingly track/doomscroll). Or, e.g., have you seen `coconut-lang`? @evhub does some amazing AST manipulation to handle dispatching on pattern-matched function definitions, another ingenious bit of wizardry that may be up your alley. It's really just the lack of LSP and emacs tooling that keeps me back in vanilla Python land most of the time :sweat_smile:.
Anyway, thanks @leycec for dropping by, and @wesselb this is a lot of text-walls to come back to, but I guess I just want to say, we're here to help? Happy to continue this discussion here or in a new thread, etc., if beartype/typeguard reliance interests you.
Sup @tbsexton and @leycec,
I'm currently away on holidays, but I saw emails for your messages popping up and just wanted to briefly thank you for the fantastic posts! Once I'm back at a keyboard I'll reply properly.
In the meantime, I just wanted to say that beartype looks absolutely fantastic, and I'm puzzled that I wasn't aware of the library before. I am more than happy; hell, I'd even much prefer to have plum offload everything that isn't multiple dispatch to other libraries. I realise that some appreciate that plum currently is dependency-free, so a simplistic fallback (like the current `type_of`) in case beartype or another dependency isn't installed seems like a good solution that might yield the best of both worlds.
@wesselb: Wondrous idea. Graceful fallbacks and optional dependencies are the only way. @beartype itself does that all the friggin' time in our main codebase and test suite. Let us know how @beartype can help `plum` pluck all that ripe low-hanging fruit for multiple dispatch. Oh, and I hope you had a phenomenal holiday! August and September are the time to be alive here in Canada; my wife and I pretty much just lived on trail bikes for the past three weeks. :sun_with_face: :mountain_bicyclist: :sun_with_face:
@ice-tong: Behold! @beartype commit 8b15fa resolves the shameful performance regression in `is_bearable()`, which now guarantees the same `O(1)` "constant-time with negligible constants" runtime performance in all edge cases as the `@beartype` decorator:
```
In [1]: from typing import List, Tuple

In [2]: from beartype.abby import is_bearable

In [3]: import torch

In [4]: import numpy as np

In [5]: typing_hints = List[Tuple[torch.Tensor, torch.Tensor]]

In [6]: data = [(np.array(1), np.array(2)) for _ in range(10000*100)]

In [7]: %time is_bearable(data, typing_hints)
CPU times: user 31 µs, sys: 2 µs, total: 33 µs
Wall time: 36.7 µs
Out[7]: False
```
I threw all of Saturday and Sunday evening at making that magic happen, which meant refactoring the entire @beartype codebase behind the scenes to support dynamic generation of arbitrary type-checking tester functions. Please don't do this at home, kids. :sweat_smile:
Let me know if you hit any other speed snafus, @ice-tong. Actually... don't let me know. I intend to spend next weekend doing nothing but playing every video game that exists.
Holy bananas, bearman
Well @leycec that's a tour-de-force, right there. Incredibly useful! Now... go get some sleep, I guess? :sweat_smile:
> I intend to spend next weekend doing nothing but playing every video game that exists.
Whoa! That's a massive effort. I hope you indeed spent last weekend playing your entire catalogue of video games. 😄
First of all, thanks @leycec, @tbsexton, and @ice-tong for the in-depth discussion. It's much appreciated.
On a high level, for Plum to function, we basically need two things:

(Thing 1) A function `type_of` which takes in an object `o` and gives back the type of `o`. More specifically, it should give the most strict type `t` such that `o` is an instance of `t`. The built-in function `type` does this, to some extent, but obviously it doesn't deal with generics:

```python
>>> type([1, 2])
list  # Desired: list[int]
```
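A toy sketch of Thing 1, for lists of homogeneous elements only (the `naive_type_of` helper below is hypothetical, not Plum's implementation; it ignores nesting, heterogeneity, and every edge case discussed in this thread):

```python
def naive_type_of(obj):
    # Reconstruct list[T] by collecting the element types; fall back to the
    # bare type when the list is heterogeneous or empty.
    if isinstance(obj, list):
        elem_types = {type(x) for x in obj}
        if len(elem_types) == 1:
            return list[elem_types.pop()]
        return list
    return type(obj)

assert naive_type_of([1, 2]) == list[int]
```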
As @leycec rightly says, implementing such a function, unfortunately, is a massive can of worms, and probably the biggest challenge to tackle.
> The core issue is that `type_of()` will fail in the general case, because there are a countably infinite number of edge cases that grow monotonically with each new CPython minor version and publication of new PEP standards. No. Seriously. To paraphrase childhood nostalgia: "It's dangerous to go alone."
If we did not need Thing 1 and only required the ability to check whether objects are of a certain type at runtime, then we could use @beartype all the way. Sadly, though, I do think multiple dispatch requires a `type_of`. The issues of which you speak, @leycec, which are (issue 1) the variable runtime API of type hints across Python versions and PEP standards (see below), (issue 2) postponement (see also below), and (issue 3) ambiguity (see again below), are something that we've certainly felt. Admittedly, a general solution is likely not possible.
(Thing 2) For all types returned by `type_of`, we require a ~~total~~ partial order `<=` indicating whether one type is a subtype of another. As I understand it, @leycec, this is what `TypeHint` from `beartype.door` could provide, with the bonus of being blazingly fast. I am definitely in favour.
In addition to Thing 1 and Thing 2, we would really like the ability to define custom types which extend the type system. One such example would be something that we could call a "deferred type". By deferred types I mean types which can already be used for type hints, but which may not yet exist. Forward references, I suppose, are one such example. Another example would be `TFTensor`, which initially resolves to `NotYetImported["tf.Tensor"]` but eventually resolves to `tf.Tensor` once TensorFlow is imported. We desire the ability to construct custom types that can implement such behaviour and nicely interact with (1) existing types, (2) `type_of`, and (3) `<=`.
Plum currently attempts to deal with this by converting the type to a string and parsing the string. Yes, really. The parsed string is then wrapped in versions of `Union` et cetera with a consistent interface, which are then used. Clearly, this is not a scalable approach. Again quoting @leycec,

> ain't nobody got 108 consecutive lifetimes to throw at that shade of nonsense.

If at all possible, I would really like to get rid of this wrapping of types. (@PhilipVinc)
Forward references have been a pain to get right and are responsible for a fair chunk of the complexity of Plum. @leycec, you say that @beartype resolves PEP 563-postponed type hints. Is that something Plum could borrow, too?
> `'Union[MyClass, int]'` is a valid postponed type hint but not a valid forward reference. Right?
You're right that `"Union[MyClass, int]"` won't work, though our intention is that, currently, the user uses a slightly different pattern:

```python
from typing import Union

from plum import dispatch


class MyClass:
    @dispatch
    def f(self, x: Union["MyClass", int]):
        return x
```
Briefly on this topic, there was a time when I thought the following would be a good idea: whenever a user desires a forward reference, require `from __future__ import annotations` and simply rely on `typing.get_type_hints` to work the magic. I have a feeling, though, that this may be a terrible idea, because `from __future__ import annotations` might break stuff. @leycec, @tbsexton, @ice-tong, @PhilipVinc, I'm just mentioning this here in case any of you think this is either a horrible or fantastic idea.
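For reference, the `typing.get_type_hints` magic in question looks roughly like this (the explicit string annotations below stand in for what `from __future__ import annotations` would produce for every annotation in the module):

```python
from typing import get_type_hints

# Under PEP 563, annotations are stored as strings; get_type_hints()
# evaluates them back into real type objects using the function's globals.
def f(x: "int", y: "list[str]") -> "bool":
    return bool(x) and bool(y)

hints = get_type_hints(f)
assert hints["x"] is int
assert hints["y"] == list[str]
assert hints["return"] is bool
```

The catch, as noted above, is that forcing the `__future__` import on users changes annotation semantics module-wide, which is exactly the "might break stuff" worry.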
> How do you differentiate between a fixed-length tuple (e.g., an object satisfying `typing.Tuple[int, int]`) and a variadic tuple (e.g., an object satisfying `tuple[int, ...]`)?
Fortunately, I believe this should not be an issue, as long as Thing 1 and Thing 2 are soundly implemented. That is, `type_of(o)` should always give the most strict type `t` such that `o` is an instance of `t`, and all types should soundly implement the ~~total~~ partial order `<=`.

For the above example, `type_of((2, 2))` therefore should give `Tuple[int, int]`, and the ~~total~~ partial order should satisfy `Tuple[int, int] <= tuple[int, ...]`.
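As a toy illustration of that ordering, here is a hypothetical `tuple_hint_le` handling only tuple hints (Plum and `beartype.door` implement the general partial order): a fixed-length `Tuple[int, int]` is below the variadic `Tuple[int, ...]` because every fixed pair of ints is also a variadic tuple of ints.

```python
from typing import Tuple, get_args

def tuple_hint_le(sub, sup):
    # Minimal sketch of `<=` for tuple hints only (hypothetical helper).
    sub_args, sup_args = get_args(sub), get_args(sup)
    if len(sup_args) == 2 and sup_args[1] is Ellipsis:
        # Variadic supertype: every fixed element type must be a subclass
        # of the variadic element type.
        return all(issubclass(a, sup_args[0])
                   for a in sub_args if a is not Ellipsis)
    # Otherwise require an exact match of the argument tuples.
    return sub_args == sup_args

assert tuple_hint_le(Tuple[int, int], Tuple[int, ...])
```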
> What I'm saying is... plum really wants to defer as much of its type hint handling to third-part(y|ies) as possible.
I wholeheartedly agree. My sense is that it might be time for a `2.*` series of Plum where we reconsider all of the above and try to simplify the design of the package, like getting rid of type wrapping, and rely as much as possible on other libraries like @beartype for typing, with simplistic fallbacks in case those libraries aren't loaded.
Excellent exegesis, @wesselb! As always, because you're right about everything, I'm in agreement with everything. I'm also on the eroded precipice of releasing beartype `0.11.0` and thus a bit pressed for volunteer time. But... this is exciting, so let's do this instead.
> A function `type_of` which takes in an object `o` and gives back the type of `o`.
Yes! You also want to do this in O(1)
time with negligible constants, which is (yet again) right in @beartype's wheelhousecave. To do this in O(1)
time, some combination of @beartype and/or Plum will want to pursue a similar strategy as @beartype's core O(1)
type-checking algorithm. That is, rather than exhaustively introspecting all items of an arbitrary container in O(n)
time to decide the nested subtype of that container, we and/or you instead pseudo-randomly introspect only a single item efficiently selected at pseudo-random in O(1)
time, decide the type of only that item, and then blindly adopt that type as the nested subtype of that container by the power of casual hand-waving.
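As a toy illustration of that hand-waving (a hypothetical sketch, not @beartype's actual implementation): infer a container's type by inspecting a single pseudo-randomly chosen item per nesting level, so the work per level is O(1) regardless of container size.

```python
import random

def type_of_sampled(obj):
    # Hypothetical sketch: infer a container's type hint by inspecting a
    # single pseudo-randomly chosen item, i.e. O(1) work per nesting level.
    if isinstance(obj, list):
        if not obj:
            return list
        item = obj[random.randrange(len(obj))]  # one sample, casual hand-waving
        return list[type_of_sampled(item)]
    if isinstance(obj, tuple):
        if not obj:
            return tuple
        item = obj[random.randrange(len(obj))]
        return tuple[type_of_sampled(item), ...]  # variadic guess
    return type(obj)
```

Of course, as noted above, the inferred type may simply be wrong for heterogeneous containers; that is the price of O(1).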
Again, @beartype can help you here. You might want to allow your users themselves to select between O(1)
, O(log n)
, and O(n)
type detection strategies, right? Thankfully, beartype 0.10.4
is one step ahead. @beartype already provides a pair of public APIs for expressing these user desires:
# Type-check all items of a list in O(n) time.
from beartype import beartype, BeartypeConf, BeartypeStrategy
@beartype(conf=BeartypeConf(strategy=BeartypeStrategy.On))
def get_list_size_slowly(muh_list: list[str]) -> int:
    return len(muh_list)
The same support extends to our statement-level runtime checkers as well:
# Type-check all items of a list in O(n) time.
from beartype import BeartypeConf, BeartypeStrategy
from beartype.abby import die_if_unbearable
big_list_of_strings = ['Pretend', 'this', 'is', 'alotta', 'strings.']
die_if_unbearable(big_list_of_strings, conf=BeartypeConf(strategy=BeartypeStrategy.On))
So, in synopsis:
beartype.BeartypeStrategy is an enumeration with members:

- O1, expressing a desire for O(1)-style constant-time typing.
- Ologn, expressing a desire for O(log n)-style logarithmic-time typing.
- On, expressing a desire for O(n)-style linear-time typing.

beartype.BeartypeConf is a configuration dataclass enabling users to fully configure type-checking through various parameters, including:

- strategy, an instance of beartype.BeartypeStrategy.

Onward and upward, valiant typing soldiers!
For all types returned by type_of, we require a total order <= indicating whether one type is a subtype of another.
Partial order, actually. But, yes! Exactly! A partial order is what builtin Python sets and frozen sets provide as well, I believe. The idea here is that you can only reasonably compare semantically commensurable type hints (i.e., type hints that are meaningfully related to one another). Examples include Union[str, bool, None]
and str | bool
, where those two type hints are commensurable; indeed, beartype.door.TypeHint(str | bool) < beartype.door.TypeHint(Union[str, bool, None]) is True
as expected.
Type hints that are semantically incommensurable, however, always compare as False
. Examples include str | bool
and Callable[[], None]
. Since there's no meaningful relationship between a union and a callable type hint:
- beartype.door.TypeHint(Callable[[], None]) < beartype.door.TypeHint(str | bool) is False.
- beartype.door.TypeHint(str | bool) < beartype.door.TypeHint(Callable[[], None]) is False, too! :scream:

Thus, partial rather than total order.
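A tiny stdlib illustration of that point: built-in frozensets are partially ordered by <= (the subset relation), and two overlapping-but-incomparable sets compare False in both directions, exactly like incommensurable type hints.

```python
# Built-in frozensets form a partial order under <= (the subset relation),
# the same shape of ordering that TypeHint provides over type hints.
a = frozenset({1, 2})
b = frozenset({2, 3})

# Incomparable elements: both directions are False, so <= is not total.
print(a <= b)               # False
print(b <= a)               # False

# Comparable elements behave as expected.
print(frozenset({1}) <= a)  # True
```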
Everything above is shipping with beartype 0.11.0
, to be released tonight...ish. Please Gods, let it be tonight. :sweat:
As I understand it, @leycec, this is what
TypeHint
frombeartype.door
could provide, with the bonus of being blazingly fast.
Yes! Blazing fast is @beartype's two middle names. Prod us if performance regressions rear their ugly mugs again and we'll promptly wack them with the Mallet of Justice.
Depressingly, I just ran out of time. beartype 0.11.0
still isn't out and I basically answered none of @wesselb's questions.
Oh, right! PEP 563-postponed type hints. That's critical. So, beartype.door.TypeHint()
currently fails to resolve PEP 563-postponed type hints for you, but absolutely will under a future release, just probably not in time for beartype 0.11.0
. The following black magic will work someday; we just have to wire a few things together to make that happen:
>>> from beartype.door import TypeHint
# Currently fails, but definitely will work someday.
# I swear. Would @leycec lie? *sweat drips down forehead*
>>> resolved_typehint = TypeHint('Union[MuhClass, MuhOtherClass, bool]')
>>> resolved_typehint.hint
Union[MuhClass, MuhOtherClass, bool] # <-- everything resolved, yo
:sweat:
Thus, partial rather than total order.
Yes. How very silly of me... You're totally right, of course! Let's pretend I said partial order all along.
Yes! You also want to do this in O(1) time with negligible constants, which is (yet again) right in https://github.com/beartype's wheelhousecave.
Oh, yes, please! If @beartype could be used to implement type_of
, and the user would even be able to select an O(1), O(log n), or O(n) strategy, then I think that would be the perfect outcome. :) I think that would be really cool and satisfy all use cases.
So, beartype.door.TypeHint() currently fails to resolve PEP 563-postponed type hints for you, but absolutely will under a future release, just probably not in time for beartype 0.11.0. The following black magic will work someday; we just have to wire a few things together to make that happen:
The goodness just keeps on going! Also this sounds fantastic. :)
I'm also on the eroded precipice of releasing beartype 0.11.0 and thus a bit pressed for volunteer time.
Please do prioritise your own work and don't let this issue take up too much of your time! I am very appreciative of all your time invested in the discussion thus far, @leycec, @ice-tong, @tbsexton.
:boom:
Just did things. The new Plum-friendly beartype.door
and beartype.peps
subpackages have gone public with our hot-off-the-presses release of @beartype 0.11.0.
beartype.door: Never Leave typing Without It

The beartype.door.TypeHint class now provides a partial ordering over type hints, amongst other ludicrous things like Pythonic indexing, iteration, and membership testing of child type hints:
# This is DOOR. It's a Pythonic API providing an object-oriented interface
# to low-level type hints that basically have no interface whatsoever.
>>> from beartype.door import TypeHint
>>> union_hint = TypeHint(int | str | None)
>>> print(union_hint)
TypeHint(int | str | None)
# DOOR hints have Pythonic classes -- unlike normal type hints.
>>> type(union_hint)
beartype.door.UnionTypeHint # <-- what madness is this?
# DOOR hints can be classified Pythonically -- unlike normal type hints.
>>> from beartype.door import UnionTypeHint
>>> isinstance(union_hint, UnionTypeHint) # <-- *shocked face*
True
# DOOR hints can be type-checked Pythonically -- unlike normal type hints.
>>> union_hint.is_bearable('The unbearable lightness of type-checking.')
True
>>> union_hint.die_if_unbearable(b'The @beartype that cannot be named.')
beartype.roar.BeartypeDoorHintViolation: Object b'The @beartype that cannot
be named.' violates type hint int | str | None, as bytes b'The @beartype
that cannot be named.' not str, <class "builtins.NoneType">, or int.
# DOOR hints can be iterated Pythonically -- unlike normal type hints.
>>> for child_hint in union_hint: print(child_hint)
TypeHint(<class 'int'>)
TypeHint(<class 'str'>)
TypeHint(<class 'NoneType'>)
# DOOR hints can be indexed Pythonically -- unlike normal type hints.
>>> union_hint[0]
TypeHint(<class 'int'>)
>>> union_hint[-1]
TypeHint(<class 'NoneType'>)
# DOOR hints can be sliced Pythonically -- unlike normal type hints.
>>> union_hint[0:2]
(TypeHint(<class 'int'>), TypeHint(<class 'str'>))
# DOOR hints supports "in" Pythonically -- unlike normal type hints.
>>> TypeHint(int) in union_hint # <-- it's all true.
True
>>> TypeHint(bool) in union_hint # <-- believe it.
False
# DOOR hints are sized Pythonically -- unlike normal type hints.
>>> len(union_hint) # <-- woah.
3
# DOOR hints reduce to booleans Pythonically -- unlike normal type hints.
>>> if union_hint: print('This type hint has children.')
This type hint has children.
>>> if not TypeHint(tuple[()]): print('But this other type hint is empty.')
But this other type hint is empty.
# DOOR hints support equality Pythonically -- unlike normal type hints.
>>> from typing import Union
>>> union_hint == TypeHint(Union[int, str, None])
True # <-- this is madness.
# DOOR hints support comparisons Pythonically -- unlike normal type hints.
>>> union_hint <= TypeHint(int | str | bool | None)
True # <-- madness continues.
# DOOR hints are semantically self-caching.
>>> TypeHint(int | str | bool | None) is TypeHint(None | bool | str | int)
True # <-- blowing minds over here.
beartype.peps: The Final Word on PEP 563

The beartype.peps subpackage is considerably simpler. Currently, it just provides a resolve_pep563() function fully resolving PEP 563-postponed type hints. This is probably the closest Python will ever get to a reference implementation of runtime PEP 563 support:
# In "horrorshow.py":
from __future__ import annotations # <-- people should really stop doing this
def sad_function_is_sad() -> str | bytes | None:
    if 'sad':
        return "You can't spell 'sadness' without madness (excluding the 'm')."
    else:
        return b'This cannot be! But it is! An unreachable condition erupts.'
print('Uh oh. We got postponed type hints here:')
print(repr(sad_function_is_sad.__annotations__['return']))
print()
# Your time is over, PEP 563.
from beartype.peps import resolve_pep563
# Make the bad postponed type hints go away, Bear Daddy.
resolve_pep563(sad_function_is_sad)
print('Suck it, PEP 563. Only real type hints need apply:')
print(repr(sad_function_is_sad.__annotations__['return']))
Now run that, because you trust @leycec implicitly. You do trust @leycec implicitly, don't you? :smiling_face_with_tear:
$ python3.10 horrorshow.py
Uh oh. We got postponed type hints here:
'str | bytes | None'
Suck it, PEP 563. Only real type hints need apply:
str | bytes | None
type_of()
Oh, Gods. Here we go, folks.
I've given a bit of thought to how beartype
might implement something resembling Plum's type_of()
. I'm convinced it can be done in O(k)
time for k
the height of a type hint tree (e.g., k == 3
for Union[str, list[int], tuple[set[bool], ...]]
). Since almost all type hints of interest to anyone that matters are shallow in height, that's basically O(1)
time. Recursion is dangerous and slow in Python, so we implement it using a single loop.
Relatedly: this is actually how @beartype dynamically generates Python code that runtime type-checks arbitrarily complex type hints. Yup. It's all a single loop. :sweat_smile: :cold_sweat: :hot_face:
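For illustration, the recursion-to-loop trick looks something like this (a hypothetical toy, not @beartype's generated code): traverse a nested structure with an explicit stack inside a single loop.

```python
def max_list_depth(obj):
    # Hypothetical sketch: measure the nesting depth of lists with a single
    # loop over an explicit stack, avoiding Python's slow, bounded recursion.
    stack = [(obj, 0)]
    deepest = 0
    while stack:                      # one loop, no recursion
        item, depth = stack.pop()
        if isinstance(item, list):
            deepest = max(deepest, depth + 1)
            for child in item:
                stack.append((child, depth + 1))
        else:
            deepest = max(deepest, depth)
    return deepest
```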
So. A @beartype-style type_of()
implementation guarantees O(k)
time by only randomly sampling a single item at each nesting depth of a container. For example, given a container satisfying the type hint list[list[str]]
(i.e., a list of lists of strings), a @beartype-style type_of()
would randomly sample one item of the outer list, then one item of that inner list, infer str from it, and reassemble the result as list[list[str]].

Easy 'nuff, right? The devil's in the details. Most containers don't support random access; they only support sequential access (e.g., dictionaries, sets), which means you can only efficiently access the first item. Moreover, many iterables can't be safely iterated at all idempotently, because the mere act of iteration fundamentally changes the iterable (e.g., generators, coroutines, counters). The theoretical complexities spiral into real-world hardships pretty fast.
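To make the sequential-access point concrete (plain stdlib, no @beartype): you can still grab one element of a dict or set in O(1) via its iterator, but that single peek is the only cheap access you get.

```python
# Containers without random access (dicts, sets) still allow O(1) access
# to *one* element by taking a single step of their iterator.
d = {"first": 1, "second": 2}
s = {"only", "sequential", "access"}

first_key = next(iter(d))            # O(1): one iterator step, no full scan
first_value = next(iter(d.values()))
some_element = next(iter(s))         # sets: no indexing, just iteration

print(type(first_key), type(first_value), type(some_element))
```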
Moreover, what about infinitely nested containers? :astonished:
plaid_speed = ["We've now hit Ludicrous Speed.",]
plaid_speed.append(plaid_speed)
Nobody ever actually uses infinitely nested containers in production code, because we are all sane here, but malicious Bad Dudes will inevitably pass one of these monstrosities to something that transitively calls type_of()
. And then Python blows smelly chunks everywhere.
Okay. Technically, one can and should protect against that. Plum is probably doing something like this already. You just internally cache the id()
of each previously visited item in a local set
named seen_em
; then, before delving any deeper into an item, you just test whether you've already done that before and avoid doing so if you have. Right? Potential catastrophe averted.
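The id()-caching guard described above can be sketched in a few lines (hypothetical helper, keeping the seen_em name from the text): collect the scalar element types of nested lists without looping forever on self-referential containers.

```python
def element_types(obj, seen_em=None):
    # Hypothetical sketch: cache the id() of each visited container in
    # seen_em so a self-referential list can't recurse forever.
    if seen_em is None:
        seen_em = set()
    if id(obj) in seen_em:        # been here before: avert the catastrophe
        return set()
    if not isinstance(obj, list):
        return {type(obj)}
    seen_em.add(id(obj))
    found = set()
    for item in obj:
        found |= element_types(item, seen_em)
    return found

plaid_speed = ["We've now hit Ludicrous Speed."]
plaid_speed.append(plaid_speed)   # the list now contains itself
print(element_types(plaid_speed))
```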
Sounds like blood, thunder, and fun. But will @leycec ever find volunteer open-source time to actually do that when he still hasn't authored sound ReadTheDocs-hosted documentation for @beartype? Tune in next exciting GitHub monologue to find out all this... and more. :bear:
@leycec @wesselb So --- and this might be totally naive --- do we need a type_of
function to operate in the first place? Can we just... solve the inverse problem?
Hear me out:
Ostensibly, any user of multiple dispatch here isn't actually trying to dispatch on every extant type possible, but rather they register allowable types and provide a mechanism for extensibility. This means that, at runtime, we don't actually need to know _"what is the type_of my input?"_, but rather, _"is my input a registered type, and if so, which one is best?"_
You might be thinking (as I did, previously) "hey wait! I have to know what the type is in order know if it's registered...right?" And that's kind of correct. Except that it happens we have a "magic oracle" a la is_bearable
, such that we can ask "what about this one?" and each time get a YES or a NO.
So, yeah, for a given input, if the "nice" type information isn't known at runtime (e.g. complex, nested iterables, and anything that type()
gives us back garbage for) we can just go down the list in approximately O(n) worst-case time... it's a sort of type hash-table.
So, this is a lot like the good old nearest-neighbor problem, right? (no? Not following? alright, hang on...)
If I have a bunch of data inputs and outputs, and some new data comes along, I want to guess what the output should be. If I were DATA GOD
, I would have a magic, domain-driven function that spat out the one-true answer every time. In our situation, this is the same as having our type_of()
function, fully operational (...battle-station?).
But we aren't usually DATA GOD
, so we cheat. We look up in our table of known inputs and see if any of them are "close". In this case, this is our is_bearable()
. Ok, so far so good. Except, O(n) is the bane of nearest neighbors' existence, since generally we want to do this a lot.... like an O(n^2) lot. So what do we do?
By imposing a hierarchy onto the space of data (e.g. different sized binning of the cartesian coordinates) we can ask fewer, simpler questions... is my new point in the left or right half of my data? Ok, now in the left or right of that? And that?
This is a (really awful) way of describing a binary search tree (more like a kd-Tree). But anyways, we do have hierarchy information, no? there's a partial order on types, and iirc containers can be treated as a tree of sorts...
Let's make a prefix tree of some kind, that includes everything beartype
already knows about, plus the set of registered TypeHint
stuff that the dispatcher has actually been given. Then we treat the dispatch problem as a tree search problem.
brief sketch

1. beartype.door (or plum, idk) has a search tree of known types, based on their partial order. Generic types (a la union, protocols, etc) would just recursively use the same search tree, just for the type parameter instead of the type itself.
2. plum registers types to the dispatcher... these can be either already in the search tree, or added explicitly as new type alias' or something. All of the registered types are effectively just adding a bool (registered=True) as their node value. This step is where the new door api is invaluable... we can insert the dispatched types into their correct spot in the tree right away.
3. dispatch then searches the tree to approximate type_of, assuming it's either a type we already know or have been told about in the dispatcher.

For your example of infinite lists, then either:

- the dispatcher was registered with some finite nesting depth (it's List[List[List[str]]], damnit!), OR
- List[List[List[List[List[List[...]]]]]] is a List[List[List]], so... we only need to know if it's the latter or not (3 steps deep in the tree, as spec'd in the dispatcher).

Technically, we don't even need the first step... plum could handle making a prefix tree by itself, and only of types that we dispatch on. BUT I feel like the overhead is minimal, and we'd get really rich debug info if we had the full tree, so... shrug?
Admittedly, this is going to run in O(log n) for n types known a priori (whether all known types, or just the dispatched types, see above). My assumption here is that the number of possible input types is going to WAY dwarf the number of types a dispatcher is registered to, so I'd personally rather scale by the latter. ALSO, we can always memoize this, so that if we do find a decent type to "match" our input, it goes right to the O(1) lookup table?
Yup. Smarty-pants NIST scientist @tbsexton with the game-winning layup at the buzzer. A binary search tree-ish monstrosity administered by beartype.door.is_bearable()
is a clever alternative, and possibly even simpler (despite its monstrous form).
It's germane for me to admit at this critical juncture that @beartype currently does not deeply type-check nearly as much as everyone believes it does: e.g.,
>>> from beartype.door import is_bearable
>>> is_bearable({'This is...': 'kinda non-ideal.'}, dict[bool, float])
True # <-- oh gods wth is this horror show
The reason being I haven't found a spare hot minute to throw at deep type-checking. This is embarrassing, because that's what we're supposed to do. I kinda got distracted by other deceptively enticing open feature requests like documenting everything on ReadTheDocs (RTD) :grimacing: and the implicit numeric tower and is_color
... and here we are.
Maybe I'd just better do deep type-checking, instead. I'm confident I can spin up deep type-checking support for sets and mappings (i.e., dictionaries), which hopefully covers most of the remaining type hints plum
users would be expected to dispatch on. If I had a goatee, I would be stroking it optimistically.
Hah I mean, we'll see if this is viable, and monstrous just sounds fun right? :P
As for deep-checking, this is strictly for (based on the nasty example we saw starting this whole thing) dict, set, and tuple? Since those don't randomly sample without some ctypes shenanigans? Hmm...
I see your door api already does this? e.g.
>>> TypeHint(T.Dict[T.Any, int]) > TypeHint(T.Dict[bool,int])
True
>>> TypeHint(T.Dict[T.Any, int]) < TypeHint(T.Dict[bool,int])
False
>>> TypeHint(T.Dict[str, T.Dict]) > TypeHint(T.Dict[str,T.Dict[str,int]])
True
>>> TypeHint(T.Dict[str, T.Dict[str, T.Dict]]) > TypeHint(T.Dict[str,T.Dict[str,int]])
False
So yeah that's awesome. AND we now know, a priori, how deep a given dispatcher needs to go (since by construction we are assuming the universe of possible types is given by the dispatcher).
For my case, if my dispatcher had the audacity to register TypeHint(T.Dict[str, T.Dict[str, T.Dict]])
, then in my "vision", we recursively traverse the tree every time there's a new generic's parameter. It's like a new level of dreams in inception; we just "start over" with the parameter as our type-in-question. So we shouldn't ever really be asking is_bearable
about anything deeper than one level, right?
>>> my_json = {'look': {'forthe': {'bear': 'necessities'}}}
>>> is_bearable(my_json, T.Dict)  # <-- the root check
True
>>> level1 = next(iter(my_json.values()))
>>> is_bearable(level1, T.Dict)  # <-- the 1st-level child check
True
>>> level2 = next(iter(level1.values()))
>>> is_bearable(level2, T.Dict)  # <--- I have no memory of this place
True
>>> is_bearable(next(iter(level2.values())), str)  # this way! the air is not so foul down here
True
I skipped the checks on the keys, but the point is the same... each time it's just a check against the generic parameter, and whatever we found inside the corresponding container in what got passed? We don't need to randomly access anything, since we know a priori what exactly should be at every level of the registered type hint.
Am I missing something really obvious? I'm running a tad low on sleep... :sweat_smile:
EDIT: I thought of an easier way to describe that monologue (thank you, sleep)... It's a separation of concerns between type variance coming from (1) the container type itself and (2) its generic parameters.
We only really need the search tree for the first one, because parameters either also need the first one, or also have their own parameter, and down the recursion we go, until we hit the max depth specified by the dispatch registry (this part is important!)
Obviously it's easier (for us, anyway) if beartype handled both types in one call, with deep-checks. But frankly I'm too excited for real documentation and whatever is_color
is that I can't imagine adding more stuff to that pile (you maniac :laughing:)
Ah, ha. High-IQ NIST researcher is high-IQ, which explains why one of us works at NIST while the other lives in a cabin in the woods. So... you're absolutely right. I think. I mean, that's the safe bet here, isn't it? See also our respective workplaces.
As you astutely observe, the subset of the beartype.door
API that implements our partial ordering over type hints (e.g., <=
, is_subhint()
) does miraculously appear to order both mapping and set type hints correctly. I publicly confess that I don't know how any of that works, because Harvard microscopist @tlambert03 kindly did everything for us. We consider that a win.
Likewise, the subset of the beartype.door
API that implements runtime type-checking (e.g., die_if_unbearable()
, is_bearable()
) does not yet deeply type-check mappings or sets. But you cleverly realized Plum can kludge around that, because Plum knows the type hint it expects a priori. Let k
be the nesting height of any type hint (e.g., k == 3
for typing.AbstractSet[typing.AbstractSet[int]]
). Then Plum can just repeatedly bang is_bearable()
against each of the k
possibly nested containers matched by that known type hint in a tight inner loop with time complexity O(k)
. Since we expect k
to be really small for real-world type hints, that's basically O(1)
.
Hello, @leycec and @tbsexton! I'm very sorry for the radio silence on my part. Between wrapping up the PhD and starting a full-time job I've not managed to find time for everything I've been wanting to do, but I'm finally slowly finding a balance.
@leycec @wesselb So --- and this might be totally naive --- do we need a type_of function to operate in the first place? Can we just... solve the inverse problem?
You're onto something here! In fact, I think this is a really good idea. As you say, this completely avoids the need to implement a function type_of
. And with @beartype to do the type checking, this would be really fast too!
Your proposal has made something really clear to me. In Python, in many practical scenarios, the type of an object is not sufficient to appropriately perform dispatch. That is, whereas we would hope that
isinstance(x, Type) == issubclass(type(x), Type)
this identity is more than often not true. This is made really clear by #66, which is a feature request for support for literal types.
This insight makes me believe that your proposed approach would be a strictly better way of doing things, if not for one downside: caching. Since the type of an object cannot be relied upon to find the right method, caching would not be possible. I would therefore propose the following design, which attempts to combine the best of both approaches.
Let us call a type Type
faithful if the identity
isinstance(x, Type) == issubclass(type(x), Type)
holds true. If all types in the dispatch tree (I guess we could use this terminology?) for a function are faithful, then dispatch only depends on the types of the arguments, which means that caching is possible again. Therefore, whenever all methods of a function use only faithful types, fast method lookup is possible. This gives the user the ability to (1) only use faithful types and get optimal performance or (2) use more complicated types at the cost of a small performance penalty. I think (1) is very important for functions which are called very often.
I think performing dispatch only based on isinstance
checks would resolve a lot of issues. To be honest, I'm so convinced that I'm going to start prototyping right away! Just tagging @PhilipVinc here too since he's been closely involved in the development.
It's germane for me to admit at this critical juncture that @beartype currently does not deeply type-check nearly as much as everyone believes it does:
The reason being I haven't found a spare hot minute to throw at deep type-checking.
What @beartype currently already does is amazing! I'm absolutely sure we can find a workaround for deep type checking that suffices in the vast majority of use cases. @tbsexton's proposal seems like an excellent way to go about this in a first iteration.
Still digesting, but lemme point out something real quick:
If all types in the dispatch tree (I guess we could use this terminology?) for a function are faithful, then dispatch only depends on the types of the arguments, which means that caching is possible again.
So there's two separate "trees" in my proposal. One is the dispatch tree:
- I have some type, so what is the correct function?
This is where your faithful vs. not faithful thing is most relevant, because if I have a partial order over types, I can guarantee a cacheable dispatch function (dispatch only depends on types). So-called unfaithful types are ones that break the partial order somehow... In your case given, they aren't subclasses.
BUT the beartype.door api solves that. Theoretically, every possible type hint is now fully "faithful", as long as we a) have the type hint, and b) use the TypeHint wrapper with the <=
/is_subhint
tool. Beartype "solves" the faithful caching problem for us. They've built the first tree.
But now I've assumed I know the type, so...
- If I have some input, what is its type?
This is where we really don't have a good answer in general. Instead we have a nearly perfect chance of answering "is it a subtype of this other type (yes/no)" using beartype is_bearable
on the kinds of things normal humans want to actually dispatch over.
So we construct a binary search tree of TypeHints we know/care about (which we can make really easily using the first tree via is_subhint
) to ask beartype "good" yes or no questions (is my input an object
? Great! Is it an....) to approximate our missing type_of
in pretty much constant time.
So, two separate trees. Technically, the second is a tree the dispatcher builds for us from the first one, since we already get the first one, for free, from beartype partial order. Because they're the same tree, but we have to find clever subsets of the (infinite?) first one to avoid making type_of
BUT the beartype.door api solves that. Theoretically, every possible type hint is now fully "faithful", as long as we a) have the type hint, and b) use the TypeHint wrapper with the <=/is_subhint tool. Beartype "solves" the faithful caching problem for us. They've built the first tree.
What about a method f(x: Literal[1])
? If x
were 1
, then its type would be int
. However, knowing that its type is int
is not enough to establish that the method applies to 1
. This is the prime example of an "unfaithful" type that I have in mind. Moreover, even if we know that the method applies to 1
, e.g. using is_bearable
, then we cannot use the type of 1
to perform caching, because the method should only apply to 1
s and no other int
s.
So-called unfaithful types are ones that break the partial order somehow... In your case given, they aren't subclasses.
Hmmm, as far as I can see, Literal[1] would not necessarily break the partial order, right? I believe TypeHint(Literal[1]) <= TypeHint(int) would hold true!
So we construct a binary search tree of TypeHints we know/care about (which we can make really easily using the first tree via is_subhint) to ask beartype "good" yes or no questions (is my input an object? Great! Is it an....) to approximate our missing type_of in pretty much constant time.
Yes! This is super nice and makes a whole lot of sense!
So, two separate trees.
Right, gotcha! I think I understand the distinction you're making.
OH I gotcha. Ok so in this case, I assumed we would have an ambiguity resolution precedence. Either the order the dispatch rules were defined in the code, or (my preference) the most specific applicable rule always gets applied.
E.g.
method f(x: Literal[1])? If x were 1, then its type would be int
Which is only true if int
is in the dispatcher somewhere else (e.g. method f(x: int)
is also defined) and it takes precedence. If the user hasn't defined f(x: int)
then the "second tree" would only be able to resolve to Literal[1])
for x=1, and fail for other int.
If they have defined both, I think the search tree should continue as far down as it can, to obtain the most specific/narrow applicable type. Similar to how classes work, just simpler with beartype resolution checking. Theoretically this would be cacheable (I hope!)
the most specific applicable rule always gets applied.
I think this is what should happen! (And how the current design works.)
Hmm, just to be sure we're on the same page, let me walk through a slightly more elaborate example.
Consider the code
from numbers import Number
from typing import Literal
from plum import dispatch
@dispatch
def f(x: str):
    return "str!"

@dispatch
def f(x: Number):
    return "number!"

@dispatch
def f(x: int):
    return "int!"

@dispatch
def f(x: float):
    return "float!"

@dispatch
def f(x: Literal[1]):
    return "the literal one!"
This registers five methods for the function f
, with the following five signatures: (str,)
, (Number,)
, (int,)
, (float,)
, and (Literal[1],)
. Using beartype.door
, these signatures may be arranged into the following two directed acyclic graphs (DAG):
(str,)

(Number,) -> (int,) -> (Literal[1],)
    |
    v
(float,)
Note that Literal[1]
should indeed satisfy TypeHint(int) >= TypeHint(Literal[1])
. Also note that Literal[1]
is unfaithful, because
>>> isinstance(1, Literal[1])
True
>>> issubclass(type(1), Literal[1])
False
We desire the following behaviour:
>>> f("1")
'str!'
>>> f(1)
'the literal one!'
>>> f(1j)
'number!'
>>> f(1.0)
'float!'
>>> f(2)
'int!'
>>> f(object())
MethodNotFoundError: cannot find an appropriate method
To determine the right method, there are two ways of doing this:

isinstance-Based Method

Whenever f(x) is called, start at the top of the first DAG. Call the root node n.

1. If not isinstance(x, n), reject the DAG, proceed to the next DAG, and repeat step 1. Otherwise, proceed to the next step.
2. For each child c of n, check isinstance(x, c). If there are no matches, the method corresponding to n is appropriate for this DAG. If there is exactly one match, name this child n and repeat step 2. If there is more than one match, return an ambiguity error.

If there is an appropriate method for one and exactly one DAG, we have successfully found the right method to run!
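The isinstance-based walk can be sketched in a few lines (hypothetical data structures; the Literal node is omitted because plain isinstance rejects subscripted Literal, where is_bearable would be used instead):

```python
from numbers import Number

# Each DAG node is a (type, children) pair; deeper nodes are more specific.
int_node = (int, [])
dags = [
    (str, []),
    (Number, [(float, []), int_node]),
]

def resolve(x, dags):
    for root in dags:
        if not isinstance(x, root[0]):
            continue                        # step 1: reject this DAG
        node = root
        while True:                         # step 2: descend into children
            matches = [c for c in node[1] if isinstance(x, c[0])]
            if not matches:
                return node[0]              # no deeper match: node wins
            if len(matches) > 1:
                raise TypeError("ambiguous method")
            node = matches[0]
    raise LookupError("cannot find an appropriate method")

print(resolve(1, dags))     # int
print(resolve(1.0, dags))   # float
print(resolve("1", dags))   # str
```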
type(x)-Based Method

If all types are faithful, then every call isinstance(x, n) can be replaced with issubclass(type(x), n) (or TypeHint(type(x)) <= TypeHint(n)). The big consequence of this is that the algorithm only depends on type(x). Therefore, once we run the algorithm, we can associate the resulting method to type(x), and run this method for any future arguments of this type! In other words, we can do caching.
If the types are not all faithful, then I'm afraid that there is no such way to do caching, because the outcome of isinstance(x, n)
can only be determined by actually running this instance check. Therefore, the isinstance
-based method must be run for all invocations of f
.
Side note: If some of the types in the type tree are unfaithful, then caching may still be possible. Namely, if, for a particular input x
, the isinstance
-based method happens to only come across faithful nodes, then caching should still be valid!
Another side note: By allowing the user to make type_of
smarter, more types can be rendered faithful. For example, if type_of(x) = Literal[1] if x is 1 else type(x)
, then isinstance(x, Literal[1]) == issubclass(type_of(x), Literal[1])
. In this way, by making type_of
smarter, the user would be able to gradually improve caching performance.
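That "smarter type_of" idea can be sketched with stdlib typing tools (all helper names here are hypothetical, not Plum's API):

```python
from typing import Literal, get_args, get_origin

def smarter_type_of(x):
    # Hypothetical sketch: return Literal[1] for the value 1, else type(x),
    # thereby rendering Literal[1] faithful for caching purposes.
    if type(x) is int and x == 1:     # excludes True, whose type is bool
        return Literal[1]
    return type(x)

def is_subtype(sub, sup):
    # Minimal issubclass stand-in that understands Literal on the left
    # (plain issubclass rejects subscripted Literal outright).
    if get_origin(sub) is Literal:
        return all(isinstance(v, sup) for v in get_args(sub))
    return issubclass(sub, sup)
```

With these, isinstance(x, Literal[1])-style checks and subtype checks on smarter_type_of(x) agree, which is exactly the faithfulness identity from the text.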
Let me work through two examples.
We call f(1)
. isinstance(1, str)
is false, so we reject the first DAG. isinstance(1, Number)
, however, is true. Now isinstance(1, float)
is false and isinstance(1, int)
is true, so we proceed to the node int
. Finally, isinstance(1, Literal[1])
is also true, and Literal[1]
is a terminal node, so we return the method corresponding to Literal[1]
.
We call f(1.0)
. isinstance(1.0, str)
is again false, so we reject the first DAG, and isinstance(1.0, Number)
is again true. Now isinstance(1.0, float)
is true and isinstance(1.0, int)
is false, and float
is a terminal node, so we return the method corresponding to float
.
Does this also correspond to what you have in mind?
Ahhh I got you. Ok, I have some deadlines here but very briefly let me clarify what parts I would hope are (ostensibly) cacheable:
First, I don't think we can ever really hope to use the native type()
for anything worthwhile. It's kind of a nightmare. BUT it is true that
```python
>>> issubclass(type(1), Literal[1])
False
```
since a Literal isn't a subclass of int. However, I'm actually imagining a compilation and 2x passes here:

2. Not isinstance, but is_bearable, which runs via beartype in O(1)-ish time for every query.
3. Once you have a plumish_type(x) -> TypeHint, you can in theory directly use that type to match something from the dispatch: dispatcher: TypeHint -> Callable. You don't really need caching here either since everything in the tree corresponds to something you wanted to dispatch on.

OK so now that I've said that, I think both 2 and 3 are still cacheable. They are (pure) functions, since every input will deterministically end at either a type in the tree or fail if no type is available. But as you note, you wouldn't be caching on types in 2, only in 3. But if you think it's needed (e.g. if a user passes the same x over and over such that memoizing on x rather than type(x) is worth it), you could have a kwarg or something that enables that behavior.
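The caching-on-`type(x)` idea above can be sketched with a memoized resolver. The method table and `resolve` below are hypothetical; a real dispatcher would order registered signatures by specificity rather than hand-sorting them:

```python
import functools
from numbers import Number

# Registered methods, most specific first.
methods = [
    (str, lambda x: "str method"),
    (float, lambda x: "float method"),
    (int, lambda x: "int method"),
    (Number, lambda x: "Number method"),
]

def resolve(cls):
    # The expensive walk: first registered type that cls is a subclass of.
    for typ, fn in methods:
        if issubclass(cls, typ):
            return fn
    raise LookupError(cls)

# Memoize on the concrete class, so the walk runs once per type(x).
cached_resolve = functools.lru_cache(maxsize=None)(resolve)

def f(x):
    return cached_resolve(type(x))(x)
```

After the first call with an int, every later `f(2)`, `f(3)`, ... is a dict lookup plus a function call; the walk never reruns for that class.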
I generally think that relying on any kind of type(x)
for python is going to land us in a world of pain. Exhibit-A: Enum
. While folks over at, say, pydantic are using Enum as pseudo tagged unions (like, yknow, the sum-types we all wish we had)[^2], it turns out that dispatching on enums (how we might want to) becomes horrifyingly impossible, given that the type()
of an enum member is always set as... enum :roll_eyes:. This is bad. Or at least, it was before beartype's from beartype.cave import EnumType, EnumMemberType
.
SO yeah I think the isinstance()
method is pretty much there, but we want to change that to is_bearable
, and, it's only after we've narrowed the search space down such that caching is, for all intents and purposes... optional?
[^1]: as a sidenote, if users want to do numeric type checks, I would direct them to the (beartype-fueled) numerary since they're another project running into type issues and trying to solve them :package:
[^2]: @leycec I will say, if you're looking for feature requests... damn I want a clean syntax for sum types/tagged union so. bad.. I don't want to have to add a janky assert_never
into my code to do exhaustiveness checking. I want clean algebraic types, and I want them bearable, damnit! :smile:
> I don't think we can ever really hope to use the native type() for anything worthwhile. It's kind of a nightmare.
Yes. type()
is appropriate for scaring apprentice child software engineers at Halloween, but otherwise inappropriate. With respect to type hints, type()
lies, because many type hints and other objects of interest (e.g., enumeration members) lie about their types; the remainder have no public types. Some objects even have entirely different types depending on whether you're running under CPython or PyPy, which is worse than a lie. That's squabbling amongst parents who insist they love you.
This makes my eye twitch even more spasmodically than it usually does:
```python
>>> issubclass(type(1), Literal[1])
False
```
I hear the Jaws theme when I see horrors like that.
> is_bearable, which runs via beartype O(1)-ish time for every query.
I see your "O(1)
-ish time" and raise you a table flip.
My bald-faced lies are hard to believe, but all of the runtime type-checking performed by @beartype (including is_bearable()
) is strictly O(1)
. I will eat my moth-eaten Merino sweater live on Gutter Gitter if this is not the case.
Okay. It's actually O(k)
for k
the nesting depth of a type hint (e.g., k == 2
for list[int]
). But k < 5 in almost all real-world cases. My lies live to see another day.
pydantic
Friends don't let friends pydantic. I kid, of course! Pydantic is probably great, at least as a temporary stopgap until @beartype gracefully absorbs pydantic's feature set into itself like some gurgling tinfoil amoeba from Mystery Science Theatre 3000.
> pydantic are using Enum as pseudo tagged unions
...this is not a thing people should be doing:
```python
from dataclasses import dataclass
from enum import Enum

# Horrible sum type is horrible.
@dataclass
class SumType:
    kind: Enum
    data: object

# Rando enumeration is rando.
Token = Enum('Token', ['Number', 'Operator', 'Identifier', 'Space', 'Expression'])

muh_horrible_sum_type = SumType(Token.Number, 42)
```
That's awful, as both the SumType.kind
and SumType.data
fields are basically untyped. But that's not nearly as awful as the official example mypy
gives:
```python
from typing import Literal, TypedDict, Union

class NewJobEvent(TypedDict):
    tag: Literal["new-job"]
    job_name: str
    config_file_path: str

class CancelJobEvent(TypedDict):
    tag: Literal["cancel-job"]
    job_id: int

Event = Union[NewJobEvent, CancelJobEvent]

def process_event(event: Event) -> None:
    # Since we made sure both TypedDicts have a key named 'tag', it's
    # safe to do 'event["tag"]'. This expression normally has the type
    # Literal["new-job", "cancel-job"], but the check below will narrow
    # the type to either Literal["new-job"] or Literal["cancel-job"].
    #
    # This in turn narrows the type of 'event' to either NewJobEvent
    # or CancelJobEvent.
    if event["tag"] == "new-job":
        print(event["job_name"])
    else:
        print(event["job_id"])
```
Seriously? Expensive O(n)
string matching just to narrow conditional branches? What is this burning trash pile you are badly recommending to long-suffering data scientists such as ourselves, mypy? :face_exhaling:
> beartype's from beartype.cave import EnumType, EnumMemberType.
Gah! You have discovered all my undocumented secrets. I... I was pretty sure nobody even knew about that API. You're like some kind of GitHub bloodhound, sniffing out the bitrot and skeletons in my commit history. This is why you work at NIST.
> I don't want to have to add a janky assert_never into my code to do exhaustiveness checking.
You're in luck then! In addition to being stupidly faster than prior CPython versions, the just-released CPython 3.11.0 publishes an official typing.assert_never()
function. It's still a bit silly, but at least you no longer need to jank your own codebase with it.
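For reference, a minimal exhaustiveness check with it looks like this. The enum and function are made up for illustration, and the fallback shim covers Pythons older than 3.11:

```python
from enum import Enum
import sys

if sys.version_info >= (3, 11):
    from typing import assert_never
else:
    def assert_never(value):
        # Minimal runtime stand-in for typing.assert_never.
        raise AssertionError(f"Unhandled value: {value!r}")

class Color(Enum):
    RED = 1
    BLUE = 2

def describe(c: Color) -> str:
    if c is Color.RED:
        return "warm"
    elif c is Color.BLUE:
        return "cool"
    else:
        # Static checkers flag this branch if a member is unhandled;
        # at runtime it raises if ever reached.
        assert_never(c)
```

Adding a third `Color` member makes mypy/pyright complain at the `assert_never(c)` line until the new case is handled.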
> @leycec I will say, if you're looking for feature requests..
No. No! Oh, please GitHub Gods above (...so, Satya Nadella), no!!! I think I may be hyperventilating here.
> I want a clean syntax for sum types/tagged union so. bad..
Yes! Do this for me, please. Do everything for me and then I will make this happen. Actually, this leads to an interesting factotum I just recalled. Did you know that typing.Literal[...]
:
Combining those two realizations, we can exhaustively type enumeration members:
```python
from enum import Enum
from typing import Literal, assert_never

from beartype import beartype

# Rando enumeration is rando.
Token = Enum('Token', ['Number', 'Operator', 'Identifier', 'Space', 'Expression'])

TokenKind = Literal[
    Token.Number,
    Token.Operator,
    Token.Identifier,
    Token.Space,
    Token.Expression,
]
TokenTypes = int | str | OTHER_STUFF_I_GUESS

@beartype
def muh_tokenizer(token: TokenTypes, token_kind: TokenKind):
    match token_kind:
        case Token.Number:
            ...
        case Token.Identifier:
            ...
        # blah-blah-blah!
        case _:
            assert_never(token_kind)
```
@beartype supports that already. I think. No quotes, please.
Of course, that's just a really bad form of multiple dispatch. Plum would solve the same quandary considerably more elegantly, assuming typing.Literal
support: e.g.,
```python
from enum import Enum
from plum import dispatch
from typing import Literal

# Rando enumeration is rando.
Token = Enum('Token', ['Number', 'Operator', 'Identifier', 'Space', 'Expression'])

@dispatch
def muh_tokenizer(token: int, token_kind: Literal[Token.Number]): ...

@dispatch
def muh_tokenizer(token: str, token_kind: Literal[Token.Space]): ...
```
Assuming @plum.dispatch
internally reduces to something like @typing.overload
when statically typing.TYPE_CHECKING
, the above might even be statically type-checkable. Maybe. Or maybe a mypy-specific plugin that doesn't work under any other static type-checker would need to be authored. I shudder.
But you probably want something more Rust-like, right? Like, a full-blown SumType
data structure that you can then stuff into monstrous red-black tree nodes and algebraically reason about with both static and runtime type-checkers. Static type-checking support probably requires a full-blown PEP. Runtime type-checking support requires someone who is not me to have free time and submit a mostly untested, undocumented, non-working PR to @beartype. Yet again, I shudder.
> I would direct them to the (beartype-fueled) numerary since they're another project running into type issues and trying to solve them :package:
@posita: People love numerary
. You're famous. Happy Halloween. :jack_o_lantern:
@tbsexton, I should've clarified that in all instances of isinstance
I mean to use is_bearable
and in all instances of issubclass
I mean to use TypeHint
and <=
. Note that, with TypeHint
, issubclass(Literal[1], int)
is true:
```python
>>> TypeHint(Literal[1]) <= TypeHint(int)
True
```
> determine the type of an argument, assuming the only available types are the ones the dispatcher explicitly cares about.
Isn't this the function type_of
that we're trying to avoid? Unless I'm misinterpreting, I believe that we're all in agreement that this is most likely not the right way forward.
Ok then! I think we've converged onto an approach that should generally work: the isinstance
-method implemented with is_bearable
. The best way to do caching, however, still remains to be figured out. I think that's enough for me to get to work!
I was not aware of numerary
. @posita, that's really nice work!!
I was also not aware of tagged unions, and I think I'd rather stay unaware. That's ugly.
@wesselb spitting hard truths as always. I applaud this timely advice, too:
> I was also not aware of tagged unions, and I think I'd rather stay unaware. That's ugly.
@wesselb
> Isn't this the function type_of that we're trying to avoid? Unless I'm misinterpreting, I believe that we're all in agreement that this is most likely not the right way forward.
Right, sorry, let me fix that sentence:
> determine the type of an argument, given only the available types our runtime dispatcher explicitly cares about.
rather than:
> determine the type of an argument out from the set of possible python and/or user-defined/imported types.
The second is the bad one. But a function that goes type_of(x:Any) -> typ:TypeHint
is going to be necessary. My point was that we can do the "caching"/time-savings ahead of time by restricting the space we search for typ
in, by using what the dispatcher was told x
is expected to be, and organizing it into a useful hierarchy (search tree).
As for why sum types (discriminated/tagged unions): it's actually a way we can do dispatching for a bunch of complicated types a priori, without needing to validate or check which subtype gets dispatched to. See this tidbit from the pydantic docs:
> When Union is used with multiple submodels, you sometimes know exactly which submodel needs to be checked and validated and want to enforce this. To do that you can set the same field - let's call it my_discriminator - in each of the submodels with a discriminated value, which is one (or many) Literal value(s). For your Union, you can set the discriminator in its value: Field(discriminator='my_discriminator').
>
> Setting a discriminated union has many benefits:
>
> - validation is faster since it is only attempted against one model
> - only one explicit error is raised in case of failure
> - the generated JSON schema implements the associated OpenAPI specification
So what I imagined here was a more general "validate()" function that is effectively doing, you guessed it, multiple dispatch, where I don't have to do a bunch of try/except
or if isinstance()
blocks... it just knows. Pydantic achieves this with Literal
tags as metadata, ergo "tagged union". It's a janky solution, but I've used it to pretty great effect for speeding up code and reasoning about types. I'll be honest here, part of what drew me to plum and beartype was a way I could do similar things on, say, TypedDict
or NamedTuple
or dataclass
without selling my codebase's soul to the pydantic dependency gods :sweat_smile:.
So this is all a bit of a tangent, right? Well... it's connected, I promise! I came across a brilliant SO comment chain that illustrates what I'm talking about. With "Untagged", i.e. normal typing.Union types:

> In other words, if you have an Int ∪ Double, then you'd know you have one of the two, but wouldn't be able to pattern match to see which, so you could only do things that would be valid to do for both possibilities.
> But hang on, how do we seem to get along fine without tagged types in other languages?

> For clarity, TypeScript has type information available at runtime, so it has a typeof operator that can make up for the lack of tagging and see which type is used anyway. Haskell is fully type-erased, so if it supported this feature, then there wouldn't be any equivalent to that.
SO we've come full circle! Hah! The entire issue here is that in reality python has what amounts to no real support for typeof
that lets us know what to do with e.g. generics and unions, etc. For all intents and purposes, we can (only in the context of a dispatcher, mind you), do a typeof_NEAREST_NEIGHBOR
that is super fast. That's the essence of my original proposal. But IMO it would be awesome to have "real" sum types/tagged unions/ADT's, so that we can treat python as type-erased the way that (apparently) Guido wanted us to :stuck_out_tongue:. Then something like e.g. Plum would get a huge performance boost when it knew it was dispatching over ADTs/sum types
@leycec so this is (kinda) getting out of scope here, so I might move this, but...you mention
> we can exhaustively type enumeration members [...] Of course, that's just a really bad form of multiple dispatch. Plum would solve the same quandary considerably more elegantly, assuming typing.Literal support
So yeah, and in reality pattern matching can kinda do this to an extent. And plum is definitely the way to do sum type destructuring, IMO. But my issue is with sum type construction, so what bothers me is the need for these annoying little (user-defined) tags, such that no two libraries implementing sum types will ever really have a common protocol. See: the pydantic thing above, or the janky mypy assert_never
as seen here (and in your example), or even better, the wikipedia entry for tagged union implementation in python :roll_eyes:.
SO what I'm thinking is... beartype could (probably?) be... coaxed... into emulating type tags (and therefore, ADTs muahahaha). After all, we're pretty much talking about one more kind of type arithmetic that needs to be supported (the disjoint union), and the new TypeHint object seems like a perfect candidate for representing something's true type information as metadata.
> The second is the bad one. But a function that goes type_of(x:Any) -> typ:TypeHint is going to be necessary. My point was that we can do the "caching"/time-savings ahead of time by restricting the space we search for typ in, by using what the dispatcher was told x is expected to be, and organizing it into a useful hierarchy (search tree).
Is a function type_of
truly necessary? I think I previously convinced myself that we can get by entirely without a function type_of
!
More specifically, the previously outlined isinstance
-method (i.e. is_bearable
) only uses isinstance
and the order by TypeHint
and completely avoids all use of type_of
. type_of
only comes into play when we need to do caching; for correct functioning, I believe we don't need a type_of
at all.
> That's the essence of my original proposal. But IMO it would be awesome to have "real" sum types/tagged unions/ADT's, so that we can treat python as type-erased the way that (apparently) Guido wanted us to :stuck_out_tongue:.
What would an ideal solution to this problem look like to you? I.e., what would you really like the code to look like?
> For all intents and purposes, we can (only in the context of a dispatcher, mind you), do a typeof_NEAREST_NEIGHBOR that is super fast.
I think I'm becoming increasingly convinced that we might not be able to do an implementation of type_of
that works correctly in all cases without manually specifying how to handle more complex types, even when only considering signatures for a particular function. For example, if we have methods f(x: int)
, f(x: Addable)
(Addable
the type of all objects implementing __add__
), f(x: Literal[1])
, some complicated logic may be required to infer that type_of(1.0)
should give something that is a subtype of Addable
.
> @posita: People love numerary. You're famous. Happy Halloween. :jack_o_lantern:
Perhaps technically correct in that more than one person qualifies as "people" (shout out to @tbsexton), but if I'm famous, someone forgot to tell me. Maybe I should contact the Department of Internet Money and see how many millions I'm entitled to. :wink:
> I was not aware of numerary. @posita, that's really nice work!!
Thanks! And by "nice work" I will take it to mean the type that never should have been required in the first place and should be obsoleted by the standard library as quickly as possible. The bad news is that we're not there yet. The good news is that there's a discussion, if that's something that interests you. The bad news is that it has kind of stalled. The good news is ... hang on, I'm thinking ... I've been spending some casual time wrestling with what an algorithmic generic might look like. The bad news is that I haven't gotten very far, probably because it's hard and because I've been distracted by other things that life has a way of throwing at one wrestling with tough problems without the benefit of complete creative freedom afforded by elective solitude and inherited financial independence. Alas, I fear I will never qualify for such niceties, and must make do with what I do have (and for which I am grateful). Patience is humbly appreciated! :bow:
@wesselb For that last example, "stuff that implements __add__
" is precisely this protocol:
```python
import typing as T

class Addable(T.Protocol):
    def __add__(self, other):
        ...
```

```python
>>> TypeHint(T.Literal[1]) < TypeHint(int) < TypeHint(Addable)
True
```
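Worth noting: if the protocol is additionally decorated with `typing.runtime_checkable`, the plain isinstance-based walk can use it directly, no `type_of` required. A small sketch, independent of beartype:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Addable(Protocol):
    def __add__(self, other): ...

# isinstance now checks structurally for the presence of __add__.
assert isinstance(2, Addable)
assert isinstance("hi", Addable)
assert not isinstance(object(), Addable)
```

This only checks that the method exists, not its signature or semantics, which is exactly the caveat that makes such protocols unfaithful for caching.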
Even though we're in structural-typing land now, beartype.door
is working pretty great.
If all of those typehints were registered with a dispatcher, we could still strictly choose the "smallest" one that qualifies: a 1
goes to T.Literal[1]
, a 2
goes to int
, and 'hi'
goes to Addable
.
> What would an ideal solution to this problem look like to you? I.e., what would you really like the code to look like?
I suppose I should start working on this :sweat_smile: Gimme a little bit, I'll see if I can sketch some things out.
@tbsexton I fully agree that DOOR works beautifully here and gives us the order that we're after! (@leycec, this really is fantastic!) My point was that implementing a function typeof in this case is like solving the dispatch problem in the first place. That is, the logic to determine that 2 should go to int is in this case equivalent to determining that f(2) should dispatch to the method implemented by int. But perhaps we're in agreement! I should have a prototype working soon(tm).
@posita That's certainly something I'm interested in! It's exciting to see that there is discussion around this. I can fully imagine that what you're after is a tall ask. :)
PS: The GitHub mobile website doesn't allow one to react to posts with emojis? That's a bummer.
Hi, thanks for the nice project!
I found that _types_of_iterable can be very slow when the iterable is very large. This could be a big problem for practical applications. Is there any solution for this?