Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:
Other
243 stars 88 forks source link

Typeguard error on multi-level GroupBy + .aggregate #858

Closed reuster986 closed 3 years ago

reuster986 commented 3 years ago

Reproducer:

keys = [ak.randint(0, 10, 100), ak.randint(0, 10, 100)]
g = ak.GroupBy(keys)
g.min(ak.randint(0, 10, 100))

[stack trace involving typeguard functions]

NameError: name 'Categorical' is not defined

This error only triggers when more than one array is passed to GroupBy and then an aggregation method is called. I'm not sure the unit tests cover that case.

How do we handle a forward reference like Categorical with typeguard?

Is there a way to turn off runtime type checking by default for users? They should not have to deal with type "errors" on functions that actually work fine. The typeguard documentation says you can turn it off by enabling optimization in the python interpreter, but it's not clear how to do that in a jupyter notebook.

@hokiegeek2 @glitch @pierce314159 I know I wrote the code that introduced this bug, but could one of you take on this issue, since it is a major blocker with no workaround, and since I don't have any telework scheduled this week? Thanks in advance...

hokiegeek2 commented 3 years ago

@reuster986 we will rock this out ASAP. @pierce314159 I am gonna take a look as well right now.

reuster986 commented 3 years ago

Thanks, guys! I'll take a look at the PR now.

stress-tess commented 3 years ago

Is there a way to turn off runtime type checking by default for users? They should not have to deal with type "errors" on functions that actually work fine. The typeguard documentation says you can turn it off by enabling optimization in the python interpreter, but it's not clear how to do that in a jupyter notebook.

Main Takeaway: @reuster986 I did some looking into this and seems like users need to run %env PYTHONOPTIMIZE=1 in their jupyter notebooks to disable typechecking. This disables all assert statements and code dependent on __debug__

Random details :) I think you're referring to this Typeguard Issue: [Disable type checking on production](https://github.com/agronholm/typeguard/issues/54), which says > Type checks can be fairly expensive so it is recommended to run Python in "optimized" mode (python -O or setting the PYTHONOPTIMIZE environment variable) when running code containing type checks in production. The optimized mode will disable the type checks, by virtue of removing all assert statements and setting the __debug__ constant to False. To set the `PYTHONOPTIMIZE` environment variable in a jupyter notebook [`%` magic commands](https://stackoverflow.com/questions/37890898/how-to-set-env-variable-in-jupyter-notebook/47326270) seem to work So `%env` will print out current environment variables and `%env MY_VAR=MY_VALUE` or `%env MY_VAR MY_VALUE` can be used to set them According to the [python docs](https://docs.python.org/3/using/cmdline.html#cmdoption-o), `PYTHONOPTIMIZE` has three possible values: 0, 1 (`-O`) or 2 (`-OO`). 0 is the default, nothing happens. `%env PYTHONOPTIMIZE=0`: - Is the same as default, nothing happens `%env PYTHONOPTIMIZE=1` (equivalent to `-O`): - Removes assert statements and any code conditional on the value of `__debug__` - Augments the filename for compiled (bytecode) files by adding .opt-1 before the .pyc extension `%env PYTHONOPTIMIZE=2` (equivalent to `-OO`): - Does everything `-O` does - Discards docstrings - Augment the filename for compiled (bytecode) files by adding .opt-2 before the .pyc extension
reuster986 commented 3 years ago

@pierce314159 Awesome! TIL... thank you!!

stress-tess commented 3 years ago

@reuster986 hahaha don't thank me yet 😬

I was sanity checking that this works and I'm having some issues. I undid the hotfix and was testing your reproducer

>>> os.system('echo $PYTHONOPTIMIZE')
>>> %env PYTHONOPTIMIZE=1
env: PYTHONOPTIMIZE=1
>>> os.system('echo $PYTHONOPTIMIZE')
1
>>> keys = [ak.randint(0, 10, 100), ak.randint(0, 10, 100)]
>>> g = ak.GroupBy(keys)
>>> g.min(ak.randint(0, 10, 100))
NameError: name 'Categorical' is not defined

but if I set the variable from the terminal with $ export PYTHONOPTIMIZE=1,

>>> os.system('echo $PYTHONOPTIMIZE')
1
>>> keys = [ak.randint(0, 10, 100), ak.randint(0, 10, 100)]
>>> g = ak.GroupBy(keys)
>>> g.min(ak.randint(0, 10, 100))
([array([0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8 8 8 8 8 9 9 9 9 9]),
  array([0 1 3 4 5 7 8 9 0 4 5 6 7 8 9 0 1 4 5 6 8 9 0 1 2 4 5 7 8 9 4 5 6 7 0 1 2 3 5 6 7 8 9 0 1 2 3 6 7 8 9 0 2 4 5 7 8 9 0 1 2 5 6 7 9 4 6 7 8 9])],
 array([2 5 5 7 2 1 9 9 0 4 3 7 9 2 6 8 1 8 5 3 0 1 6 9 9 8 0 6 3 2 1 2 9 5 0 6 0 8 3 2 1 1 1 4 0 5 9 4 4 2 0 1 3 7 4 3 0 8 4 1 0 1 1 0 0 5 8 5 7 9]))

I was trying to figure out how to "refresh" the environment variables and came across this which doesn't make me feel confident there's a way to do this from inside ipython/jupyter notebook

https://www.nylas.com/blog/making-use-of-environment-variables-in-python/

It’s important to remember that the settings you apply in a Python script don’t work outside that specific process; os.environ doesn’t overwrite the environment variables system-wide. If you need to permanently delete or set environment variables you will need to do so with a shell environment, such as Bash.

if the users have access to a terminal then

$ export PYTHONOPTIMIZE=1

works like a charm. I know @glitch has more experience with python environment stuff so maybe he or @hokiegeek2 know something I don't. I'm leaving a bit early today, but I'll look into this more next week

reuster986 commented 3 years ago

I'm going to close this, since the specific issue has been patched and the more general issue with typeguard has a workaround in setting PYTHONOPTIMIZE=1.