aroberge / ideas

Easy creation of custom import hooks to experiment on alternatives to Python's syntax; see https://aroberge.github.io/ideas/docs/html/

New example: support a wider range of Unicode identifiers #13

Open skirpichev opened 3 years ago

skirpichev commented 3 years ago

Python applies NFKC normalization while parsing identifiers. That rules out some fancy Unicode identifiers like ℕ (it becomes N for Python), see e.g. this. Other languages that support Unicode identifiers usually lack this "feature" and/or use a different normalization, like Julia. For example, in Scheme:

$ scheme
MIT/GNU Scheme running under GNU/Linux
...
1 ]=> (define ℕ 1)

;Value: ℕ

1 ]=> (define N 2)

;Value: n

1 ]=> ℕ

;Value: 1

1 ]=> N

;Value: 2
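For comparison, here is a minimal sketch of what Python does with the same two assignments. It only uses the standard library; the variable names are chosen for illustration.

```python
import unicodedata

# NFKC maps the double-struck ℕ (U+2115) to a plain ASCII N.
normalized = unicodedata.normalize("NFKC", "ℕ")

# Both assignments therefore end up bound to the same name.
namespace = {}
exec("ℕ = 1\nN = 2", namespace)
names = sorted(key for key in namespace if key != "__builtins__")

print(normalized, names, namespace["N"])
```

Unlike in the Scheme session, only one binding survives here: the second assignment overwrites the first.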

It's possible to "patch" this unfortunate feature with a transform_source-based transformation: parse the source into an AST, then "fix" the normalized identifiers, using lineno/col_offset/etc., into something like N_1 instead of the ℕ in the original source. This might look tricky, but I think it would fit nicely into your collection of examples: it combines AST parsing with some parsing of the original source string (i.e. with tokenize) to get the disallowed symbols back.
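A rough sketch of the detection half of that idea, assuming Python 3.8+ for ast.get_source_segment (which uses the node positions to slice the original text). The variable names are illustrative and this is not code from ideas.

```python
import ast

source = "ℕ = 1\nN = 2\n"
tree = ast.parse(source)

# Collect (normalized id, original spelling) pairs that differ.
mismatches = []
for node in ast.walk(tree):
    if isinstance(node, ast.Name):
        # The AST stores the normalized identifier; the source segment
        # located via lineno/col_offset still has the original spelling.
        original = ast.get_source_segment(source, node)
        if original != node.id:
            mismatches.append((node.id, original))

print(mismatches)
```

Only plain ast.Name nodes are inspected here; attribute names, function parameters and similar would need separate handling.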

aroberge commented 3 years ago

This is definitely an interesting example. I would likely use something like N_uuid, where uuid is a unique identifier, to avoid clashing with any existing variable, like I already do using https://github.com/aroberge/ideas/blob/72d84ec2ce49bc0807597749496a2c957f3db6cb/ideas/utils.py#L68 in some examples, but making sure that the same uuid is used for a given variable name.
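A minimal sketch of such a stable mapping, assuming a module-level cache. The names _names_map and mangled_name are hypothetical and are not the helper linked above.

```python
import unicodedata
import uuid

# Hypothetical cache: original identifier -> replacement identifier.
_names_map = {}


def mangled_name(name):
    """Return a stable replacement for a non-normalized identifier."""
    normalized = unicodedata.normalize("NFKC", name)
    if normalized == name:
        return name  # already normalized: leave untouched
    if name not in _names_map:
        # The uuid is generated once, then reused for the same name.
        _names_map[name] = f"{normalized}_{uuid.uuid4().hex}"
    return _names_map[name]


first = mangled_name("ℕ")
second = mangled_name("ℕ")
print(first == second, first.startswith("N_"))
```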

It might require patching some builtins like dir in order to show the non-normalized identifier instead of the normalized form.

To properly demonstrate this, I would likely need to "fix" the console so that it works "properly" with AST transformations: currently, one has to use print explicitly to see the value of a variable.

I program in my spare time as a hobby (I do not use programming in my work). Currently, my priority is to work on https://github.com/aroberge/friendly. However, I will try to find some time to work on this as it seems like a good way for me to learn more about AST transformations.

skirpichev commented 3 years ago

I would likely use something like N_uuid where uuid is a unique identifier

There are more efficient ways, of course, but this looks fine for an example.

It might require to patch some builtins like dir

Yes, I think so. For an industry-grade solution I would expect that the inspect module should also be altered, e.g. inspect.signature(). Maybe something else from the stdlib. But for an example, dir() is enough.

However, I will try to find some time to work on this as it seems like a good way for me to learn more about AST transformations.

If it's a good idea, in your view - I'll try to implement this.

aroberge commented 3 years ago

Please feel free to go ahead.

I'm thinking of a potentially "simpler" approach where the source is transformed at the tokenization stage, so that no AST transformation would be required; I think I could make this work ... but there might be some advantages to doing AST transformations that I cannot see due to my lack of knowledge.

My mind has been coming back to this idea while doing some other work; I definitely find this an interesting example.

skirpichev commented 3 years ago

I'm thinking of a potentially "simpler" approach where the source is transformed at the tokenization stage

I'm not sure how robust it could be. But the tokenize() does preserve "disallowed" unicode symbols, as I noted before.
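A quick way to check that claim, as a sketch using only the standard library (generate_tokens works on str input):

```python
import io
import tokenize

source = "ℕ = 1\nN = 2\n"

# Token strings are taken from the source text as written,
# so the NAME tokens keep their original spelling.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
names = [tok.string for tok in tokens if tok.type == tokenize.NAME]

print(names)
```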

skirpichev commented 3 years ago

Probably, you were right: after some playing with the token-based approach, I don't think it will break things. And it's simple, indeed:

import io
import tokenize
import unicodedata
import uuid

from ideas import import_hook

_NAMES_MAP = {}

def fix_names(source, **kwargs):
    result = []
    g = tokenize.tokenize(io.BytesIO(source.encode()).readline)
    for toknum, tokval, _, _, _ in g:
        if toknum == tokenize.NAME:
            if unicodedata.normalize('NFKC', tokval) != tokval:
                if tokval not in _NAMES_MAP:
                    _NAMES_MAP[tokval] = f'_{uuid.uuid4().hex!s}'
                tokval = _NAMES_MAP[tokval]
        result.append((toknum, tokval))
    return tokenize.untokenize(result).decode()

def source_init():
    return """
old_dir = dir
def dir(obj):
    result = old_dir(obj)
    for k, v in _NAMES_MAP.items():
        result = [_.replace(v, k) for _ in result]
    return sorted(result)
"""

import_hook.create_hook(source_init=source_init, transform_source=fix_names)

Session example:

$ python -i fix_names.py 
>>> from ideas import console
>>> console.configure(console_dict={'_NAMES_MAP': _NAMES_MAP})
>>> console.start()
Configuration values for the console:
    console_dict: {'_NAMES_MAP': {}}
    source_init: <function source_init at 0x7f61284d58b0>
    transform_source: <function fix_names at 0x7f61288da040>
--------------------------------------------------
Ideas Console version 0.0.19. [Python version: 3.9.2]

~>> class A:
...     ℕ = 1
... 
~>> A.N
Traceback (most recent call last):
  File "IdeasConsole", line 1, in <module>
AttributeError: type object 'A' has no attribute 'N'
~>> A.ℕ
1
~>> dir(A)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
'__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__',
'__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'ℕ']

(BTW, maybe locals() should be added to the console_dict per default?)

aroberge commented 3 years ago

Very nice! Your suggestion of adding locals() by default makes sense; this is essentially what I do with another project (friendly).

In addition to modifying dir(), one would probably need to modify vars(), and perhaps locals() and globals(). Tracebacks might be tricky to decipher unless they are decoded as well.
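As a sketch of what such decoding could look like, assuming a mapping from original names to replacement names like the one above. Both helper names and the sample mapping are hypothetical.

```python
def decoded_namespace(namespace, names_map):
    """Return a copy of a vars()/globals()-style dict with original names."""
    reverse = {mangled: original for original, mangled in names_map.items()}
    return {reverse.get(key, key): value for key, value in namespace.items()}


def decode_text(text, names_map):
    """Replace mangled names in a message, e.g. a formatted traceback."""
    for original, mangled in names_map.items():
        text = text.replace(mangled, original)
    return text


# Hypothetical mapping, with a shortened suffix for readability.
names_map = {"ℕ": "N_abc123"}
print(decoded_namespace({"N_abc123": 1, "N": 2}, names_map))
print(decode_text("NameError: name 'N_abc123' is not defined", names_map))
```

Plain string replacement is only safe here because the uuid suffix makes accidental matches very unlikely.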

I'm thinking that this example should be included as "extended_unicode". I'll try to do this tomorrow and perhaps write a blog post about it, giving you full credit for the idea and implementation. I like it: it is very much in the spirit of what I had in mind when I created this project.

skirpichev commented 3 years ago

In addition to modifying dir(), one would probably need to modify vars(), and perhaps locals() and globals(). Tracebacks might be tricky to decipher unless they are decoded as well.

Maybe. Or we can just mention other pitfalls of this approach: this is an example, right? But dir() might not be the best choice to illustrate a possible solution.

I'm thinking that this example should be included as "extended_unicode".

I've added this transformer as the unicode_identifiers() function, because it allows any Unicode string as an identifier (again, this is probably not a good idea for professional code: perhaps Julia-like normalization is more suitable for math). But I'm not good at naming, anyway.

I'll try to do this tomorrow and perhaps writing a blog post about it, giving you full credit for the idea and implementation. I like it: it is very much in the spirit of what I had in mind when I created this project.

Thank you. I was planning to finish a PR for this, but if you have time to do this yourself (better naming, dir() & co workarounds and, especially, tests were tricky for me) - probably it would be better.

aroberge commented 3 years ago

I've uploaded a new version to PyPI. I made a few relatively minor changes to your code.

  1. Normally, dir() can work with no arguments: in this case, it shows some "interesting names" from the local scope. The revised dir() required an object to be passed to it. I could not get this to work reliably when trying to redefine it so that it would show the same thing as dir(). What I did instead was to define a new function, called ndir(), as demonstrated below.
  2. I changed the name of the mapping dictionary to __NAMES_MAP so that it starts with two leading underscores. The idea is to be able to filter out names that start with a double underscore, so as to more easily compare the results of dir() and ndir().
  3. I called this example "unnormalized_unicode.py" which, I think, sums up better what the idea is. Perhaps there is yet a better name.
  4. I made a few other changes so that the console does not have to be explicitly configured.
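For the no-argument case mentioned in item 1, one possible approach is to inspect the caller's frame. This is only a sketch and not the ndir() shipped with ideas: it relies on sys._getframe, which is a CPython implementation detail, and the mapping below is a hypothetical stand-in for the real one.

```python
import sys

_MISSING = object()

# Hypothetical mapping: original name -> replacement name.
NAMES_MAP = {"ℕ": "N_1234"}


def ndir(obj=_MISSING):
    """Like dir(), but show original names instead of replacements."""
    if obj is _MISSING:
        # No argument: list the names in the caller's local scope.
        names = list(sys._getframe(1).f_locals)
    else:
        names = dir(obj)
    reverse = {mangled: original for original, mangled in NAMES_MAP.items()}
    return sorted(reverse.get(name, name) for name in names)


class A:
    pass


setattr(A, "N_1234", 1)
print("ℕ" in ndir(A), "N_1234" in ndir(A))
```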

Here's a sample session with the new code.

>>> from ideas.examples import unnormalized_unicode
>>> from ideas.console import start
>>> start()
Configuration values for the console:
    console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x018CAD20>}
    transform_source: <function fix_names at 0x0167C540>
--------------------------------------------------
Ideas Console version 0.0.20. [Python version: 3.7.8]

~>> ℕ = 1
~>> N = 2
~>> ℕ
1
~>> dir()
['N', '_8dab3ef5fc2949e992deda99acbfb037', '__NAMES_MAP', '__builtins__', 'ndir']
~>> ndir()
['N', '__NAMES_MAP', '__builtins__', 'ndir', 'ℕ']
~>> class A:
...    ℕ = 1
...    N = 2
...
~>> def interesting(names):
...    return [n for n in names if not n.startswith("__")]
...
~>> interesting(dir(A))
['N', '_8dab3ef5fc2949e992deda99acbfb037']
~>> interesting(ndir(A))
['N', 'ℕ']

As I mentioned, I still have to write documentation for it (and probably a blog post), but that will have to wait for a bit as I want to think some more and see if this could not be improved further.

There is still the idea of passing locals() to the console which I need to think about...

aroberge commented 3 years ago

Just an additional thought I had ... for easier comparison, I think that the new names should start with the normalized name followed by an underscore and the uuid (which could probably be truncated a bit). So, in the example above, ℕ would map to something like N_8dab3ef5fc2949e...

aroberge commented 3 years ago

New version uploaded with the last mentioned change implemented. Here's a snippet showing the result:

~>> ℕormal = 3
~>> ndir()
['__NAMES_MAP', '__builtins__', 'ndir', 'ℕormal']
~>> dir()
['Normal_98a22058a2aa4b31a900c8b215ea09c5', '__NAMES_MAP', '__builtins__', 'ndir']
~>> ℕormal
3

aroberge commented 3 years ago

Playing with yet another new version (not uploaded to PyPI). The ideas version number shown here has not yet been changed to reflect the latest version.

>>> from ideas.examples import unnormalized_unicode
>>> from ideas.console import start
>>> start()
Configuration values for the console:
    console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x018EACD8>}
    source_init: <function source_init at 0x018EAD20>
    transform_source: <function transform_names at 0x00ADC540>
--------------------------------------------------
Ideas Console version 0.0.21. [Python version: 3.7.8]

~>> dir()
['__builtins__']
~>> ℕ = 1
~>> dir()
['__builtins__', 'ℕ']
~>> true_dir()
['N_fa969ab27247436c8c5350151e606b57', '__NAMES_MAP', '__builtins__', 'dir', 'true_dir']

skirpichev commented 3 years ago

I could not get this to work reliably when trying to redefine it so that it would show the same thing as dir().

I would prefer to fix the "patched" dir(), if possible. It doesn't have to match the original dir() behaviour exactly. But the interface should be the same, i.e. the no-args version of dir() should work.

I think that the new names should start with the normalized name followed by an underscore and the uuid

Yes, this looks better. Maybe you can add some common prefix/suffix to simplify filtering/selecting such variables.

BTW, I'm planning to use ideas (Ideas Console or just import_hook) for Diofant's command-line interface (https://github.com/diofant/diofant/pull/853). Will you, eventually, factor out the import_hook/console stuff into some separate library/ies?

aroberge commented 3 years ago

I would prefer to fix the "patched" dir(), if possible. It shouldn't match the original dir() behaviour exactly. But the interface should be same, i.e. the no-args version of dir().

I believe I got it to work now, behaving as similarly as possible to the original dir().

Yes, this looks better. Maybe you can add some common prefix/suffix to simplify filtering/selecting such variables.

One possibility I thought of was to automatically exclude variables that start with a double underscore as these are often methods that are of no interest. For example:

~>> class A:
...    ℕ = 1
...
~>> dir(A)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'ℕ']

I suspect that all the so-called magic methods, that is, those that start and end with double underscores, would be of no interest to most "casual" users or users of projects with their own consoles, like Diofant.

Would such a filtering be useful for your project?

For this example, I am thinking of keeping the original dir around, under another name. Currently it is available as true_dir, but perhaps you can think of a better name. That being said, I realize that you have already included a working version of this in Diofant, so this might not be relevant for your project...

BTW, I'm planning to use the ideas (Ideas Console or just import_hook) for the Diofant's command-line interface (diofant/diofant#853). Will you, eventually, factor-out the import_hook/console stuff to some separate library/ies?

I was not planning to do any such factoring out. In my mind, ideas is very much a toy project used to explore different possibilities of changing the way Python works. For that purpose, I believe it is important to include all the examples. I did not expect it to be found useful in any real-life project; however, I can see how this could be the case for Diofant.

One thing I can do that might be useful is to filter out the original message about configuration values for the console by default, and only show them with something like start(show_config=True). That is, hide the following:

>>> start()
Configuration values for the console:
    console_dict: {'__NAMES_MAP': {}, 'ndir': <function ndir at 0x01ABACD8>}
    source_init: <function source_init at 0x01ABAD20>
    transform_source: <function transform_names at 0x0137C540>
--------------------------------------------------
Ideas Console version 0.0.21. [Python version: 3.7.8]

I can also change it so that the message shown with the name of the console, its version, etc., is easily configurable; something like start(banner=BANNER). Similarly, I could make the prompt easily configurable as an argument of start().

I should be able to do this today and release a new version with these changes.

Finally, for AST transformations, the REPL does not "echo" back the value of names or the value of statements without an explicit print statement. I would think that fixing this would be useful for projects like Diofant. I think there might be a way of making this work for simple cases where one just wants to see the value of a variable, but I don't currently know how to have it reproduce the exact behaviour of Python's REPL. What would be needed for Diofant?

aroberge commented 3 years ago

As it turns out, Python 3.9+ includes an unparse function in the ast module. This makes it possible to do AST transformations, turn the result back into valid/normal Python code, and compile that source the usual way in the interactive interpreter, so that print() is no longer needed to see the output.
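A minimal sketch of that round trip, assuming Python 3.9+. The DivToFraction transformer is a simplified stand-in written for this illustration, not the actual fractions_ast example.

```python
import ast
from fractions import Fraction


class DivToFraction(ast.NodeTransformer):
    """Rewrite every `a / b` into `Fraction(a, b)` (illustration only)."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Div):
            return ast.Call(
                func=ast.Name(id="Fraction", ctx=ast.Load()),
                args=[node.left, node.right],
                keywords=[],
            )
        return node


def transform(source):
    tree = DivToFraction().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # back to ordinary source code


new_source = transform("1/2")
# "single" mode makes an expression statement echo its value,
# the way the interactive interpreter does.
exec(compile(new_source, "<console>", "single"), {"Fraction": Fraction})
```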

>>> from ideas.examples import fractions_ast
>>> hook = fractions_ast.add_hook()
>>> from ideas.console import start
>>> start()
Ideas Console version 0.0.21. [Python version: 3.9.5]

~>> 1/2
Fraction(1, 2)
~>> a = _
~>> a
Fraction(1, 2)

I noticed that Diofant requires Python 3.9 ... which is perfect for this.

I have uploaded a new version to PyPI which includes this change.

This new version also includes the other changes mentioned for the console (hiding the configuration values, configurable prompt and banner), etc.

It is not fully tested, but the quick interactive tests I did all worked as I expected them to.

I will definitely need to update the documentation to reflect all of these changes.

skirpichev commented 3 years ago

One possibility I thought of was to automatically exclude variables that start with a double underscore

That will be a different dir().

Would such a filtering be useful for your project?

I don't think so.

I realize that you have already included a working version of this in Diofant, so this might not be relevant for your project...

Not really. There is a very early version, which is included as a POC and not yet exposed to end users (e.g. with a CLI option).

One thing I can do that might be useful is to filter out the original message about configuration values for the console by default, and only show them with something like start(show_config=True)

Maybe. But I don't use the start() interface. Instead, I subclass IdeasConsole.

Finally, for AST transformations, the repl does not "echo" back the value of names or the value of statements without an explicit print statement.

I'm not sure I understand you. Everything seems to be working for the DiofantConsole:

$ python -m diofant --no-ipython
>>> a = 1
>>> a
1
>>> x  # this Symbol imported per default
x
>>> 1/2  # ast works!
1/2
>>> repr(_)
'Rational(1, 2)'

skirpichev commented 3 years ago

UPD:

Finally, for AST transformations, the repl does not "echo" back the value of names or the value of statements without an explicit print statement. I would think that fixing this would be useful for projects like Diofant.

I think I got it. This happens, for example, with the AutomaticSymbols() AST transformer. Cf. the IPython session:

$ python -m diofant -a

In [1]: a
Out[1]: a

In [2]:                                                 

and the IdeasConsole:

$ python -m diofant -a --no-ipython
>>> a  # no echo!
>>> a  # now it prints
a
>>> 

FYI: the given ast transformer does the following transformation in this case:

Input:
a
Output:
a = Symbol('a')
a
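Because the transformed source now holds two statements, how it gets compiled matters for echoing. The sketch below assumes the console compiles the unparsed source in "single" mode (an assumption about the console internals). A plain string stands in for Symbol('a'), and run_single is a hypothetical helper.

```python
import contextlib
import io


def run_single(source, namespace):
    """Execute source in "single" mode and return what was echoed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<console>", "single"), namespace)
    return buffer.getvalue()


two_lines = "a = 'x'\na"

# A multi-line source is not one interactive statement; recent
# CPython versions reject it with a SyntaxError.
try:
    compile(two_lines, "<console>", "single")
    accepted_as_is = True
except SyntaxError:
    accepted_as_is = False

# On one line, separated by ";", both simple statements run and the
# trailing expression is echoed.
joined = ";".join(two_lines.split("\n"))
echoed = run_single(joined, {})
print(accepted_as_is, repr(echoed))
```

Joining with ";" only works for simple statements; compound statements such as if or for blocks cannot be joined this way.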

I did the following workaround:

--- console.py.orig 2021-06-16 08:20:05.669683618 +0300
+++ console.py      2021-06-16 08:20:13.853350692 +0300
@@ -138,6 +138,8 @@
             if hasattr(ast, 'unparse'):
                 try:
                     source = ast.unparse(tree)
+                    source = source.split("\n")
+                    source = ";".join(source)
                 except RecursionError:
                     code_obj = compile(tree, filename, "exec")
                 else: