Perl needs to normalize its identifiers

p5pRT commented 13 years ago

Migrated from rt.perl.org#96814 (status was 'open')

Searchable as RT96814$

p5pRT commented 13 years ago

From tchrist@perl.com

Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not. That means a user has to know which form all his idents are in, and which form his editor condescended to enter for him, even though he cannot see which is which in his editor. This is prone to bugs and errors, some of which will go long unnoticed.

*You* cannot tell which one got entered, and *you* cannot see which is which, but Perl distinguishes otherwise identical things.

How can this possibly not be a bug?
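To make the complaint concrete, here is a minimal sketch. It is in Python rather than Perl only because Python's standard `unicodedata` module exposes the normalization forms directly; the variable names are illustrative and not from the report.

```python
import unicodedata

# Two spellings of the same visible word "écran".
precomposed = "\u00e9cran"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301cran"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# They render identically but are different code point sequences.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 5 6

# After normalizing both to one form they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))
print(same_after_nfc)                       # True
```

A language that uses the raw code points as the identifier treats these as two different names; one that normalizes first treats them as one.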

I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

Since this is something each user must take special care to do "right" every single time, or else he gets bugs, it is something that Perl should be doing for him, based on the proven principle that nothing too important to risk being forgotten should be *able* to be forgotten.

--tom

    Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

      Platform:
        osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
        uname='openbsd chthon 4.4 generic#0 i386 '
        config_args='-des'
        hint=recommended, useposix=true, d_sigaction=define
        useithreads=undef, usemultiplicity=undef
        useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
        use64bitint=undef, use64bitall=undef, uselongdouble=undef
        usemymalloc=y, bincompat5005=undef
      Compiler:
        cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
        optimize='-O2',
        cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
        ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=4, prototype=define
      Linker and Libraries:
        ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
        libpth=/usr/local/lib /usr/lib
        libs=-lgdbm -lm -lutil -lc
        perllibs=-lm -lutil -lc
        libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
        gnulibc_version=''
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
        cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

    Characteristics of this binary (from libperl):
      Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                            PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                            USE_PERL_ATOF
      Built under openbsd
      Compiled at Jun 11 2011 11:48:28
      %ENV:
        PERL_UNICODE="SA"
      @INC:
        /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
        /usr/local/lib/perl5/site_perl/5.14.0
        /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
        /usr/local/lib/perl5/5.14.0
        /usr/local/lib/perl5/site_perl/5.12.3
        /usr/local/lib/perl5/site_perl/5.11.3
        /usr/local/lib/perl5/site_perl/5.10.1
        /usr/local/lib/perl5/site_perl/5.10.0
        /usr/local/lib/perl5/site_perl/5.8.7
        /usr/local/lib/perl5/site_perl/5.8.0
        /usr/local/lib/perl5/site_perl/5.6.0
        /usr/local/lib/perl5/site_perl/5.005
        /usr/local/lib/perl5/site_perl
        .

p5pRT commented 13 years ago

From @Hugmeir

On Thu, Aug 11, 2011 at 4:39 PM, tchrist1 <perlbug-followup@perl.org> wrote:

> Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't gotten too far into the accompanying discussion.

In any case, I agree that this needs to change, but I have doubts on how it would be called from Perl-space. 'use normalization qw< NFD >;' implies that all of the source is normalized, including string literals, so you'd actually need to do something like 'use normalization identifiers => "NFD";' to avoid confusion... But that gives the impression that you can also normalize other areas. And what about symbolic references, should those be normalized too? Can you opt(in|out) of that? :)

> I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment. Without giving it much thought, I imagine we could "simply" add checks in the core, or maybe install store/fetch hooks for GVs/pads, if those aren't a hugely terrible idea.

Unrelated to the bug report, what does Python do with bidi control characters? The PEP thread has a couple of suggestions (http://mail.python.org/pipermail/python-3000/2007-May/007750.html, http://mail.python.org/pipermail/python-3000/2007-May/007823.html, http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I don't know what they ended up implementing.

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From tchrist@perl.com

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote on Fri, 12 Aug 2011 00:26:34 PDT:

> > Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not.

> Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't gotten too far into the accompanying discussion.

Sorry, you're right, it's NFC:

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    écran = "NFD screen"   # identifier spelled in NFD (e + U+0301)
    écran = "NFC screen"   # identifier spelled in NFC (U+00E9)
    print("First screen is", écran)
    print("Second screen is", écran)

prints out

    First screen is NFC screen
    Second screen is NFC screen
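The same behaviour can be reproduced without depending on what an editor happens to type. This is a minimal sketch, assuming CPython 3; the names `nfc_name`, `nfd_name` and `namespace` are illustrative. PEP 3131 specifies NFKC for identifiers, which coincides with NFC for this word.

```python
# Build the two spellings explicitly so the source file's own encoding
# cannot hide which form is which.
nfc_name = "\u00e9cran"     # precomposed spelling
nfd_name = "e\u0301cran"    # decomposed spelling

source = f'{nfd_name} = "NFD screen"\n{nfc_name} = "NFC screen"\n'

namespace = {}
exec(source, namespace)

# Only one variable exists, stored under the normalized name,
# and the second assignment has overwritten the first.
user_names = [name for name in namespace if name != "__builtins__"]
print(user_names == [nfc_name])   # True
print(namespace[nfc_name])        # NFC screen
```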

I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.
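The length change is easy to see in a small sketch. Standard NFD is used here as a stand-in for whatever the filesystem does; whether HFS+ matches it exactly is a separate question, and the variable names are illustrative.

```python
import unicodedata

module_name = "\u00c9cran"  # "Écran" with a precomposed capital E-acute

nfc = unicodedata.normalize("NFC", module_name)
nfd = unicodedata.normalize("NFD", module_name)

# Code point counts differ by one: U+00C9 becomes "E" + U+0301.
nfc_len, nfd_len = len(nfc), len(nfd)
print(nfc_len, nfd_len)       # 5 6

# The UTF-8 byte lengths differ as well.
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))
print(nfc_bytes, nfd_bytes)   # 6 7
```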

> In any case, I agree that this needs to change, but I have doubts on how it would be called from Perl-space. 'use normalization qw< NFD >;' implies that all of the source is normalized, including string literals, so you'd actually need to do something like 'use normalization identifiers => "NFD";' to avoid confusion... But that gives the impression that you can also normalize other areas. And what about symbolic references, should those be normalized too? Can you opt(in|out) of that? :)

I agree that it has to be just for identifiers, not string literals, because there are times you need to compare with something exactly.

    $nfd = "écran";
    $nfc = "écran";

Those need to be distinct.

I think the solution for hashes should probably be a tie layer that normalizes its keys. That doesn't require any core changes.
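As a rough illustration of the idea, here is the analogous thing in Python: a mapping that normalizes every key on the way in and on lookup. It is only a sketch of the technique, not a Perl tie implementation; the class name `NormalizedKeyDict` and the choice of NFC as the default form are assumptions made for the example.

```python
import unicodedata
from collections.abc import MutableMapping


class NormalizedKeyDict(MutableMapping):
    """Mapping that normalizes every string key before use (hypothetical helper)."""

    def __init__(self, form="NFC"):
        self._form = form
        self._data = {}

    def _norm(self, key):
        # All access paths funnel through here, so callers never
        # need to remember to normalize.
        return unicodedata.normalize(self._form, key)

    def __getitem__(self, key):
        return self._data[self._norm(key)]

    def __setitem__(self, key, value):
        self._data[self._norm(key)] = value

    def __delitem__(self, key):
        del self._data[self._norm(key)]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


screens = NormalizedKeyDict()
screens["e\u0301cran"] = "stored via NFD key"
lookup = screens["\u00e9cran"]      # found via the NFC spelling
print(lookup, len(screens))
```

The string values themselves are left untouched; only the keys are folded to one form, which matches the requirement that literals stay distinct.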

> > I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

> Tying stashes is broken, so that won't do for the moment.

I was kinda just kidding, because I did remember this.

> Without giving it much thought, I imagine we could "simply" add checks in the core, or maybe install store/fetch hooks for GVs/pads, if those aren't a hugely terrible idea.
>
> Unrelated to the bug report, what does Python do with bidi control characters? The PEP thread has a couple of suggestions (http://mail.python.org/pipermail/python-3000/2007-May/007750.html, http://mail.python.org/pipermail/python-3000/2007-May/007823.html, http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I don't know what they ended up implementing.

Haven't looked at that. Bidi is ugly, since Perl stuff goes left to right, and an RTL string could flip around weak bidi mirrors so they look different.

Interesting:

I'll repeat that UTR#39 explicitly discourages support for formatting characters in identifiers.

And this one

http://mail.python.org/pipermail/python-3000/2007-May/007725.html

points out that Java can get away with this because they have all these default-ignorables they let by in source code. Yes, you can put nulls and bells all over your Java source and the compiler will ignore them outside literals. Scary.
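For a sense of what gets ignored, here is a sketch that approximates the rule documented for Java's `Character.isIdentifierIgnorable`: certain non-whitespace ISO control ranges plus everything in the Format (Cf) general category. The function names are made up for this example, and the result depends on the Unicode version of the Python running it, so treat it as an approximation rather than a port.

```python
import unicodedata


def is_identifier_ignorable(ch):
    """Approximate Java's documented rule for ignorable identifier characters."""
    cp = ord(ch)
    return (0x0000 <= cp <= 0x0008
            or 0x000E <= cp <= 0x001B
            or 0x007F <= cp <= 0x009F
            or unicodedata.category(ch) == "Cf")


def strip_ignorable(identifier):
    """Return the identifier as such a compiler would effectively see it."""
    return "".join(ch for ch in identifier if not is_identifier_ignorable(ch))


# A NUL, a BEL and a LEFT-TO-RIGHT MARK hidden inside "total".
noisy = "to\u0000t\u0007al\u200e"
cleaned = strip_ignorable(noisy)
print(cleaned)                    # total
print(len(noisy), len(cleaned))   # 8 5
```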

This

http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to hack out stupid bidi tricks right now to test.

Hm, I wonder whether this has anything useful to say about the matter, since they've had to think about it for URLs:

http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt

--tom

p5pRT commented 13 years ago

From @nwc10

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.

Strictly it doesn't:

http://developer.apple.com/library/mac/technotes/tn/tn1150.html#UnicodeSubtleties

> IMPORTANT:
>
> An implementation must not use the Unicode utilities implemented by its native platform (for decomposition and comparison), unless those algorithms are equivalent to the HFS Plus algorithms defined here, and are guaranteed to be so forever. This is rarely the case. Platform algorithms tend to evolve with the Unicode standard. The HFS Plus algorithms cannot evolve because such evolution would invalidate existing HFS Plus volumes.

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*. And it's not allowed to change.

Which I think was an issue Father C raised - Unicode evolves, therefore normalisation changes. Should Perl snapshot a particular normalisation and keep that as canonical forever? Or should we run the (small) risk that (dangerously written) scripts will change behaviour as a side effect of running on a perl (newer or older) that doesn't use the same Unicode database?

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that there isn't a working Python solution to adopt.

> This
> http://mail.python.org/pipermail/python-3000/2007-May/007833.html
> seems as far as they got. I don't see any resolution. Too tired to hack out stupid bidi tricks right now to test.

Shame.

Does any language have a working implementation of normalised Unicode identifiers?

Nicholas Clark

p5pRT commented 13 years ago

From tchrist@perl.com

Nicholas Clark <nick@ccl4.org> wrote on Fri, 12 Aug 2011 10:23:09 BST:

> On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> > I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.

> Strictly it doesn't:

> ...

> It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*. And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any module that implements it, so it's really annoying to predict. I hate the poke-it-and-see-what-shows-up approach, but maybe that's all one can do.

> Which I think was an issue Father C raised - Unicode evolves, therefore normalisation changes. Should Perl snapshot a particular normalisation and keep that as canonical forever? Or should we run the (small) risk that (dangerously written) scripts will change behaviour as a side effect of running on a perl (newer or older) that doesn't use the same Unicode database?

Is the fear that an unassigned code point would later get assigned something that changes under normalization? If people are using unassigned code points, then I suppose this may happen, but I can't see any other way. That's because of Unicode's strong stability guarantee on normalization. The key point is the last of the lines I quote below:

http://unicode.org/policies/stability_policy.html

> Unlike many other standards, the Unicode Standard is continually expanding—new characters are added to meet a variety of uses, ranging from technical symbols to letters for archaic languages. Character properties are also expanded or revised to meet implementation requirements.
>
> In each new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes to characters that were encoded in a previous version of the standard. However, the Consortium imposes limitations on the types of changes that can be made, in an effort to minimize the impact on existing implementations.
>
> ...
>
> Normalization Stability
>
> Strong Normalization Stability
> Applicable Version: Unicode 4.1+
>
> If a string contains only characters from a given version of Unicode, and it is put into a normalized form in accordance with that version of Unicode, then the results will be identical to the results of putting that string into a normalized form in accordance with any subsequent version of Unicode.
>
> More formally, given versions V and U of Unicode, and any string S which only contains characters assigned according to both V and U, the following are always true:
>
>     toNFC_V(S) = toNFC_U(S)
>     toNFD_V(S) = toNFD_U(S)
>     toNFKC_V(S) = toNFKC_U(S)
>     toNFKD_V(S) = toNFKD_U(S)
>
> In particular, once a character is encoded, its canonical combining class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to Unicode version 4.1 and up, and 4.1 itself came out 2005-03-31.
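Both halves of that argument can be poked at with a short sketch. Python ships its current Unicode database alongside a frozen 3.2.0 one (`unicodedata.ucd_3_2_0`), so one can compare normalization across two versions and also flag unassigned code points, which are the only inputs the guarantee excludes. Note that 3.2.0 predates the 4.1 cutoff, so this illustrates the idea rather than testing the guarantee itself; the helper name `only_assigned` is made up for the example.

```python
import unicodedata


def only_assigned(text):
    """True if no code point is unassigned (category Cn) in this Python's Unicode database."""
    return all(unicodedata.category(ch) != "Cn" for ch in text)


old_ucd = unicodedata.ucd_3_2_0          # frozen Unicode 3.2.0 database
sample = "e\u0301cran"                   # characters assigned in both versions

old_nfc = old_ucd.normalize("NFC", sample)
new_nfc = unicodedata.normalize("NFC", sample)
print(old_nfc == new_nfc)                # True

# An identifier using a currently unassigned code point is the risky case.
print(only_assigned(sample))             # True
print(only_assigned("x\u0378"))          # False
```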

> This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

> Does any language have a working implementation of normalised Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them through NFC. This may have ramifications on HFS+. Python issue 11230 is about being able to import library modules with non-ASCII names:

http://bugs.python.org/issue11230

And in particular

http://bugs.python.org/msg128724

which reads:

> Short answer:
>
> In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.
>
> Longer answer:
>
> I fixed the import machinery to handle correctly non-ASCII characters in module *paths*. But the import machinery is unable to handle non-ASCII characters in module *names*: it fails if the filesystem encoding is not UTF-8 (eg. it fails on Windows). There is another exception: Python doesn't support (yet) non encodable module paths on Windows. On Windows, you can use any character in directory names, but Python 3.2 encodes paths to the filesystem encoding (ANSI code page) which is a smaller charset. In practical, this Windows specific limitation on module paths doesn't really matter.
>
> I plan to fix all these issues in Python 3.3: see #3080.
>
> --
>
> > Could you please make it clear in documentation and web pages,
> > that this feature is not working yet.
>
> What's New in Python 3.2 documentation has this sentence: "Python’s import mechanism can now load modules installed in directories with non-ASCII characters in the path name. This solved an aggravating problem with home directories for users with non-ASCII characters in their usernames." which is correct.
>
> Which web page should updated/fixed?

So I don't think they have it working in module names either. Besides Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with various restrictions.

* Python does seem to do the IDS/IDC thing, so you might see idents with combining marks, but these are run through NFC so tend to go away for the common cases.

* Java I know to have filesystem issues, but Java also allows for random control characters in its identifiers, which it completely ignores and do not become part of those names.

* In contrast Go does not seem to use IDS/IDC, because you get compiler errors if you have combining marks (NFD forms):

      % 6g idents.go
      idents.go:4: invalid identifier character 0x301
      idents.go:5: invalid identifier character 0x301

      % uniquote -x < idents.go
      package main
      func main() {
          var \x{E9}cran = "NFC screen"
          var e\x{301}cran = "NFD screen"
          println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
      }

  So it doesn't mind E9, but dislikes 301.

(BTW, I keep making errors in Python because of there being no strict vars declaration that I can find the equivalent of, whereas with Go you don't have that problem.)

* I haven't poked at Ruby hard enough to know what it does here with external names. But internally, NFC and NFD forms are distinct instead of normalized:

      % ruby ident.ruby
      nfc
      nfd

      % uniquote -x < ident.ruby
      #!/usr/bin/env ruby
      #coding: utf-8
      ni\x{F1}o = "nfc"
      nin\x{303}o = "nfd"
      puts ni\x{F1}o
      puts nin\x{303}o
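The difference between the Go behaviour and the IDS/IDC behaviour described in the list above can be sketched as two identifier checks. The first follows the rule as the Go specification states it (Unicode letters or `_`, then letters, `_` or decimal digits); the second uses Python's built-in `str.isidentifier`, which follows XID_Start/XID_Continue and so admits combining marks. The function name `go_style_identifier` is invented for this example and is not anything from the Go toolchain.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def go_style_identifier(name):
    """Letter or '_' first, then letters, '_' or decimal digits; marks are rejected."""
    if not name:
        return False

    def is_letter(ch):
        return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES

    def is_digit(ch):
        return unicodedata.category(ch) == "Nd"

    return is_letter(name[0]) and all(is_letter(c) or is_digit(c) for c in name[1:])


nfc = "\u00e9cran"     # U+00E9 is a letter (Ll)
nfd = "e\u0301cran"    # U+0301 is a nonspacing mark (Mn)

results = {
    "go_nfc": go_style_identifier(nfc),   # True
    "go_nfd": go_style_identifier(nfd),   # False: the mark is not a letter
    "py_nfc": nfc.isidentifier(),         # True
    "py_nfd": nfd.isidentifier(),         # True: marks are allowed after the first character
}
print(results)
```

A letters-only rule sidesteps the NFC/NFD ambiguity for precomposable characters by refusing the decomposed spelling outright, at the cost of rejecting scripts that need combining marks.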

--tom
