Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Perl needs to normalize its identifiers #11573

Open p5pRT opened 13 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#96814 (status was 'open')

Searchable as RT96814$

p5pRT commented 13 years ago

From tchrist@perl.com

Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not. That means a user has to know which form all his idents are in, and which form his editor condescended to enter for him, even though he cannot see which is which in his editor. This is prone to bugs and errors, some of which will go long unnoticed.

*You* cannot tell which one got entered, and *you* cannot see which is which, but Perl distinguishes otherwise identical things.

How can this possibly not be a bug?
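To make the problem concrete, here is a minimal sketch (an illustration, not part of the original report) using Python's standard `unicodedata` module. The variable names `nfc` and `nfd` and the helper `code_points` are made up for the example. It shows two spellings that render identically, compare unequal as raw strings, and compare equal once both are normalized.

```python
import unicodedata

# Two spellings of the same word that render identically on screen.
nfc = "\u00e9cran"   # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301cran"  # decomposed: "e" followed by U+0301 COMBINING ACUTE ACCENT

def code_points(s):
    """Return the code points of s as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]

# Raw comparison is code point by code point, so the two differ.
same_raw = nfc == nfd

# After normalizing both to one form, they agree.
same_normalized = (
    unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)
)

print(code_points(nfc))
print(code_points(nfd))
print(same_raw, same_normalized)
```

An identifier table keyed on the raw strings would therefore hold two separate entries for what a reader sees as one name.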

I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

Since this is something each user must take special care to do "right" every single time, or else he gets bugs, it is something that Perl should be doing for him, based on the proven principle that nothing too important to risk being forgotten should be *able* to be forgotten.

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

  Platform:
    osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
    uname='openbsd chthon 4.4 generic#0 i386 '
    config_args='-des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lutil -lc
    perllibs=-lm -lutil -lc
    libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                        PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                        USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
    PERL_UNICODE="SA"
  @INC:
    /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/site_perl/5.14.0
    /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
    /usr/local/lib/perl5/5.14.0
    /usr/local/lib/perl5/site_perl/5.12.3
    /usr/local/lib/perl5/site_perl/5.11.3
    /usr/local/lib/perl5/site_perl/5.10.1
    /usr/local/lib/perl5/site_perl/5.10.0
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    .

p5pRT commented 13 years ago

From @Hugmeir

On Thu, Aug 11, 2011 at 4:39 PM, tchrist1 perlbug-followup@perl.org wrote:

> Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't gotten too far into the accompanying discussion.

In any case, I agree that this needs to change, but I have doubts on how it would be called from Perl-space. 'use normalization qw< NFD >;' implies that all of the source is normalized, including string literals, so you'd actually need to do something like 'use normalization identifiers => "NFD";' to avoid confusion... But that gives the impression that you can also normalize other areas. And what about symbolic references, should those be normalized too? Can you opt(in|out) of that? :)

> I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment. Without giving it much thought, I imagine we could "simply" add checks in the core, or maybe install store/fetch hooks for GVs/pads, if those aren't a hugely terrible idea.

Unrelated to the bug report, what does Python do with bidi control characters? The PEP thread has a couple of suggestions (http://mail.python.org/pipermail/python-3000/2007-May/007750.html, http://mail.python.org/pipermail/python-3000/2007-May/007823.html, http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I don't know what they ended up implementing.

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From tchrist@perl.com

"Brian Fraser via RT" perlbug-followup@perl.org wrote on Fri, 12 Aug 2011 00:26:34 PDT:

> > Python runs its Unicode identifiers through NFD transforms, although Perl, Ruby, and Java do not.

> Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't gotten too far into the accompanying discussion.

Sorry, you're right, it's NFC:

  #!/usr/bin/env python3.2
  # -*- coding: UTF-8 -*-
  écran = "NFD screen"   # identifier entered in NFD: "e" + U+0301
  écran = "NFC screen"   # identifier entered in NFC: U+00E9
  print("First screen is", écran)
  print("Second screen is", écran)

prints out

  First screen is NFC screen
  Second screen is NFC screen
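Because the two spellings cannot be told apart on the page, the same experiment can be sketched with escape sequences so that each form is explicit. This is an illustration under stated assumptions, not part of the original message: the names `nfd_name`, `nfc_name`, and `namespace` are made up, and the code relies on the Python 3 language reference, which says identifiers are converted to NFKC while parsing (for this word NFKC and NFC give the same result).

```python
# Build the source text from escapes so the two spellings are unambiguous.
nfd_name = "e\u0301cran"  # "e" + U+0301 COMBINING ACUTE ACCENT
nfc_name = "\u00e9cran"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

source = f'{nfd_name} = "NFD screen"\n{nfc_name} = "NFC screen"\n'

# Compile and run the two assignments in a fresh namespace.
namespace = {}
exec(source, namespace)

# Collect the identifiers that were actually bound.
bound = sorted(k for k in namespace if k != "__builtins__")

print(bound)
print(namespace[nfc_name])
```

Only one name ends up bound, and the second assignment overwrites the first, which matches the output shown above.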

I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.
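The length change is easy to see in a small sketch. This uses standard NFD from Python's `unicodedata` module, which is an assumption for illustration only and may not match what HFS+ does to a given name on disk.

```python
import unicodedata

name = "\u00c9cran"  # "Écran" spelled with precomposed U+00C9

nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Length in code points, then length in UTF-8 bytes, for each form.
print(len(nfc), len(nfd))
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))
```

The decomposed form is one code point and one UTF-8 byte longer, so a module name and the file name it maps to can differ even though they display the same.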

> In any case, I agree that this needs to change, but I have doubts on how it would be called from Perl-space. 'use normalization qw< NFD >;' implies that all of the source is normalized, including string literals, so you'd actually need to do something like 'use normalization identifiers => "NFD";' to avoid confusion... But that gives the impression that you can also normalize other areas. And what about symbolic references, should those be normalized too? Can you opt(in|out) of that? :)

I agree that it has to be just for identifiers, not string literals, because there are times you need to compare with something exactly.

  $nfd = "écran";
  $nfc = "écran";

Those need to be distinct.

I think the solution for hashes should probably be a tie layer that normalizes its keys. That doesn't require any core changes.
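The idea of a layer that normalizes keys can be sketched in Python as a mapping wrapper. This is only an analogy to a Perl tied hash, not an implementation of one; the class name `NormalizedKeyDict` and its `form` parameter are hypothetical.

```python
import unicodedata
from collections.abc import MutableMapping

class NormalizedKeyDict(MutableMapping):
    """Mapping that normalizes string keys to one form on every access."""

    def __init__(self, form="NFC"):
        self._form = form
        self._data = {}

    def _norm(self, key):
        # Only str keys are normalized; other key types pass through unchanged.
        if isinstance(key, str):
            return unicodedata.normalize(self._form, key)
        return key

    def __getitem__(self, key):
        return self._data[self._norm(key)]

    def __setitem__(self, key, value):
        self._data[self._norm(key)] = value

    def __delitem__(self, key):
        del self._data[self._norm(key)]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

screens = NormalizedKeyDict()
screens["e\u0301cran"] = "stored via NFD key"  # key entered in decomposed form
print(screens["\u00e9cran"])                   # looked up in precomposed form
print(len(screens))
```

The string values are left untouched, so literals can still be compared exactly; only the keys are folded to one form.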

> > I can figure out a tie map for hashes to make this work right, so that your strings are autonormalized, but I cannot figure out how to do that sort of magic to lookups in stashes, let alone in pads.

> Tying stashes is broken, so that won't do for the moment.

I was kinda just kidding, because I did remember this.

> Without giving it much thought, I imagine we could "simply" add checks in the core, or maybe install store/fetch hooks for GVs/pads, if those aren't a hugely terrible idea.

> Unrelated to the bug report, what does Python do with bidi control characters? The PEP thread has a couple of suggestions (http://mail.python.org/pipermail/python-3000/2007-May/007750.html, http://mail.python.org/pipermail/python-3000/2007-May/007823.html, http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I don't know what they ended up implementing.

Haven't looked at that. Bidi is ugly, since Perl stuff goes left to right, and an RTL string could flip around weak bidi mirrors so they look different.

Interesting:

I'll repeat that UTR#39 explicitly discourages support for formatting characters in identifiers.
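One rough way to act on that advice is to flag format characters before accepting an identifier. The sketch below assumes that "formatting characters" can be approximated by Unicode general category Cf, which covers the bidi controls and the zero-width joiners; that reading, and the helper name `format_chars`, are assumptions for illustration.

```python
import unicodedata

def format_chars(identifier):
    """Return (index, U+XXXX) for each general-category Cf character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(identifier)
        if unicodedata.category(ch) == "Cf"
    ]

plain = "ecran"
sneaky = "ec\u202eran"  # contains U+202E RIGHT-TO-LEFT OVERRIDE
joined = "ec\u200dran"  # contains U+200D ZERO WIDTH JOINER

print(format_chars(plain))
print(format_chars(sneaky))
print(format_chars(joined))
```

A parser could reject such names outright or strip the characters; which of the two is appropriate is a policy question this sketch does not settle.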

And this one

  http://mail.python.org/pipermail/python-3000/2007-May/007725.html

points out that Java can get away with this because they have all these default-ignorables they let by in source code. Yes, you can put nulls and bells all over your Java source and the compiler will ignore them outside literals. Scary.

This

  http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to hack out stupid bidi tricks right now to test.

Hm, I wonder whether this has anything useful to say about the matter, since they've had to think about it for URLs:

  http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt

--tom

p5pRT commented 13 years ago

From @nwc10

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.

Strictly it doesn't:

http://developer.apple.com/library/mac/technotes/tn/tn1150.html#UnicodeSubtleties

  IMPORTANT:

  An implementation must not use the Unicode utilities implemented by its native platform (for decomposition and comparison), unless those algorithms are equivalent to the HFS Plus algorithms defined here, and are guaranteed to be so forever. This is rarely the case. Platform algorithms tend to evolve with the Unicode standard. The HFS Plus algorithms cannot evolve because such evolution would invalidate existing HFS Plus volumes.

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*. And it's not allowed to change.

Which I think was an issue Father C raised - Unicode evolves, therefore normalisation changes. Should Perl snapshot a particular normalisation and keep that as canonical forever? Or should we run the (small) risk that (dangerously written) scripts will change behaviour as a side effect of running on a perl (newer or older) that doesn't use the same Unicode database?

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that there isn't a working Python solution to adopt.

> This
>
> http://mail.python.org/pipermail/python-3000/2007-May/007833.html
>
> seems as far as they got. I don't see any resolution. Too tired to hack out stupid bidi tricks right now to test.

Shame.

Does any language have a working implementation of normalised Unicode identifiers?

Nicholas Clark

p5pRT commented 13 years ago

From tchrist@perl.com

Nicholas Clark nick@ccl4.org wrote on Fri, 12 Aug 2011 10:23:09 BST:

> On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

> > I was worried about how this plays with Apple's HFS+, given that it uses NFD. If you have a module named Écran, I get nervous about how it gains a code point in length in the filesystem.

> Strictly it doesn't:

> ...

> It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*. And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any module that implements it, so it's really annoying to predict. I hate the poke-it-and-see-what-shows-up approach, but maybe that's all one can do.

> Which I think was an issue Father C raised - Unicode evolves, therefore normalisation changes. Should Perl snapshot a particular normalisation and keep that as canonical forever? Or should we run the (small) risk that (dangerously written) scripts will change behaviour as a side effect of running on a perl (newer or older) that doesn't use the same Unicode database?

Is the fear that an unassigned code point would later get assigned something that changes under normalization? If people are using unassigned code points, then I suppose this may happen, but I can't see any other way. That's because of Unicode's strong stability guarantee on normalization. The key point is the last of the lines I quote below:

  http://unicode.org/policies/stability_policy.html

  Unlike many other standards, the Unicode Standard is continually expanding—new characters are added to meet a variety of uses, ranging from technical symbols to letters for archaic languages. Character properties are also expanded or revised to meet implementation requirements.

  In each new version of the Unicode Standard, the Unicode Consortium may add characters or make certain changes to characters that were encoded in a previous version of the standard. However, the Consortium imposes limitations on the types of changes that can be made, in an effort to minimize the impact on existing implementations.

  ...

  Normalization Stability

  Strong Normalization Stability
  Applicable Version: Unicode 4.1+

  If a string contains only characters from a given version of Unicode, and it is put into a normalized form in accordance with that version of Unicode, then the results will be identical to the results of putting that string into a normalized form in accordance with any subsequent version of Unicode.

  More formally, given versions V and U of Unicode, and any string S which only contains characters assigned according to both V and U, the following are always true:

    toNFC_V(S) = toNFC_U(S)
    toNFD_V(S) = toNFD_U(S)
    toNFKC_V(S) = toNFKC_U(S)
    toNFKD_V(S) = toNFKD_U(S)

  In particular, once a character is encoded, its canonical combining class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to Unicode version 4.1 and up, and 4.1 itself came out 2005-03-31.
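The precondition in the policy, that the string contains only assigned characters, can be tested against whatever Unicode database the running interpreter ships. The following is a rough sketch, not something from the thread: the helper names are hypothetical, and it treats general category Cn as "unassigned", which also sweeps in noncharacters. U+0378 is used as the sample because it is a reserved code point in the Greek block in the Unicode versions I am aware of.

```python
import unicodedata

def unassigned(s):
    """Return the characters of s whose general category is Cn in this database."""
    return [ch for ch in s if unicodedata.category(ch) == "Cn"]

def covered_by_stability_policy(s):
    """True if no character of s is Cn, i.e. the policy's precondition holds here."""
    return not unassigned(s)

# Version of the Unicode database bundled with this interpreter.
print(unicodedata.unidata_version)

print(covered_by_stability_policy("e\u0301cran"))  # only assigned characters
print(covered_by_stability_policy("x\u0378y"))     # contains reserved U+0378
```

An implementation that refused identifiers failing this check would stay inside the guarantee, at the cost of rejecting characters added in Unicode versions newer than its own database.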

> This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

> Does any language have a working implementation of normalised Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them through NFC. This may have ramifications on HFS+. Python issue 11230 is about being able to import library modules with non-ASCII names; see

  http://bugs.python.org/issue11230

And in particular

  http://bugs.python.org/msg128724

which reads:

  Short answer:

  In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

  Longer answer:

  I fixed the import machinery to handle correctly non-ASCII characters in module *paths*. But the import machinery is unable to handle non-ASCII characters in module *names*: it fails if the filesystem encoding is not UTF-8 (eg. it fails on Windows). There is another exception: Python doesn't support (yet) non encodable module paths on Windows. On Windows, you can use any character in directory names, but Python 3.2 encodes paths to the filesystem encoding (ANSI code page) which is a smaller charset. In practical, this Windows specific limitation on module paths doesn't really matter.

  I plan to fix all these issues in Python 3.3: see #3080.

  --

  > Could you please make it clear in documentation and web pages, that this feature is not working yet.

  What's New in Python 3.2 documentation has this sentence: "Python’s import mechanism can now load modules installed in directories with non-ASCII characters in the path name. This solved an aggravating problem with home directories for users with non-ASCII characters in their usernames." which is correct.

  Which web page should updated/fixed?

So I don't think they have it working in module names either. Besides Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with various restrictions.

* Python does seem to do the IDS/IDC thing, so you might see idents with combining marks, but these are run through NFC so tend to go away for the common cases.

* Java I know to have filesystem issues, but Java also allows for random control characters in its identifiers, which it completely ignores and which do not become part of those names.

* In contrast Go does not seem to use IDS/IDC, because you get compiler errors if you have combining marks (NFD forms):

    % 6g idents.go
    idents.go:4: invalid identifier character 0x301
    idents.go:5: invalid identifier character 0x301

    % uniquote -x < idents.go
    package main
    func main() {
        var \x{E9}cran = "NFC screen"
        var e\x{301}cran = "NFD screen"
        println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
    }

  So it doesn't mind E9, but dislikes 301.

  (BTW, I keep making errors in Python because of there being no strict vars declaration that I can find the equivalent of, whereas with Go you don't have that problem.)

* I haven't poked at Ruby hard enough to know what it does here with external names. But internally, NFC and NFD forms are distinct instead of normalized:

    % ruby ident.ruby
    nfc
    nfd

    % uniquote -x < ident.ruby
    #!/usr/bin/env ruby
    #coding: utf-8
    ni\x{F1}o = "nfc"
    nin\x{303}o = "nfd"
    puts ni\x{F1}o
    puts nin\x{303}o
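The Go errors in the list above are consistent with the Go specification, which builds identifiers only from letters (Unicode categories Lu, Ll, Lt, Lm, Lo, plus "_") and decimal digits (Nd), so a combining mark (category Mn) is never allowed. The check below is a rough Python approximation of that character rule, written for illustration: the helper name `rejected_chars` is made up, and it ignores the first-character rule and keywords.

```python
import unicodedata

# Categories the Go spec accepts inside identifiers: letters and decimal digits.
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd"}

def rejected_chars(name):
    """Return U+XXXX for each character that is not a letter, "_", or Nd digit."""
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if not (ch == "_" or unicodedata.category(ch) in ALLOWED)
    ]

print(rejected_chars("\u00e9cran"))   # NFC spelling from idents.go
print(rejected_chars("e\u0301cran"))  # NFD spelling from idents.go
print(rejected_chars("nin\u0303o"))   # NFD spelling from ident.ruby
```

Under this rule the precomposed names pass and the decomposed ones are refused, so a Go programmer never ends up with two look-alike identifiers, though names typed in NFD simply fail to compile.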

--tom