Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Unicode::UCD::charscript fails to identify Han ideograph #13469

Closed p5pRT closed 8 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#120790 (status was 'resolved')

Searchable as RT120790$

p5pRT commented 10 years ago

From @mjdominus

Created by @mjdominus

This program:

  perl -MUnicode::UCD=charscript -wle 'print charscript(chr(0x6237)) // "undef"'

should print "Han", but instead it prints "undef". The same behavior occurs on two different machines, with 5.18.1 and 5.14.2.

The applicable line of the Unicode data file http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt is:

  4E00..9FCC ; Han # Lo [20941] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FCC
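The expectation of "Han" can be checked against that line directly. A minimal sketch, assuming only the single Scripts.txt line quoted above; the variable names are illustrative:

```perl
use strict;
use warnings;

# The Scripts.txt line quoted in the report.
my $line = '4E00..9FCC ; Han # Lo [20941] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FCC';

# Pull out the hex range bounds and the script name.
my ($lo, $hi, $script) = $line =~ /^([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(\w+)/
    or die "unexpected line format";

# The code point from the report.
my $cp = 0x6237;
my $in_range = (hex($lo) <= $cp && $cp <= hex($hi)) ? 1 : 0;

printf "U+%04X is %s the %s range\n",
    $cp, ($in_range ? 'inside' : 'outside'), $script;
```

U+6237 falls between U+4E00 and U+9FCC, so the data file itself assigns it to Han.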

Perl Info

```
Flags:
    category=library
    severity=medium
    module=Unicode::UCD

Site configuration information for perl 5.18.1:

Configured by mjd at Tue Oct 8 12:58:09 EDT 2013.

Summary of my perl5 (revision 5 version 18 subversion 1) configuration:

  Platform:
    osname=linux, osvers=3.2.0-54-generic, archname=x86_64-linux
    uname='linux ortolan 3.2.0-54-generic #82-ubuntu smp tue sep 10 20:08:42 utc 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dinc_version_list=none'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.6.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:

@INC for perl 5.18.1:
    /usr/local/lib/perl5/site_perl/5.18.1/x86_64-linux
    /usr/local/lib/perl5/site_perl/5.18.1
    /usr/local/lib/perl5/5.18.1/x86_64-linux
    /usr/local/lib/perl5/5.18.1
    .

Environment for perl 5.18.1:
    HOME=/home/mjd
    LANG=en_US.UTF-8
    LANGUAGE=
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mjd/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 10 years ago

From @mjdominus

The problem has been pointed out to me: charscript, despite its name, wants a codepoint number, not an actual character. This bug can be closed.
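Concretely, the call works once it is given the number itself, or once a character is converted with ord() first. A minimal sketch; `script_of_char` is a hypothetical helper written for this illustration and is not part of Unicode::UCD:

```perl
use strict;
use warnings;
use Carp qw(croak);
use Unicode::UCD qw(charscript);

# Pass the code point number, not the character.
my $by_number = charscript(0x6237);

# Hypothetical convenience wrapper: accept exactly one character and
# convert it to its code point before asking Unicode::UCD.
sub script_of_char {
    my ($char) = @_;
    croak "expected exactly one character"
        unless defined $char && length($char) == 1;
    return charscript(ord $char);
}

my $by_char = script_of_char(chr 0x6237);

print "$by_number $by_char\n";   # both lookups should report Han
```

The one-liner from the report becomes `perl -MUnicode::UCD=charscript -wle 'print charscript(0x6237) // "undef"'`.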

P.S.: Is it considered general knowledge that our bug tracker totally sucks? I'm told that this isn't because RT itself sucks, but because nobody on our side bothered to configure it properly. If someone wanted to fix this, I would be glad to put in thirty or forty minutes and come up with a long list of complaints.

p5pRT commented 10 years ago

From @jkeenan

On Sat Dec 14 08:14:20 2013, mjd@plover.com wrote:

The problem has been pointed out to me: charscript, despite its name, wants a codepoint number, not an actual character. This bug can be closed.

Closing per request from OP.

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

@jkeenan - Status changed from 'open' to 'rejected'

p5pRT commented 10 years ago

From @khwilliamson

On 12/14/2013 09:13 AM, Mark Dominus wrote:

The problem has been pointed out to me: charscript, despite its name, wants a codepoint number, not an actual character. This bug can be closed.

Suppose charscript() and friends raised a warning if the code point argument passed to them is invalid, instead of just returning undef (or the empty list as it currently does)? We could perhaps even suppress said warning unless the argument also had the utf8 flag set. That has the potential of breaking less code, I think.

p5pRT commented 10 years ago

From zefram@fysh.org

Karl Williamson wrote:

Suppose charscript() and friends raised a warning if the code point argument passed to them is invalid,

Sounds good. Specifically, you want a warning iff the argument would generate a warning if used in a numeric context. Because this is a numeric context.

suppress said warning unless the argument also had the utf8 flag set.

That does not sound like a good idea. "foo" is just as numerically invalid as "\x{2603}". If the user's passing a single character, it'll sometimes be in the Latin-1 range, in which case it could be represented either way. We want to make behaviour *less* dependent on the internal encoding of strings, not more.

-zefram
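Both observations can be checked with core Perl alone. A small sketch with illustrative variable names: "foo" and "\x{2603}" each trigger a numeric warning, and a Latin-1 character compares equal whether or not its string carries the utf8 flag.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Neither string is numeric, regardless of its internal encoding.
for my $s ("foo", "\x{2603}") {
    my $n = $s + 0;
}
my $numeric_warnings = grep { /isn't numeric/ } @warnings;

# The same Latin-1 character in both internal representations.
my $bytes = "\xE9";
my $wide  = "\xE9";
utf8::upgrade($wide);

my $same_string = ($bytes eq $wide)     ? 1 : 0;
my $flag_bytes  = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_wide   = utf8::is_utf8($wide)  ? 1 : 0;

print "numeric warnings: $numeric_warnings\n";
print "equal: $same_string, utf8 flag: $flag_bytes vs $flag_wide\n";
```

A check keyed on the utf8 flag would therefore treat `$bytes` and `$wide` differently even though they hold the same string.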

p5pRT commented 10 years ago

From @mjdominus

Zefram <zefram@fysh.org>:

Karl Williamson wrote:

Suppose charscript() and friends raised a warning if the code point argument passed to them is invalid,

Sounds good. Specifically, you want a warning iff the argument would generate a warning if used in a numeric context. Because this is a numeric context.

I'm not sure it makes sense to slow down every call to charscript() just to prevent what was actually an RTFM error.

p5pRT commented 10 years ago

From @khwilliamson

On 12/15/2013 05:39 PM, Mark Dominus wrote:

Zefram <zefram@fysh.org>:

Karl Williamson wrote:

Suppose charscript() and friends raised a warning if the code point argument passed to them is invalid,

Sounds good. Specifically, you want a warning iff the argument would generate a warning if used in a numeric context. Because this is a numeric context.

I'm not sure it makes sense to slow down every call to charscript() just to prevent what was actually an RTFM error.

This would slow down only error cases. As you pointed out, the name of the function is misleading. It seems to me that it would be a reasonable thing for us to do to help users cope with that. It would also save this list time by keeping unwarranted bug reports from being filed.

One thing to note, though, is the best place to put the warning is in a common function used by all the functions in the module to do code point argument processing, so the warning would be raised for all such functions.

p5pRT commented 10 years ago

From @khwilliamson

On 12/15/2013 10:25 PM, Karl Williamson wrote:

On 12/15/2013 05:39 PM, Mark Dominus wrote:

Zefram <zefram@fysh.org>:

Karl Williamson wrote:

Suppose charscript() and friends raised a warning if the code point argument passed to them is invalid,

Sounds good. Specifically, you want a warning iff the argument would generate a warning if used in a numeric context. Because this is a numeric context.

I'm not sure it makes sense to slow down every call to charscript() just to prevent what was actually an RTFM error.

This would slow down only error cases. As you pointed out, the name of the function is misleading. It seems to me that it would be a reasonable thing for us to do to help users cope with that. It would also save this list time by keeping unwarranted bug reports from being filed.

One thing to note, though, is the best place to put the warning is in a common function used by all the functions in the module to do code point argument processing, so the warning would be raised for all such functions.

I looked at the code of Unicode::UCD. It turns out that most of the functions in it croak when they get this type of illegal parameter. And all but two of the rest call carp. This means that the only two that are silent are charblock() and charscript().

And, the context isn't numeric. The parameter for these two functions can be either a number, or the name of a script or block. If it doesn't look like a number, it assumes it is a name, and if there is no such name, it returns undef.

It is a trivial matter to add a warning here, which would not add CPU time to the success cases. But I'd like to get more of a consensus as to whether doing so is advisable.
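The dual interpretation of the argument can be observed from the caller's side. A minimal sketch, assuming a stock Unicode::UCD; warnings are captured rather than printed, because whether the failing call warns depends on the Perl version in use:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Capture any warning the module may raise for the failing call.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# A code point number yields the script name.
my $name = charscript(0x6237);

# A script name yields a reference to that script's ranges.
my $ranges = charscript('Thai');

# A character is neither a code point number nor a script name,
# so the lookup falls through and returns undef.
my $miss = charscript(chr 0x6237);

printf "name=%s ranges=%s miss=%s warnings=%d\n",
    $name, ref($ranges), $miss // 'undef', scalar @warnings;
```

The third call is the case from the original report: the return value alone does not tell the caller that the argument was never treated as a code point.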

p5pRT commented 9 years ago

From @khwilliamson

Based on the discussion, I'm reopening this ticket to fix it instead of rejecting it -- Karl Williamson

p5pRT commented 9 years ago

@khwilliamson - Status changed from 'rejected' to 'open'

p5pRT commented 9 years ago

From @khwilliamson

Fixed in bc37b130604215b78ec3e03d73b81cb08cfa741e

Thanks for reporting the problem

-- Karl Williamson

p5pRT commented 9 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 8 years ago

From @khwilliamson

Thank you for submitting this report. You have helped make Perl better.  
With the release of Perl 5.24.0 on May 9, 2016, this and 149 other issues have been resolved.

Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0

p5pRT commented 8 years ago

@khwilliamson - Status changed from 'pending release' to 'resolved'