Inconsistency in Script Run #16704

Closed p5pRT closed 5 years ago

p5pRT commented 6 years ago

Migrated from rt.perl.org#133547 (status was 'resolved')

Searchable as RT133547$

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

I was running some tests on the new (*script_run:...) regex feature, preparatory to implementing it in PCRE. As I understand it from reading perlre, the ASCII digits 0-9 should be acceptable in any script run, provided there aren't any other digits. There seems to be some inconsistency. Consider these two examples:

$ perl -e 'if ("\x{3041}12\x{3041}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
yes >ぁ12ぁ<

In this example, the two ASCII digits "12" are flanked by two Hiragana characters; the pattern matches. This is also true for many other scripts, including Greek, Cyrillic, Armenian, Hebrew, Arabic, Ethiopic, and Ogham.

$ perl -e 'if ("\x{0980}12\x{0993}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

In this example, the two ASCII digits "12" are flanked by two Bengali characters; the pattern does not match. This is also true for Thaana, Thai, Khmer and Devanagari.

Why the difference? I haven't exhaustively tested all possible scripts, and I haven't spotted any pattern in which ones match and which ones don't.

Philip
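For anyone re-running the report, the two one-liners can be folded into one script. This is a minimal sketch, not part of the original ticket: the helper name `first_four_are_script_run` and the `%cases` table are illustrative, and the output depends on the Perl version (the thread reports "yes" and "no" respectively on 5.28.0).

```perl
use 5.028;    # (*sr:...) was introduced in Perl 5.28
use strict;
use warnings;
no warnings 'experimental::script_run';    # the feature warned as experimental at the time

# Hypothetical helper (not from the ticket): true if the first four
# characters of $string form a single script run.
sub first_four_are_script_run {
    my ($string) = @_;
    return $string =~ /^(*sr:.{4})/ ? 1 : 0;
}

# The two strings from the report: ASCII "12" flanked by letters of one script.
my %cases = (
    Hiragana => "\x{3041}12\x{3041}",
    Bengali  => "\x{0980}12\x{0993}",
);

for my $script (sort keys %cases) {
    my $verdict = first_four_are_script_run($cases{$script}) ? 'yes' : 'no';
    print "$script: $verdict\n";
}
```

Printing only the script name avoids "Wide character" warnings from writing the matched text to an unencoded STDOUT.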

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl 5.28.0:

Configured by builduser at Wed Aug 1 10:43:08 CEST 2018.

Summary of my perl5 (revision 5 version 28 subversion 0) configuration:

  Platform:
    osname=linux
    osvers=4.17.11-arch1
    archname=x86_64-linux-thread-multi
    uname='linux flo-64s 4.17.11-arch1 #1 smp preempt sun jul 29 10:11:16 utc 2018 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.28/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.28/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.28/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='8.1.1 20180531'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.1/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.27.so
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.27'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.28/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong'

@INC for perl 5.28.0:
    /usr/lib/perl5/5.28/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.28/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.28/core_perl
    /usr/share/perl5/core_perl

Environment for perl 5.28.0:
    HOME=/home/ph10
    LANG=en_GB.utf8
    LANGUAGE=en_GB.utf8
    LC_ALL=C
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 6 years ago

From @abigail

On Thu, Sep 27, 2018 at 10:04:22AM -0700, Philip Hazel (via RT) wrote:

> Why the difference? I haven't exhaustively tested all possible scripts, and I haven't spotted any pattern in which ones match and which ones don't.

Can you check with blead? I reported this in August, and Karl fixed that the same day. So 5.28.0 is broken, but blead should do things correctly.

Regards,

Abigail

p5pRT commented 6 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Thu, 27 Sep 2018, Abigail via RT wrote:

> Can you check with blead?

Not without some research and learning how to do that. :-) But if I get some time I'll have a go.

> I reported this in August, and Karl fixed that the same day. So 5.28.0 is broken, but blead should do things correctly.

I'm pleased to learn that it *is* a bug, and not some misunderstanding on my part. Thanks for the fast response.

Regards, Philip

-- Philip Hazel

p5pRT commented 6 years ago

From @khwilliamson

This is fixed by

commit 393e5a4585b92e635cfc4eee34da8f86f3bfd2af
Author: Karl Williamson <khw@cpan.org>
Date: Sun Sep 30 10:38:02 2018 -0600

  PATCH: [perl #133547]: script run broken

  All scripts can have the ASCII digits for their numbers. Scripts with their own digits can alternatively use those. Only one of these two sets can be used in a script run. The decision as to which set to use must be deferred until the first digit is encountered, as otherwise we don't know which set will be used. Prior to this commit, the decision was being made prematurely in some cases. As a result of this change, the non-ASCII-digits in the Common script need to be special-cased, and different criteria are used to decide if we need to look up whether a character is a digit or not.

-- Karl Williamson
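The "defer the decision until the first digit" rule in the commit message can be modelled in a few lines of Perl. This is only an illustrative sketch, not the code in regexec.c: the function names `digit_set_zero` and `digits_from_one_set` are invented here, and the sketch checks the digit rule alone, ignoring the script of the surrounding letters. It relies on decimal digits being encoded in consecutive runs of ten, so a digit's code point minus its numeric value identifies the zero of its set.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Code point of DIGIT ZERO for the set of ten that $digit belongs to.
# Assumes $digit is a single \p{Nd} character.
sub digit_set_zero {
    my ($digit) = @_;
    return ord($digit) - num($digit);
}

# True if every decimal digit in $string comes from the same set of ten.
sub digits_from_one_set {
    my ($string) = @_;
    my $zero;    # stays undefined until the first digit is seen
    for my $char (split //, $string) {
        next unless $char =~ /\p{Nd}/;
        my $this_zero = digit_set_zero($char);
        $zero //= $this_zero;    # the first digit fixes the set for the run
        return 0 if $this_zero != $zero;
    }
    return 1;
}

printf "%s\n", digits_from_one_set("a12b")      ? 'one set' : 'mixed sets';
printf "%s\n", digits_from_one_set("1\x{09E7}") ? 'one set' : 'mixed sets';
```

The second call mixes ASCII 1 with BENGALI DIGIT ONE, so the two digits resolve to different zeros.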

p5pRT commented 6 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 6 years ago

From @khwilliamson

On 09/28/2018 01:33 AM, ph10@hermes.cam.ac.uk wrote:

> I'm pleased to learn that it *is* a bug, and not some misunderstanding on my part. Thanks for the fast response.

The fix for this should be put in 5.28.1.

perlre has been updated since 5.28.0 to make clearer the acceptable behavior of a run. Hopefully, if you had read it, you wouldn't have thought it was a misunderstanding:

https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior of a run. Hopefully, if you had read it, you wouldn't have thought it was a misunderstanding:
>
> https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f

Many thanks, Karl. That confirms what I had (finally :-) deduced myself, and is admirably clear.

Regards, Philip

-- Philip Hazel

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> The fix for this should be put in 5.28.1.

I have downloaded v5.29.4 (v5.29.3-35-g4288c5b93b) and can confirm that all the issues I previously reported are fixed. However, there are still two oddities that don't seem to be right. The digit sequences FF10..FF19 and 1D7CE..1D7FF (both in the Common script) don't seem to work as I expected. A string containing them along with Latin characters is not valid as a script run in this testing Perl. Indeed, a string with only one of them and Latin characters doesn't match (which it surely should, regardless of being a digit, since it is in the Common script). Two of them on their own, without any Latin characters, do match.

These strings match the pattern /^(*sr:.{4})/

  \x{ff10}\x{ff19}..
  \x{1d7ce}\x{1d7cf},,

These don't:

  A\x{ff10}\x{ff19}B
  A\x{ff10}BC
  A\x{1d7ce}\x{1d7cf}B
  A\x{1d7ce}BC

Regards, Philip

-- Philip Hazel
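The six strings in this message can be checked in one pass with a short script. This is a sketch for reproducing the report, not something posted in the ticket; the `@cases` and `%matched` names are illustrative, and the results for the last four strings depend on the Perl version being tested.

```perl
use 5.028;    # (*sr:...) was introduced in Perl 5.28
use strict;
use warnings;
no warnings 'experimental::script_run';    # the feature warned as experimental at the time

# Each entry pairs the notation used in the report (single-quoted, so the
# backslashes stay literal) with the actual string under test.
my @cases = (
    [ '\x{ff10}\x{ff19}..'   => "\x{ff10}\x{ff19}.."   ],
    [ '\x{1d7ce}\x{1d7cf},,' => "\x{1d7ce}\x{1d7cf},," ],
    [ 'A\x{ff10}\x{ff19}B'   => "A\x{ff10}\x{ff19}B"   ],
    [ 'A\x{ff10}BC'          => "A\x{ff10}BC"          ],
    [ 'A\x{1d7ce}\x{1d7cf}B' => "A\x{1d7ce}\x{1d7cf}B" ],
    [ 'A\x{1d7ce}BC'         => "A\x{1d7ce}BC"         ],
);

my %matched;
for my $case (@cases) {
    my ($label, $string) = @$case;
    $matched{$label} = $string =~ /^(*sr:.{4})/ ? 1 : 0;
    printf "%-24s %s\n", $label, $matched{$label} ? 'match' : 'no match';
}
```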

p5pRT commented 6 years ago

From @khwilliamson

On 10/02/2018 03:57 AM, ph10@hermes.cam.ac.uk wrote:

> These don't:
>
>   A\x{ff10}\x{ff19}B
>   A\x{ff10}BC
>   A\x{1d7ce}\x{1d7cf}B
>   A\x{1d7ce}BC

Technically, this isn't a bug, but a design flaw.

My design allowed only the ASCII digits 0-9 to be used with other scripts. Your second batch of cases here are in the Latin script, and therefore the only digits from Common that are allowed are the ASCII ones.

But that is not what a reasonable person would expect, and so the design is wrong.

I see two choices:

1) Allow the non-ASCII digits that are considered Common to match the Latin script

2) Allow these to match any script, just like the ASCII ones already do.

The second solution seems more in keeping with Unicode's intent, since they made these digits Common, so they should be allowed in multiple scripts. But the requirement that all digits in a run must come from the same sequence of 10 would remain.

I'm open to hearing arguments either way, or some third way.

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior of a run.

Sorry to nag you again, but have I got the following right? Perl allows a Common or Inherited character in a script run only if its Script Extension property lists the script of other characters in the run, or if it doesn't figure in the Extensions file. Example: the longest script run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.

Regards, Philip

-- Philip Hazel

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Tue, 2 Oct 2018, Karl Williamson wrote:

> Technically, this isn't a bug, but a design flaw.

Nice distinction! :-)

> 2) Allow these to match any script, just like the ASCII ones already do.

That is what I expected, and what I have tentatively implemented.

> The second solution seems more in keeping with Unicode's intent, since they made these digits Common, so should be allowed in multiple scripts. But the requirement that all digits in a run must come from the same sequence of 10 would remain.

Yes, indeed.

Regards, Philip

-- Philip Hazel

p5pRT commented 6 years ago

From @khwilliamson

On 10/02/2018 09:12 AM, ph10@hermes.cam.ac.uk wrote:

> Sorry to nag you again, but have I got the following right? Perl allows a Common or Inherited character in a script run only if its Script Extension property lists the script of other characters in the run, or if it doesn't figure in the Extensions file. Example: the longest script run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.

1cf7 is not a Common character in the Script Extensions property, and so yes, the longest script run in that string is AB. The Script property is irrelevant to script runs.

p5pRT commented 6 years ago

From ph10@hermes.cam.ac.uk

On Tue, 2 Oct 2018, Karl Williamson wrote:

> 1cf7 is not a Common character in the Script Extensions property, and so yes the longest script run in that string is AB. The Script property is irrelevant to script runs.

I must be misunderstanding something. I do not see the word "common" anywhere in the ScriptExtensions.txt file. Ah! It *is* the script extensions property for characters that are not mentioned whose Script property is Common. Is that it? (This is proving to be much more complicated than I expected. :-) Thanks for putting up with me.

Regards, Philip

-- Philip Hazel

p5pRT commented 6 years ago

From @khwilliamson

On 10/02/2018 10:10 AM, ph10@hermes.cam.ac.uk wrote:

> I do not see the word "common" anywhere in the ScriptExtensions.txt file. Ah! It *is* the script extensions property for characters that are not mentioned whose Script property is Common. Is that it?

The top of ScriptExtensions.txt says:

  # All code points not explicitly listed for Script_Extensions
  # have as their value the corresponding Script property value

The way mktables creates scx is to create a copy of sc, and then override the entries that are in ScriptExtensions.txt.
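The difference between the two properties can be seen from Perl itself. The sketch below is illustrative and was not posted in the ticket. It uses the compound forms `\p{Script=...}` and `\p{Script_Extensions=...}` to query sc and scx separately, and assumes a Perl whose Unicode data includes U+1CF7 (5.28 or later).

```perl
use 5.028;    # (*sr:...) was introduced in Perl 5.28
use strict;
use warnings;
no warnings 'experimental::script_run';    # the feature warned as experimental at the time

my $vedic_sign  = "\x{1cf7}";    # listed in ScriptExtensions.txt
my $ascii_digit = "1";           # not listed, so scx falls back to sc

# sc and scx agree for the ASCII digit but differ for U+1CF7.
my %property = (
    'U+1CF7 sc=Common'  => ($vedic_sign  =~ /\p{Script=Common}/            ? 1 : 0),
    'U+1CF7 scx=Common' => ($vedic_sign  =~ /\p{Script_Extensions=Common}/ ? 1 : 0),
    'U+0031 sc=Common'  => ($ascii_digit =~ /\p{Script=Common}/            ? 1 : 0),
    'U+0031 scx=Common' => ($ascii_digit =~ /\p{Script_Extensions=Common}/ ? 1 : 0),
);
printf "%-18s %s\n", $_, $property{$_} ? 'yes' : 'no' for sort keys %property;

# Script runs follow scx, so the run starting at "A" stops before U+1CF7.
my ($run) = "AB\x{1cf7}" =~ /^(*sr:(.+))/;
print "leading script run: $run\n";
```

Because `.+` is greedy, the regex engine first tries all three characters, fails the script-run check, and backtracks to the two Latin letters.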

p5pRT commented 5 years ago

From @steve-m-hay

On Sun, 30 Sep 2018 09:51:12 -0700, khw wrote:

> This is fixed by
>
> commit 393e5a4585b92e635cfc4eee34da8f86f3bfd2af
> Author: Karl Williamson <khw@cpan.org>
> Date: Sun Sep 30 10:38:02 2018 -0600

Karl, is there any chance you could prepare a patch for applying to maint-5.28? It doesn't cherry-pick cleanly and I think you're probably better placed than me to resolve the conflicts.

p5pRT commented 5 years ago

From @khwilliamson

I have now applied:

commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0
Author: Karl Williamson <khw@cpan.org>
Date: Thu Mar 14 11:48:11 2019 -0600

  Any Common digit set can match in any script

  This fixes a design flaw in script runs that in 5.30 effectively prevented digits from the Common script except the ASCII [0-9] from being in any meaningful script run.

-- Karl Williamson
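The rule as it stands after this commit can be spot-checked as follows. This is a hedged sketch rather than a test taken from the Perl test suite: the helper `is_script_run` and the case labels are invented for illustration, and the first result is expected to be a script run only on 5.30.0 or later, going by the commit message above.

```perl
use 5.028;    # (*sr:...) was introduced in Perl 5.28
use strict;
use warnings;
no warnings 'experimental::script_run';    # the feature warned as experimental at the time

# Hypothetical helper: true if the whole of $string is one script run.
sub is_script_run {
    my ($string) = @_;
    return $string =~ /\A(*sr:.+)\z/ ? 1 : 0;
}

my %is_run = (
    # A Common (fullwidth) digit among Latin letters.
    'Latin with one fullwidth digit'   => is_script_run("A\x{ff10}BC"),
    # Two digit sets in one string: ASCII 1 and FULLWIDTH DIGIT ZERO.
    'ASCII and fullwidth digits mixed' => is_script_run("A1\x{ff10}B"),
    # Two Common sets: mathematical bold zero and double-struck zero.
    'two mathematical digit sets'      => is_script_run("A\x{1d7ce}\x{1d7d8}B"),
);

printf "%-34s %s\n", $_, $is_run{$_} ? 'script run' : 'not a script run'
    for sort keys %is_run;
```

The last two cases illustrate the requirement that remains in force: all digits in a run must come from the same sequence of ten.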

p5pRT commented 5 years ago

From @khwilliamson

On 1/9/19 11:16 AM, Steve Hay via RT wrote:

> Karl, is there any chance you could prepare a patch for applying to maint-5.28? It doesn't cherry-pick cleanly and I think you're probably better placed than me to resolve the conflicts.

I didn't do this because of the design flaw in 5.30 that this ticket showed. That has now been fixed by f4e61fc03836484ea88518e8bf04cc1b32a6a1a0, but I don't know whether that commit is suitable for backporting.

--- via perlbug: queue: perl5 status: pending release https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133547

p5pRT commented 5 years ago

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.30.0, this and 160 other issues have been resolved.

Perl 5.30.0 may be downloaded via: https://metacpan.org/release/XSAWYERX/perl-5.30.0

If you find that the problem persists\, feel free to reopen this ticket.

p5pRT commented 5 years ago

@khwilliamson - Status changed from 'pending release' to 'resolved'