B API for aux_list/OP_MULTICONCAT does not return the last segment when plain & utf8 representations are different

p5pRT commented 6 years ago

Migrated from rt.perl.org#133535 (status was 'open')

Searchable as RT133535$

p5pRT commented 6 years ago

From @atoomic

Created by @atoomic

I noticed this while using the B API to compile op/substr.t with B::C on Perl 5.28.0.

According to the comment in pp_hot.c, in some cases there can be two sets of segment lengths:

     *
     * If the string has different plain and utf8 representations
     * (e.g. "\x80"), then then aux[PERL_MULTICONCAT_IX_PLAIN_PV/LEN]]
     * holds the plain rep, while aux[PERL_MULTICONCAT_IX_UTF8_PV/LEN]
     * holds the utf8 rep, and there are 2 sets of segment lengths,
     * with the utf8 set following after the plain set.

I have the feeling that the B API's aux_list for multiconcat fails to read the last segment in that scenario.

With this simplified version of op/substr.t, it's easier to debug as we have one single multiconcat op.

________________________________________________________________________________

    #!./perl

    print "1..1\n";

    use utf8;
    my $refee = bless [], "\x{100}a";
    my $string = $refee;
    $string = "$string";
    substr $refee, 0, 0, "\xff";
    my $expect = "\xff$string";    # <---- multiconcat
    print "$refee" eq $expect ? "ok 1\n" : "not ok 1\n";

________________________________________________________________________________

While running the program we go through this code with nargs=1, so we are clearly using the second set of segment lengths, not the first.

    /* Perl_pp_multiconcat */
    const_lens = aux + PERL_MULTICONCAT_IX_LENGTHS;

    if (dst_utf8) {
        const_pv = aux[PERL_MULTICONCAT_IX_UTF8_PV].pv;
        if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
            && const_pv != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv)
            /* separate sets of lengths for plain and utf8 */
            const_lens += nargs + 1;    /* <---- current line in the debugger */

Here is a look at aux:

    # ----- dump of aux from Perl_pp_multiconcat
    # header
    aux =
    aux[0] = 1
    aux[1] = \377
    aux[2] = 1
    aux[3] = "ÿ",
    aux[4] = 2

    # first element
    aux[5] 1     # <---- const_lens
    aux[6] -1
    # second segment which was not returned by B::API
    aux[7] 2
    aux[8] -1

I am not sure whether adding such a rule is good enough, but it fixes the cases where we would previously read only the first segment.

    /* Suggested patch to B API for aux_list/OP_MULTICONCAT */
    if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
        && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv
        && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv )
    {
        /* read the additional segment */
        nargs += 2;
    }

Perl Info

```
Flags:
    category=library
    severity=low
    module=B

Site configuration information for perl 5.28.0:

Configured by nicolas at Wed Nov 29 10:26:27 MST 2017.

Summary of my perl5 (revision 5 version 26 subversion 1) configuration:

  Platform:
    osname=darwin
    osvers=15.6.0
    archname=darwin-2level
    uname='darwin nicolas-r.local 15.6.0 darwin kernel version 15.6.0: mon oct 2 22:20:08 pdt 2017; root:xnu-3248.71.4~1release_x86_64 x86_64 '
    config_args='-de -Dprefix=/usr/local/perl/perls/perl-5.28.0 -Aeval:scriptdir=/usr/local/perl/perls/perl-5.28.0/bin'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    optimize='-O3'
    cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/8.0.0/lib /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib /usr/lib
    libs=-lpthread -lgdbm -ldbm -ldl -lm -lutil -lc
    perllibs=-lpthread -ldl -lm -lutil -lc
    libc=
    so=dylib
    useshrplib=false
    libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=bundle
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -fstack-protector-strong'

@INC for perl 5.28.0:
    /Users/nicolas/.dotfiles/perl-must-have/lib
    /Users/nicolas/perl5/lib/perl5/
    /usr/local/perl/perls/perl-5.28.0/lib/site_perl/5.28.0/darwin-2level
    /usr/local/perl/perls/perl-5.28.0/lib/site_perl/5.28.0
    /usr/local/perl/perls/perl-5.28.0/lib/5.28.0/darwin-2level
    /usr/local/perl/perls/perl-5.28.0/lib/5.28.0

Environment for perl 5.28.0:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/nicolas
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/perl/bin:/usr/local/perl/perls/perl-5.28.0/bin:/usr/local/opt/ccache/libexec:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/usr/local/git/bin:/usr/local/MacGPG2/bin:/Users/nicolas/.dotfiles/bin:/Users/nicolas/perl5/bin
    PERL5DB=use Devel::NYTProf
    PERL5LIB=/Users/nicolas/.dotfiles/perl-must-have/lib:/Users/nicolas/perl5/lib/perl5/
    PERLBREW_BASHRC_VERSION=0.80
    PERLBREW_HOME=/Users/nicolas/.perlbrew
    PERLBREW_MANPATH=/usr/local/perl/perls/perl-5.28.0/man
    PERLBREW_PATH=/usr/local/perl/bin:/usr/local/perl/perls/perl-5.28.0/bin
    PERLBREW_PERL=perl-5.28.0
    PERLBREW_ROOT=/usr/local/perl
    PERLBREW_VERSION=0.84
    PERL_BADLANG (unset)
    PERL_CPANM_OPT=--quiet
    SHELL=/usr/local/bin/zsh
```

p5pRT commented 6 years ago

From @tonycoz

On Thu\, 20 Sep 2018 09:57:52 -0700\, atoomic wrote:

> [...]

Considering that the aux_list() code for OP_MULTICONCAT turns the offsets into character rather than byte offsets, won't the 2 from:

    aux[7] 2
    aux[8] -1

be converted into a 1, making it the same as the first segment?

I don't know what extra useful information you would get from this change.

Tony

p5pRT commented 6 years ago

The RT System itself - Status changed from 'new' to 'open'
