Open p5pRT opened 6 years ago
I noticed this while using the B API to compile op/substr.t with B::C under Perl 5.28.0.
The comment in pp_hot.c says that in some cases there are two sets of segment lengths:
 *
 * If the string has different plain and utf8 representations
 * (e.g. "\x80"), then then aux[PERL_MULTICONCAT_IX_PLAIN_PV/LEN]]
 * holds the plain rep, while aux[PERL_MULTICONCAT_IX_UTF8_PV/LEN]
 * holds the utf8 rep, and there are 2 sets of segment lengths,
 * with the utf8 set following after the plain set.
I have the feeling that the B API's aux_list for multiconcat fails to read the last segment (the second set of lengths) in that scenario.
With this simplified version of op/substr.t, it's easier to debug, as we have a single multiconcat op.
________________________________________________________________________________
#!./perl

print "1..1\n";

use utf8;
my $refee = bless [], "\x{100}a";
my $string = $refee;
$string = "$string";
substr $refee, 0, 0, "\xff";
my $expect = "\xff$string"; # <---- multiconcat
print "$refee" eq $expect ? "ok 1\n" : "not ok 1\n";
________________________________________________________________________________
While running the program we go through this code with nargs=1, so we are clearly using the second set of segment lengths rather than the first.
Perl_pp_multiconcat

    const_lens = aux + PERL_MULTICONCAT_IX_LENGTHS;

    if (dst_utf8) {
        const_pv = aux[PERL_MULTICONCAT_IX_UTF8_PV].pv;
        if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
            && const_pv != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv)
            /* separate sets of lengths for plain and utf8 */
            const_lens += nargs + 1;    /* <---- current line in the debugger */
Here is a look at aux:

    # ----- dump of aux from Perl_pp_multiconcat
    # header
    aux =
    aux[0] = 1
    aux[1] = \377
    aux[2] = 1
    aux[3] = "ÿ",
    aux[4] = 2

    # first element
    aux[5] 1     # <---- const_lens
    aux[6] -1
    # second segment which was not returned by B::API
    aux[7] 2
    aux[8] -1
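To make the layout easier to follow, here is a minimal standalone sketch of the dump and of the pointer selection from Perl_pp_multiconcat. It does not use the perl headers: `aux_item`, the `IX_*` enum, `select_const_lens` and `fill_dump` are hypothetical names for this illustration, and the index values are inferred from the dump (five header slots, lengths starting at slot 5) rather than copied from the perl source.

```c
#include <assert.h>
#include <stddef.h>

/* Index names mirror the PERL_MULTICONCAT_IX_* constants in the report.
 * The numeric values are inferred from the dump: an assumption. */
enum {
    IX_NARGS     = 0,
    IX_PLAIN_PV  = 1,
    IX_PLAIN_LEN = 2,
    IX_UTF8_PV   = 3,
    IX_UTF8_LEN  = 4,
    IX_LENGTHS   = 5
};

/* Hypothetical stand-in for one slot of the aux array. */
typedef union {
    const char *pv;
    long        ssize;
} aux_item;

/* Return the first length entry the runtime would use, following the
 * excerpt: skip the plain set when the destination is utf8 and the
 * plain and utf8 constant strings are separate buffers. */
const aux_item *select_const_lens(const aux_item *aux, int dst_utf8)
{
    const aux_item *const_lens = aux + IX_LENGTHS;
    long nargs = aux[IX_NARGS].ssize;

    assert(nargs >= 0);
    if (dst_utf8
        && aux[IX_PLAIN_PV].pv
        && aux[IX_UTF8_PV].pv != aux[IX_PLAIN_PV].pv)
        const_lens += nargs + 1;   /* separate sets of lengths */
    return const_lens;
}

/* Fill caller-provided storage with the values from the dump. */
void fill_dump(aux_item aux[9])
{
    static const char plain[] = "\xff";      /* aux[1], 1 byte  */
    static const char utf8[]  = "\xc3\xbf";  /* aux[3], 2 bytes */

    aux[IX_NARGS].ssize     = 1;
    aux[IX_PLAIN_PV].pv     = plain;
    aux[IX_PLAIN_LEN].ssize = 1;
    aux[IX_UTF8_PV].pv      = utf8;
    aux[IX_UTF8_LEN].ssize  = 2;
    aux[5].ssize = 1;  aux[6].ssize = -1;    /* plain set */
    aux[7].ssize = 2;  aux[8].ssize = -1;    /* utf8 set  */
}
```

With the dump loaded via `fill_dump`, `select_const_lens(aux, 1)` points at slot 7 and `select_const_lens(aux, 0)` at slot 5, which matches what the debugger session shows for nargs=1.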
I'm not sure whether adding such a rule is good enough, but it fixes the cases where we previously read only the first segment.
# Suggested patch to B API for aux_list/OP_MULTICONCAT

    if (   aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
        && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv
        && aux[PERL_MULTICONCAT_IX_UTF8_PV].pv != aux[PERL_MULTICONCAT_IX_PLAIN_PV].pv
    ) {
        /* read the additional segment */
        nargs += 2;
    }
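For the nargs=1 example the second set is two entries (aux[7] and aux[8]), which is what the `+= 2` covers. Since the runtime skips the plain set with `const_lens += nargs + 1`, each set appears to hold nargs + 1 entries, so a count expressed in terms of nargs might look like the sketch below. This is only an illustration under that assumption, not a tested change to B: `aux_item`, the `IX_*` values, `has_two_length_sets` and `length_entries` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Slot indices as inferred from the dump in the report (assumption). */
enum {
    IX_NARGS     = 0,
    IX_PLAIN_PV  = 1,
    IX_PLAIN_LEN = 2,
    IX_UTF8_PV   = 3,
    IX_UTF8_LEN  = 4,
    IX_LENGTHS   = 5
};

/* Hypothetical stand-in for one slot of the aux array. */
typedef union {
    const char *pv;
    long        ssize;
} aux_item;

/* Same condition as the suggested patch: both representations exist
 * and they live in different buffers. */
int has_two_length_sets(const aux_item *aux)
{
    return aux[IX_PLAIN_PV].pv
        && aux[IX_UTF8_PV].pv
        && aux[IX_UTF8_PV].pv != aux[IX_PLAIN_PV].pv;
}

/* Number of length entries stored after the header: one set of
 * (nargs + 1) entries, or two such sets when the condition holds. */
long length_entries(const aux_item *aux)
{
    long per_set = aux[IX_NARGS].ssize + 1;

    assert(per_set >= 1);
    return has_two_length_sets(aux) ? 2 * per_set : per_set;
}
```

For the dump in this report `length_entries` gives 4 (slots 5 through 8), while an op whose plain and utf8 pointers are the same buffer gives nargs + 1.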
On Thu\, 20 Sep 2018 09:57:52 -0700\, atoomic wrote:
[...]
Considering that the aux_list() code for OP_MULTICONCAT turns the offsets into character offsets rather than byte offsets, won't the 2 from:

    aux[7] 2
    aux[8] -1

be converted into a 1, making it the same as the first segment?
I don't know what extra useful information you would get from this change.
Tony
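The byte-to-character conversion in question can be checked with a small standalone sketch. It is not the B.xs code: `utf8_chars` and `byte_lens_to_char_lens` are hypothetical helpers, the character count assumes well-formed UTF-8, and copying entries that are zero or negative through unchanged is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in a well-formed UTF-8 byte range by counting the
 * bytes that are not continuation bytes (10xxxxxx). */
long utf8_chars(const unsigned char *p, long bytes)
{
    long chars = 0;
    long i;

    assert(bytes >= 0);
    for (i = 0; i < bytes; i++)
        if ((p[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Convert one set of n byte lengths into character lengths, walking
 * the constant string pv segment by segment. Entries <= 0 (such as
 * the -1 in the dump) are copied unchanged: an assumption here. */
void byte_lens_to_char_lens(const unsigned char *pv,
                            const long *byte_lens,
                            long *char_lens,
                            long n)
{
    long i;

    for (i = 0; i < n; i++) {
        long bytes = byte_lens[i];

        if (bytes <= 0) {
            char_lens[i] = bytes;
        } else {
            char_lens[i] = utf8_chars(pv, bytes);
            pv += bytes;   /* advance to the next constant segment */
        }
    }
}
```

Feeding it the utf8 set from the dump (lengths 2 and -1 over the two bytes 0xC3 0xBF) yields 1 and -1, the same numbers as the plain set in aux[5] and aux[6].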
The RT System itself - Status changed from 'new' to 'open'
Migrated from rt.perl.org#133535 (status was 'open')
Searchable as RT133535$