substr() indexing is wrong on very big numbers

florian-pe commented 2 years ago

Above a certain value, the output of substr EXPR,OFFSET is the last character of the string

$ perl -E '$var="hello"; say substr($var, 3 . "0" x 18)'

$ perl -E 'say substr("hello", 18446744073709551615)'

$ perl -E 'say substr("hello", 18446744073709551616)'
o

$ perl -E '$var="hello"; say substr($var, 3 . "0" x 19)'
o

$ perl -E '$var="hello"; say substr($var, 3 . "0" x 200)'
o

$ perl -E '$var="hellA"; say substr($var, 3 . "0" x 19)'
A

$ perl -E '$var="hellE"; say substr($var, 3 . "0" x 19)'
E

And above the same value, substr EXPR,OFFSET,LENGTH chops off the last character of the string

$ perl -E '$var="hello"; say substr($var, 0, 3 . "0" x 18)'
hello

$ perl -E '$var="hellA"; say substr($var, 0, 3 . "0" x 18)'
hellA

$ perl -E 'say substr("hello", 0, 18446744073709551615)'
hello

$ perl -E 'say substr("hello", 0, 18446744073709551616)'
hell

$ perl -E '$var="hello"; say substr($var, 0, 3 . "0" x 19)'
hell

$ perl -E '$var="hellA"; say substr($var, 0, 3 . "0" x 19)'
hell

$ perl -E '$var="hello"; say substr($var, 0, 3 . "0" x 100)'
hell

$ perl -E '$var="hellA"; say substr($var, 0, 3 . "0" x 100)'
hell

$ perl -V
Summary of my perl5 (revision 5 version 36 subversion 0) configuration:

  Platform:
    osname=linux
    osvers=5.12.15-arch1-1
    archname=x86_64-linux-thread-multi
    uname='archlinux'
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection -flto -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.36/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.36/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.36/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='12.1.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/../lib/libc.so.6
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.35'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.36/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -flto -L/usr/local/lib -fstack-protector-strong'

Characteristics of this binary (from libperl): 
  Compile-time options:
    HAS_TIMES
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_REENTRANT_API
    USE_THREAD_SAFE_LOCALE
  Built under linux
  Compiled at May 29 2022 08:49:34
  %ENV:
    PERL5LIB="/home/USER/.my_configurations/perl_modules"
  @INC:
    /home/USER/.my_configurations/perl_modules
    /usr/lib/perl5/5.36/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.36/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.36/core_perl
    /usr/share/perl5/core_perl

florian-pe commented 2 years ago

Here is some additional numerology.

We get the entire string as long as the Most Significant Bit of a pointer size value is not set

$ perl -E 'use bignum; say substr("hello", 0, 2**63-3)'
hello
$ perl -E 'use bignum; say substr("hello", 0, 2**63-2)'
hello
$ perl -E 'use bignum; say substr("hello", 0, 2**63-1)'
hello
$ perl -E 'use bignum; say substr("hello", 0, 2**63)'

$ perl -E 'use bignum; say substr("hello", 0, 2**63+1)'

Then we get an empty string, as long as the MSB is set

$ perl -E 'use bignum; say substr("hello", 0, 2**63)'

$ perl -E 'use bignum; say substr("hello", 0, 2**63+1)'

$ perl -E 'use bignum; say substr("hello", 0, 2**63+2**32)'

$ perl -E 'use bignum; say substr("hello", 0, 2**63+2**62)'

Then we start getting some of the string again, probably following the overflow, which should unset the MSB

$ perl -E 'use bignum; say substr("hello", 0, 2**63+2**63)'
hell
$ perl -E 'use bignum; say substr("hello", 0, 2**64)'
hell
$ perl -E 'use bignum; say substr("hello", 0, 2**64+1)'
hell
$ perl -E 'use bignum; say substr("hello", 0, 2**64+2)'
hell

But then it behaves as if the (whichever it is) sign bit is set ...

$ perl -E 'use bignum; say substr("hello", 0, 2**64)'
hell
$ perl -E 'use bignum; say substr("hello", 0, 2**64-1)'
hell
$ perl -E 'use bignum; say substr("hello", 0, 2**64-2)'
hel
$ perl -E 'use bignum; say substr("hello", 0, 2**64-3)'
he
$ perl -E 'use bignum; say substr("hello", 0, 2**64-4)'
h
$ perl -E 'use bignum; say substr("hello", 0, 2**64-5)'

$ perl -E 'use bignum; say substr("hello", 0, 2**64-6)'

... because it behaves the same way as this sequence of negative numbers

$ perl -E 'use bignum; say substr("hello", 0, -0)'

$ perl -E 'use bignum; say substr("hello", 0, -1)'
hell
$ perl -E 'use bignum; say substr("hello", 0, -2)'
hel
$ perl -E 'use bignum; say substr("hello", 0, -3)'
he
$ perl -E 'use bignum; say substr("hello", 0, -4)'
h
$ perl -E 'use bignum; say substr("hello", 0, -5)'

$ perl -E 'use bignum; say substr("hello", 0, -6)'

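The correspondence between 2**64-k and -k above can be modelled in plain Perl. The following is only a sketch of the hypothesis (a 64-bit unsigned value being reinterpreted as signed two's complement); it does not show what substr or bignum actually do internally. The helper name as_signed64 is made up for illustration, and the pack formats 'Q' and 'q' assume a perl built with 64-bit integers.

```perl
use strict;
use warnings;

# Hypothetical helper: reinterpret the bits of a 64-bit unsigned integer
# as a signed two's-complement integer (needs a perl with 64-bit IVs).
sub as_signed64 {
    my ($unsigned) = @_;
    return unpack('q', pack('Q', $unsigned));
}

# 2**64-1, 2**64-2 and 2**64-3, written as literals that still fit in a UV.
for my $n (18446744073709551615, 18446744073709551614, 18446744073709551613) {
    my $signed = as_signed64($n);
    # A negative LENGTH leaves that many characters off the end of the string.
    printf "%s -> %d -> '%s'\n", $n, $signed, substr("hello", 0, $signed);
}
```

If the hypothesis holds, the three lines should end in 'hell', 'hel' and 'he', which is the same progression as the 2**64-1, 2**64-2, 2**64-3 bignum examples.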
demerphq commented 2 years ago

Personally I don't really think this is a bug in substr().

This is an example where Perl's flexibility with numeric types produces surprising results. Let's look at what perl thinks of that first number:

$ perl -MDevel::Peek -le'Dump(18446744073709551615)'
SV = IV(0x55c42a6896d0) at 0x55c42a6896e0
  REFCNT = 1
  FLAGS = (IOK,READONLY,PROTECT,pIOK,IsUV)
  UV = 18446744073709551615

This value happens to be the decimal representation of UV_MAX, i.e. 2**64-1. It is the highest value perl alone can represent as a true integer type. Add one and you get this:

$ perl -MDevel::Peek -le'Dump(18446744073709551616)'
SV = NV(0x559585d816c8) at 0x559585d816e0
  REFCNT = 1
  FLAGS = (NOK,READONLY,PROTECT,pNOK)
  NV = 1.84467440737096e+19

I.e., it has been converted to an NV (a double, i.e. floating point), which is the closest representation of the original integer.
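
The same UV/NV boundary can be observed without Devel::Peek. This is a minimal sketch, assuming a build like the reporter's (ivsize=8, nvtype='double'); same_number is a made-up helper name. Literals up to UV_MAX stay exact integers, while literals above it are stored as doubles, so neighbouring values collapse into one.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether two numeric values compare equal.
sub same_number {
    my ($x, $y) = @_;
    return $x == $y ? 1 : 0;
}

# Both literals fit in a UV, so they stay exact and remain distinct.
print "UV_MAX   vs UV_MAX-1: ",
    (same_number(18446744073709551615, 18446744073709551614) ? "equal" : "distinct"), "\n";

# Both literals exceed UV_MAX and are stored as NVs. Assuming nvtype is
# 'double', each is rounded to the nearest double, which is 2**64 for both.
print "UV_MAX+1 vs UV_MAX+2: ",
    (same_number(18446744073709551616, 18446744073709551617) ? "equal" : "distinct"), "\n";
```

On a build with a wider NV type (for example quadmath) the second comparison may come out differently.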

The internal logic uses the macro SvIV() to get a signed representation of the argument, then checks whether this IV is actually a UV. As you can see, in the first case it is a UV, so substr does not think it is a negative number. In the NV case, the NV is converted to an IV, the "is UV" flag is not set, and the value turns into -1. If there is a bug, I guess it would be here.

The second set of output you provided relates to bignum, and I believe you are seeing a similar set of effects. I am not sure of the exact details, but it wouldn't surprise me if a bignum turns into a float which is then converted to an IV, and we see the same issue as above.

This would occur in just about any part of our APIs where we internally cast data into a UV/IV, so if we need to address it (it feels like a "well don't do that") then we should address it at the numeric layer, not the substr() layer. I don't know the details of NV -> IV/UV conversion, but it feels wrong that the sign changes in the above cases.
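
For callers who want to sidestep the conversion in the meantime, one possible workaround is to clamp LENGTH in Perl space before it reaches substr. This is only a sketch: substr_clamped is a hypothetical wrapper, not something perl provides, and it deals with LENGTH only, not OFFSET. The comparison is done on Perl numbers, so a value stored as an NV is compared as a float and never cast to an IV.

```perl
use strict;
use warnings;

# Hypothetical wrapper: cap LENGTH at the string's length so that an
# oversized value is never handed to substr. Negative lengths keep their
# documented "leave that many off the end" meaning and pass through.
sub substr_clamped {
    my ($string, $offset, $length) = @_;
    my $max = length $string;
    $length = $max if $length > $max;
    return substr($string, $offset, $length);
}

print substr_clamped("hello", 0, 18446744073709551616), "\n";  # LENGTH above UV_MAX
print substr_clamped("hello", 0, 3 . "0" x 19), "\n";          # numeric string above UV_MAX
print substr_clamped("hello", 0, -1), "\n";                    # negative LENGTH, unchanged
```

With the clamp in place the first two calls should return the whole string rather than "hell".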

Yves

bram-perl commented 2 years ago

(it feels like a "well dont do that")

but then what should someone with a string of 16 exbibytes (2 ** 64 bytes) do?! :-)

sisyphus commented 2 years ago

I can see consistency in the non-bignum examples. The behaviour is consistent with the OFFSET/LENGTH argument being evaluated as -1 if it exceeds UV_MAX. This is reproducible in the following Inline::C script:

use warnings;
use Config;

use Inline C => Config =>
 BUILD_NOISY => 1,       # else any compilation warnings are hidden
;

use Inline C => <<'EOC';

SV * foo(SV * x) {
  if(SvUOK(x)) return newSVuv(SvUV(x));
  return newSViv(SvIV(x)); 
}

EOC

for(18446744073709551615, 3 . "0" x 18, 18446744073709551616, 3 . "0" x 19, 3 . "0" x 200) {
  print foo($_), "\n";
}

__END__

Outputs:

18446744073709551615
3000000000000000000
-1
-1
-1

To me, that lends credence to the behaviour. I have less than zero interest in what happens when bignum gets involved. (I don't mean to denigrate "bignum" ... this is just something I do in an attempt to preserve my sanity.)

Cheers, Rob

Perl / perl5

substr() indexing is wrong on very big numbers #20105