Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

SV Arenas have duplicate sized pool slots #22665

Open bulk88 opened 5 days ago

bulk88 commented 5 days ago

Description

SV Body arena roots are duplicative and redundant. SVt_PVNV and SVt_PVHV are identical. SVt_INVLIST, SVt_PVAV, SVt_PVOBJ are identical. SVt_PVMG and SVt_PVGV are identical. SVt_PVCV and SVt_PVFM are identical.

SVt_PVFM is marked "NOARENA" yet would fit exactly into SVt_PVCV's pool.

Why aren't these free memory pools sorted and deduped by size? Doing so would make some room for struct MG and struct GP to be pool allocated instead of malloc()ed, and would remove the hidden header, about 2 pointers in size, that each malloc allocation block costs.

The first number on the left is the size of the body struct in bytes, on a 64-bit CPU.

body_details <0, 0, 0, 40h, 0> SVt_NULL
body_details <0, 8, 20h, 41h, 0> SVt_IV
body_details <0, 8, 28h, 2, 0> SVt_NV
body_details <10h, 10h, 10h, 0C3h, 0DD0h> SVt_PV
body_details <28h, 21h, 10h, 0E4h, 0C30h> SVt_INVLIST
body_details <18h, 18h, 10h, 0C5h, 0D60h> SVt_PVIV
body_details <20h, 20h, 10h, 86h, 0CE0h> SVt_PVNV
body_details <30h, 30h, 0, 87h, 0FF0h> SVt_PVMG
body_details <0E0h, 0E0h, 0, 0E8h, 0FC0h> SVt_REGEXP
body_details <30h, 30h, 0, 0A9h, 0FF0h> SVt_PVGV
body_details <50h, 50h, 0, 0AAh, 0FF0h> SVt_PVLV
body_details <28h, 28h, 0, 0EBh, 0FF0h> SVt_PVAV
body_details <20h, 20h, 0, 0ECh, 0FE0h> SVt_PVHV
body_details <68h, 68h, 0, 0EDh, 0FD8h> SVt_PVCV
body_details <68h, 68h, 0, 6Eh, 820h> SVt_PVFM
body_details <88h, 88h, 0, 0EFh, 0CC0h> SVt_PVIO
body_details <28h, 28h, 0, 0F0h, 0FF0h> SVt_PVOBJ

Steps to Reproduce

In a C debugger, look at the PL_body_roots array. Look at the body_details struct in sv_inline.h.

Expected behavior

A smaller PL_body_roots array. More memory returned to the OS after heavy subs, or less peak memory usage, since arena pools have fewer empty slots in them towards their ends. More of the common perl core fixed-length structs, or really ALL core fixed-length structs, come from pool allocators, not malloc. Remember each pool chunk is a unit of 0x1000 or 4096 bytes, minus a fixed 10-100 bytes.

Perl configuration

C:\sources\perl5>perl -V
Summary of my perl5 (revision 5 version 41 subversion 5) configuration:
  Derived from: 344512f62ca15ae427a1e05bab2887337bd534ef
  Platform:
    osname=MSWin32
    osvers=6.1.7601
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cl'
    ccflags ='-nologo -GF -W3 -MD -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -D_CRT_
SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_WINSOCK_DEPRECATED_NO_WARNING
S -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DUSE_PERLIO'
    optimize='-O1 -Zi -GL -fp:precise'
    cppflags='-DWIN32'
    ccversion='19.36.32535'
    gccversion=''
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=undef
    longlongsize=8
    d_longdbl=define
    longdblsize=8
    longdblkind=0
    ivtype='__int64'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='__int64'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='link'
    ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\perl\
lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'
    libpth="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSV
C\14.36.32532\\lib\x64"
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.li
b advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.l
ib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
 vcruntime.lib ucrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg3
2.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_
32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt
.lib vcruntime.lib ucrt.lib
    libc=ucrt.lib
    so=dll
    useshrplib=true
    libperl=perl541.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:
\perl\lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'

Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_THREAD_SAFE_LOCALE
  Locally applied patches:
    uncommitted-changes
  Built under MSWin32
  Compiled at Oct 14 2024 04:40:08
  @INC:
    C:/sources/perl5/lib

C:\sources\perl5>

richardleach commented 5 days ago

Why aren't these free memory pools sorted and deduped by size?

It's been in the middle of my ideas list but has never made it to the top. I do think it's worth doing this, and possibly looking to see if increasing the pool size from 4k to e.g. 8k would give less wastage.

I'm also curious to see if all bodies were allocated from arenas, whether compilers would pick up on that and automatically optimise away the existing "return to arena or Safefree" branches.

bulk88 commented 5 days ago

Why aren't these free memory pools sorted and deduped by size?

It's been in the middle of my ideas list but has never made it to the top. I do think it's worth doing this, and possibly looking to see if increasing the pool size from 4k to e.g. 8k would give less wastage.

One reason for the generate_uudmap.exe cleanup branch was that I wanted a test like this,

char *p2;
char *p = malloc(1);
p2 = realloc(p, 2);
if (p2 != p)
    write_define(2);
p = p2;
malloc(2);
p2 = realloc(p, 3);
if (p2 != p)
    write_define(3);
p = p2;
malloc(3);
p2 = realloc(p, 4);
if (p2 != p)
    write_define(4);
p = p2;

and learn the actual boundaries of the OS/libc/vendor malloc, vs P5's current amateur guesses derived from the generic 4096-byte x86 page. Where malloc() keeps its bookkeeping is implementation-specific. Is it the traditional 2 pointers right before your ptr? Does the malloc steal 1-7 bytes below a "power of 2" at the end of your alloc? Was the OS designer bold and daring, so that there IS NO HEADER and malloc uses a red-black tree?

P5 core also has 2 or 3 different malloc-on-malloc systems right now: one is Win32 specific and threads specific, another is -DDEBUGGING specific, and the 3rd is when P5 Configure decides the OS malloc is garbage and totally replaces it. While there is some attempt at doing all the math to correctly subtract P5 malloc wrapper headers vs build options vs our #define GOODSIZE 4096, the offsets and constants were picked decades ago, and there is no CI code to test whether all those guesses and constants are correct. It's very rare that a core dev will use a C debugger and step into the OS malloc code, or use OS VM analysis tools.

A single +1 or -1 mistake in our math for GOODSIZE can permanently waste 1-15 or 1-31 bytes, over and over.