Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.85k stars 527 forks source link

lib/locale_threads.t is very slow to complete on Windows 11 #21611

Open sisyphus opened 6 months ago

sisyphus commented 6 months ago

In current bleadperl (5.39.5) the second of the 2 tests in lib/locale_threads.t takes 3 minutes to run on my Windows 11 box. (I initially thought it was hanging, and was quite surprised when it eventually completed successfully.) On my Ubuntu box the same test takes 17 seconds.

I'm opening this ticket in response to https://github.com/Perl/perl5/pull/21513#issuecomment-1791835941 (as requested there) - to serve as a reminder that we might want to tweak the number of threads that are created by this test, prior to the release of perl-5.40.

Cheers, Rob

Perl configuration

Summary of my perl5 (revision 5 version 39 subversion 5) configuration:
  Derived from:
  Platform:
    osname=MSWin32
    osvers=10.0.22621.2428
    archname=MSWin32-x64-multi-thread-quadmath
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='gcc'
    ccflags =' -DWIN32 -DWIN64 -fdiagnostics-color=never -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D__USE_MINGW_ANSI_STDIO -fwrapv -fno-strict-aliasing -mms-bitfields'
    optimize='-Os'
    cppflags='-DWIN32'
    ccversion=''
    gccversion='13.2.0'
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long long'
    ivsize=8
    nvtype='__float128'
    nvsize=16
    Off_t='long long'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='g++'
    ldflags ='-s -L"D:\perl-5.39.41-1320-ucrt-mcf\lib\MSWin32-x64-multi-thread-quadmath\CORE" -L"C:\winlibs-ucrt-1320\mingw64\lib" -L"C:\winlibs-ucrt-1320\mingw64\x86_64-w64-mingw32\lib" -L"C:\winlibs-ucrt-1320\mingw64\lib\gcc\x86_64-w64-mingw32\13.2.0"'
    libpth=C:\winlibs-ucrt-1320\mingw64\lib C:\winlibs-ucrt-1320\mingw64\x86_64-w64-mingw32\lib C:\winlibs-ucrt-1320\mingw64\lib\gcc\x86_64-w64-mingw32\13.2.0 D:\_64\msys_1320\1.0\local\lib
    libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32 -lquadmath
    perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32 -lquadmath
    libc=
    so=dll
    useshrplib=true
    libperl=libperl539.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-shared -s -L"D:\perl-5.39.41-1320-ucrt-mcf\lib\MSWin32-x64-multi-thread-quadmath\CORE" -L"C:\winlibs-ucrt-1320\mingw64\lib" -L"C:\winlibs-ucrt-1320\mingw64\x86_64-w64-mingw32\lib" -L"C:\winlibs-ucrt-1320\mingw64\lib\gcc\x86_64-w64-mingw32\13.2.0"'

Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_STRTOLD
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_QUADMATH
    USE_THREAD_SAFE_LOCALE
  Locally applied patches:
    uncommitted-changes
  Built under MSWin32
  Compiled at Nov  2 2023 16:21:23
  @INC:
    .\lib
    D:/comp-1320-ucrt/perl-5.39.41/lib
sisyphus commented 3 months ago

Update: As of today (blead at commit 31e18734071), the slow test is still just as slow. I timed it to take 2 minutes 55 seconds.

In the original post (above), I was using a MSWin32-x64-multi-thread-quadmath build of perl-5.39.5. For todays test I used a MSWin32-x64-multi-thread build of current blead.

khwilliamson commented 3 months ago

On 2/6/24 17:27, sisyphus wrote:

Update: As of today (blead at commit 31e1873 https://github.com/Perl/perl5/commit/31e187340716a1b729ab3e475dd9a578b28e402f), the slow test is still just as slow. I timed it to take 2 minutes 55 seconds.

In the original post (above), I was using a |MSWin32-x64-multi-thread-quadmath| build of perl-5.39.5. For todays test I used a |MSWin32-x64-multi-thread| build of current blead.

Thanks

Do you have enough knowledge of Visual Studio to easily try its profiling feature, which I know about because Tony has used it in the past?

— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/issues/21611#issuecomment-1931015716, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DH2NNJ6T6ACIZD67FDLYSLC6HAVCNFSM6AAAAAA632YNG2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZRGAYTKNZRGY. You are receiving this because you are subscribed to this thread.Message ID: @.***>

sisyphus commented 3 months ago

Do you have enough knowledge of Visual Studio to easily try its profiling feature

No - sorry. BTW, in case I haven't already mentioned it, perl built using VS2022 takes the same amount of time to run the test as the mingw build. (That is , there's no more than 5 seconds difference.)

tonycoz commented 1 month ago

The code isn't really structured for profiling, the code that does the actual exercise runs in a sub-process which the profiler doesn't profile itself (it's just a long 0% CPU usage).

I managed to extract the generated code and its data file and loaded that into the profiler.

I suspect we switch locales too much: image image image

A large part of it in my test run was in sort() calls: image

which may be an optimization opportunity for sort() on systems with per-thread locales.

flame graph: image

Note that many of the unlabelled "flames" in the flame graph are also setlocale() chains.

Note that it didn't seem to thread much: CPU usage as a total percentage rarely went above 12% (on a 16 "core" system, or about 2 threads at a time).

Running the test on Linux seems similar:

$ time ./perl -Ilib lib/locale_threads.t 
1..2
ok 1 - Didn't segfault
ok 2 - Verify there were no failures with simultaneous running threads

real    1m2.099s
user    1m8.512s
sys     0m2.221s

Though that might be that most of the cost of the test is setting up the program to run.

khwilliamson commented 4 weeks ago

The whole point of this test is to look for races involved with locales. To that end, it is constantly changing the locale so as to trigger any races. One obvious inefficiency is that it is constantly malloc'ing and destroying two buffers. Those could be converted to be dedicated, and freed only on program termination. But if this is a bottleneck, wouldn't the mallocs have shown up in the profile?

I have noticed that threading in Windows doesn't seem to actually do much concurrency. Even compiling with gmake -j18 on my 12-core box has low occupancy. I know not where the fault lies.

tonycoz commented 4 weeks ago

The whole point of this test is to look for races involved with locales. To that end, it is constantly changing the locale so as to trigger any races. One obvious inefficiency is that it is constantly malloc'ing and destroying two buffers. Those could be converted to be dedicated, and freed only on program termination. But if this is a bottleneck, wouldn't the mallocs have shown up in the profile?

safesysfree() shows in the first call stack (second image), but I expect it's drowned out by the _wsetlocale() calls.

I have noticed that threading in Windows doesn't seem to actually do much concurrency. Even compiling with gmake -j18 on my 12-core box has low occupancy. I know not where the fault lies.

That's more to do with how we structure the parallelism in GNUmakefile:

  1. the components for miniperl should build in parallel up to the limit
  2. non-xs extensions and xs-extensions are built in parallel with each other, but each non-xs extension is built in sequence with other non-xs extensions, and xs-extensions are built in sequence with other xs-extension
  3. the components for libperl5xx.dll should build in parallel up to the limit

and since building extensions is a large part of the build, we don't get much apparent parallelism.

That doesn't explain the lack of parallelism in the test though, I suspect that's a combination of lock conflicts between threads and the actual time needed to do the locale change, we're not spending 90% of the runtime loading a locale, but 10% of the runtime loading the locale and 80% of the time threads are waiting on the __acrt_locale_lock (inside _wsetlocale()).

This would result in the low CPU usage with the large percentage of time spent in _wsetlocale().

I don't see what we could do to fix that though.

sisyphus commented 4 weeks ago

I think it would be polite to have the script output something at intervals, just to demonstrate that it hasn't hung. Or at least a (windows-specific ?) warning that "this test can take a while to complete". But if it's a test that takes a while to run, then I think that all reasonable people will readily accept that.

On my Windows 11 box, today's timings for that test were approx. 2.5 minutes on 64-bit perl, and approx. 3.5 minutes on 32-bit perl. (I think that's the first time I've ever measured the time taken on a 32-bit perl.)

khwilliamson commented 3 weeks ago

safesysfree() shows in the first call stack (second image), but I expect it's drowned out by the _wsetlocale() calls.

I had seen that, but I would have thought the mallocs would take a lot longer than a free, so it doesn't make sense to me that the frees would show up but not the mallocs

khwilliamson commented 3 weeks ago

I think it would be polite to have the script output something at intervals, just to demonstrate that it hasn't hung. Or at least a (windows-specific ?) warning that "this test can take a while to complete". But if it's a test that takes a while to run, then I think that all reasonable people will readily accept that.

The point of this test is to find races. Such a test should be configured to run long enough so that it almost always fails if there is a real race. The current length is such that on some platforms it has failed occasionally and succeeded more often. So it could stand to run longer on such platforms. On my Linux box, I've run it for all 500-ish available locales and 1000 iterations. That takes a lot of memory and 10s of minutes.

The test has uncovered libc bugs on several platforms. This is an undertested area for libc implementations. Windows MSVC has had thread-safe locales available for almost 20 years now, and UCRT, available for a few years now, seems to have fixed things quite well Any bugs I've had on it have shown up in just a few iterations, and have all been my fault. MingW is only recently available with UCRT, and the bugs with the old CRT seem to show up in just a few iterations.

What I'm saying then is we could cut the iteration count on some platforms, such as Windows.

The setup time is significant though. I'm not advocating this, but I'm noting that the setup result doesn't change from run to run on a given platform unless a locale is added to or removed from that platform. Thus there are opportunities to run it only rarely, caching the result.

tonycoz commented 3 weeks ago

Debug/unoptimized builds on Linux are also pretty slow here:

tony@venus:.../perl6/t$ time ./perl harness ../lib/locale_threads.t
../lib/locale_threads.t .. ok   
All tests successful.
Files=1, Tests=2, 123 wallclock secs ( 0.01 usr  0.00 sys + 144.27 cusr  2.08 csys = 146.36 CPU)
Result: PASS
Finished test run at Tue Apr 23 10:44:56 2024.

real    2m3.664s
user    2m24.357s
sys     0m2.098s

That's with 880 locales according to locale -a | wc -l

haarg commented 2 weeks ago

@khwilliamson Is this still a release blocker? This is the last remaining technical issue marked as a blocker.

khwilliamson commented 1 week ago

I think there should be some consensus resolution before 5.40 is released. There have been substantive comments from just the OP and one person besides myself on this, and no consensus developing.

It requires significant time to analyze the platform's locales and set up as disparate values as possible. One of the things it detects is bleed-through between threads. If multiple locales have the same value for a particular item, then any bleed-through would go undetected, as each locale would return the same thing. Hence it tries to make sure that we use locales whose values differ. This requires significant analysis. For example, many locales use a dot for the numeric radix (decimal point) character; and many use a comma. The analysis tries to make sure that some of each are simultaneously tested. But this analysis need not be done every run of this. It doesn't change until the system's locales are added, removed, or updated. How to detect that, I don't know.

But the actual test runs long too. I thought 15 simultaneous locales run 100 iterations each was a bare minimum to tease out failures. Ideally it would be larger. Some bugs found occur immediately; others are due to races that line up only occasionally.

The OP has indicated he would be satisfied with some messages displayed that this is a long test. But no one else has said anything about this.

I think this ticket needs more eyes to evaluate it. Even if this topic is outside your comfort zone, please think about it

jkeenan commented 1 week ago

Unless someone can demonstrate that there was a change in blead between v5.38.0 and the point in v5.39.5 when @sisyphus filed this ticket, I don't think this should be classified as a release blocker. It's a high-performance test being run on a non-high-performance system.

leonerd commented 1 week ago

This doesn't feel like it's actually a release blocker as such.

I think it would be good to put in some sort of message to warn users "this is a slow test, it may take multiple minutes", but beyond that unless we feel that reducing the iteration count doesn't invalidate the test, I think that should be sufficient. From the description it sounds like this is only multiple-minutes long on Windows, and really how many end-users will be building Perl from source on Windows anyway? I thought the overwhelming majority just install some prebuilt binary distribution.

It might be good to look into ways to speed it up for developers but we can do that into 5.41 when we reöpen for development.

iabyn commented 5 days ago

On Thu, May 09, 2024 at 08:10:55AM -0700, Paul Evans wrote:

This doesn't feel like it's actually a release blocker as such.

I think it would be good to put in some sort of message to warn users "this is a slow test, it may take multiple minutes", but beyond that unless we feel that reducing the iteration count doesn't invalidate the test, I think that should be sufficient. From the description it sounds like this is only multiple-minutes long on Windows, and really how many end-users will be building Perl from source on Windows anyway? I thought the overwhelming majority just install some prebuilt binary distribution.

It might be good to look into ways to speed it up for developers but we can do that into 5.41 when we reöpen for development.

+1

-- Decaffeinated coffee is like dehydrated water

jkeenan commented 5 days ago

This doesn't feel like it's actually a release blocker as such. ... I think it would be good to put in some sort of message to warn users "this is a slow test, it may take multiple minutes", but beyond that unless we feel that reducing the iteration count doesn't invalidate the test, I think that should be sufficient. ...

It might be good to look into ways to speed it up for developers but we can do that into 5.41 when we reöpen for development.

Accordingly, removing the release blocker label now.