JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.73k stars 5.48k forks source link

'make test' fails for julia-1.11.0-beta1 #54280

Open sergstesh opened 6 months ago

sergstesh commented 6 months ago

This report is a logical continuation of https://github.com/JuliaLang/julia/issues/54234 . While dealing with https://github.com/JuliaLang/julia/issues/54234 I was requested (see https://github.com/JuliaLang/julia/issues/54234#issuecomment-2076086330 ) to run 'make test' on julia-1.11, which I did (see https://github.com/JuliaLang/julia/issues/54234#issuecomment-2078007628 ).

So, 'make test' fails for julia-1.11.0-beta1, though differently from how it fails for julia-1.10.2 .

I decided to try two more times to run 'make test' on julia-1.11.0-beta1, and to make life harder I decided to run the two 'make test' tries in parallel. For this I created two directories: /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1 , /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/2 . After that I unpacked the https://github.com/JuliaLang/julia/releases/download/v1.11.0-beta1/julia-1.11.0-beta1-full.tar.gz source tarball in each of the two directories, and then in each of the two /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1/julia-1.11.0-beta1 , /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/2/julia-1.11.0-beta1 directories I invoked the 'nice -n 19 make test 1>make_test.log 2>&1' command line. The command lines were invoked almost simultaneously - the difference in time was a couple of seconds.

After completion of 'make test' I created a symlink in the first directory: ln -s make_test.log make_test.1.log and in the second directory: ln -s make_test.log make_test.2.log . Now the two log files have different names and I'm uploading them.

make_test.1.log make_test.2.log .

The following has to be noted:

  1. execution time of the two 'make test' is significantly different;
  2. the two 'make test' fail in different tests.

Also it has to be noted that after completion of the two 'make test' there is a dangling Julia process:

sergei@slaptop:~$ ps auwx | grep julia | grep media
sergei    119546 17.8  0.0 5585212 3964 ?        SNsl Apr26  90:14 /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1/julia-1.11.0-beta1/usr/bin/julia -C native -J/media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1/julia-1.11.0-beta1/usr/lib/julia/sys.so --depwarn=error --check-bounds=yes -g1 --startup-file=no -C native -J/media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1/julia-1.11.0-beta1/usr/lib/julia/sys.so -g1 --startup-file=no --check-bounds=yes --startup-file=no --depwarn=error --heap-size-hint=3148M --bind-to 127.0.0.1 --worker
sergei@slaptop:~$ 

.

fatteneder commented 6 months ago

IIUC the logs correctly, then in make_log.2.log the loading.jl tests all passed, because there appear two loading.jl entries in that file, once with started and once with elapsed time. However, loading.jl fails in make_log.1.log (at the relocatable upgrades #51989 test, also failed here https://github.com/JuliaLang/julia/issues/54234#issuecomment-2078288296) , so there seems to be some randomness involved, but nothing obvious as to why that particular test should fail.

sergstesh commented 6 months ago

there seems to be some randomness involved

A typical phenomenon in multi-threaded environment. In the best case it's too short timeout values, otherwise algorithmic error.

...

And don't miss the dangling Julia process - looks like the kill signal hasn't been properly sent.

...

If one decides to reproduce the tests, make sure there are, say, 40GB virtual memory available (my swap file is 64GB, I saw more than 32GB swap used during the tests execution, but I have a web browser with a lot of windows/tbs open, and it consumes, say, 10GB.

...

Regarding test failures - pay special attention to

    858 loading                                          (3) |         failed at 2024-04-26T23:13:41.669
    859 Test Failed at /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/1/julia-1.11.0-beta1/test/loading.jl:1365
    860   Expression: success(addenv(`$(Base.julia_cmd()) --project=$project_path --startup-file=no -e 'import Pkg; Pkg.develop("Foo51989"); Pkg.precompile(); exit(0)'`, "JULIA_DEPOT_PATH" =>    860  depot))

in 'make_test.1.log' and to

   1002 cmdlineargs                                      (5) |         failed at 2024-04-27T00:19:48.212
   1003 Test Failed at /media/sergei/4c7aa17d-44cf-423c-b211-ce583883925c/home/sergei/2/julia-1.11.0-beta1/test/cmdlineargs.jl:1150
   1004   Expression: success(pipeline(setenv(`$(Base.julia_cmd()) --bug-report=rr-local -e 'exit()'`, "JULIA_RR_RECORD_ARGS" => "-n --nested=ignore", "_RR_TRACE_DIR" => temp_trace_dir); ))

, i.e. in 'make_test.1.log' the failing test is 'loading.jl' and in 'make_test.2.log' the failing test is 'cmdlineargs.jl' - if I understand the diagnostic messages correctly.

fatteneder commented 6 months ago

Yes, I noticed that too that both failures happen when the tests are hanging inside a separately spawned Julia process, here and here.

fatteneder commented 6 months ago

When I just run this part here locally then it also takes ~2 min. The reason it takes so long is that that test sets up an empty depot without any caches, and so I think the spawned julia process then first has to re-compile all of Pkg.jl and its dependencies before the actual script for that instance can run.

If that's true, then I guess that's fixable. But I am curious as to why that problem doesn't show in CI. Maybe the CI machines are just beefier?

sergstesh commented 6 months ago

Maybe the CI machines are just beefier?

What is "CI" ?

fatteneder commented 6 months ago

What is "CI" ?

continuous integration, the automated testing pipeline