JuliaLang / julia

TTFX regression with Comonicon CLI on Julia 1.11 #55171

Open · timholy opened 1 month ago

timholy commented 1 month ago

Following up on a Slack conversation: there is indeed a substantial regression in the responsiveness of a CLI application written with Comonicon:

Julia 1.10:

~/.julia/bin$ time ./cli --help
# lots of output
real    0m0.285s
user    0m0.395s
sys     0m0.642s

Julia 1.11-rc1:

real    0m3.681s
user    0m3.656s
sys     0m0.866s

I haven't really participated in the Julia 1.11 release and don't have time to check myself, but since it's a pretty sizable regression I'm slapping a milestone on it. (I am not personally affected by CLI TTFX, though.)

Here are the specifics of what I tested:

tim@diva:/tmp/TTFX/CLI$ tree
.
├── Manifest.toml
├── Project.toml
└── src
    └── CLI.jl

1 directory, 3 files
tim@diva:/tmp/TTFX/CLI$ cat src/CLI.jl
module CLI

using Comonicon

"""
ArgParse example implemented in Comonicon.

# Arguments

- `x`: an argument

# Options

- `--opt1 <arg>`: an option
- `-o, --opt2 <arg>`: another option

# Flags

- `-f, --flag`: a flag
"""
Comonicon.@main function main(x; opt1=1, opt2::Int=2, flag=false)
    println("Parsed args:")
    println("flag=>", flag)
    println("arg=>", x)
    println("opt1=>", opt1)
    println("opt2=>", opt2)
end

end # module CLI

Build with

(CLI) pkg> build

and then test with

~/.julia/bin$ time ./cli --help

timholy commented 1 month ago

Oh, and if anyone is looking into this, note that the master branch of SnoopCompile works on Julia 1.11. See this issue for progress on the release: https://github.com/timholy/SnoopCompile.jl/issues/382
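
For anyone picking this up, here is a minimal sketch (not a measured result) of collecting inference-timing data for the CLI with SnoopCompile; the snooped call CLI.command_main(["--help"]) is only an assumption about a convenient entry point:

# Sketch: collect inference timings for the CLI entry point in a fresh session.
using CLI
using SnoopCompileCore
tinf = @snoop_inference CLI.command_main(["--help"])  # run the workload while snooping
using SnoopCompile                                    # analysis tools, loaded after collection
tinf                                                  # shows the ROOT InferenceTimingNode summary
itrigs = inference_triggers(tinf)                     # calls that triggered fresh inference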

IanButterworth commented 1 month ago

I tried to replicate this, but:

(CLI) pkg> st
Project CLI v0.1.0
Status `~/Documents/GitHub/CLI/Project.toml`
  [863f3e99] Comonicon v1.0.8

(CLI) pkg> build

shell> ls ~/.julia/bin
ls: /Users/ian/.julia/bin: No such file or directory

If anyone familiar with Comonicon can help, I'd be happy to investigate (@timholy mentioned on Slack that he wasn't sure why this didn't build ~/.julia/bin/cli).

Sbozzolo commented 1 month ago

I don't know if this is related, but we are seeing massive latency increases in ClimaAtmos.jl with Julia 1.11-rc1 (https://github.com/CliMA/ClimaAtmos.jl/issues/3186).

I was led to this issue through SnoopCompile. ClimaAtmos.jl no longer compiles in any reasonable time with Julia 1.11-rc1 (or the previous betas). I identified that the offending function is get_atmos, a simple function that returns a keyword-defined type. For the most part, this function just converts string keywords read from YAML files into types and then returns an AtmosModel.

I used SnoopCompile (master) to try to get a little more insight. The get_atmos function, as it stands, takes too long to compile (on Julia 1.10 this is instantaneous), so I had to remove part of it (as mentioned in the referenced issue). Once I do, @snoop_inference reports:

InferenceTimingNode: 1.171199/288.711156 on Core.Compiler.Timings.ROOT() with 21 direct children

Plotting the inference: [inference flame graph attached]

All the time is spent in types.jl line 329. This line simply defines our struct:

Base.@kwdef struct AtmosModel{
    MC,
    MM,
    PM,
    CM,
    CCDPS,
    F,
    S,
    RM,
    LA,
    EXTFORCING,
    EC,
    AT,
    TM,
    EEM,
    EDM,
    ESMF,
    ESDF,
    ENP,
    EVR,
    TCM,
    NOGW,
    OGW,
    HD,
    VD,
    DM,
    SAM,
    VS,
    RS,
    ST,
    IN,
    SM,
    SA,
    NUM,
}
    model_config::MC = nothing
    moisture_model::MM = nothing
    precip_model::PM = nothing
    cloud_model::CM = nothing
    call_cloud_diagnostics_per_stage::CCDPS = nothing
    forcing_type::F = nothing
    subsidence::S = nothing
    radiation_mode::RM = nothing
    ls_adv::LA = nothing
    external_forcing::EXTFORCING = nothing
    edmf_coriolis::EC = nothing
    advection_test::AT = nothing
    tendency_model::TM = nothing
    edmfx_entr_model::EEM = nothing
    edmfx_detr_model::EDM = nothing
    edmfx_sgs_mass_flux::ESMF = nothing
    edmfx_sgs_diffusive_flux::ESDF = nothing
    edmfx_nh_pressure::ENP = nothing
    edmfx_filter::EVR = nothing
    turbconv_model::TCM = nothing
    non_orographic_gravity_wave::NOGW = nothing
    orographic_gravity_wave::OGW = nothing
    hyperdiff::HD = nothing
    vert_diff::VD = nothing
    diff_mode::DM = nothing
    sgs_adv_mode::SAM = nothing
    viscous_sponge::VS = nothing
    rayleigh_sponge::RS = nothing
    sfc_temperature::ST = nothing
    insolation::IN = nothing
    surface_model::SM = nothing
    surface_albedo::SA = nothing
    numerics::NUM = nothing
end
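
(For reference, a rough illustrative sketch of the kind of keyword constructor Base.@kwdef generates, shown here for a toy two-parameter type rather than the real AtmosModel; each new combination of field types gives inference a fresh constructor specialization to work through.)

# Toy illustration of what Base.@kwdef roughly expands to for a parametric struct.
struct Toy{A, B}
    a::A
    b::B
end
# @kwdef adds (approximately) a keyword constructor like this:
Toy(; a = nothing, b = nothing) = Toy(a, b)

# Every distinct combination of argument types yields a new specialization:
Toy(a = 1)              # Toy{Int64, Nothing}
Toy(a = 1.0, b = true)  # Toy{Float64, Bool}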

I have not tried without keyword arguments.

I don't think that what we are doing up to this point is particularly complex or unorthodox, and most of the types involved are very simple (mostly singletons and bools, all immutable).

Should I open a new issue for this?

KristofferC commented 1 month ago

Yes, please open a new issue, since it is unlikely to be related to this one.

Roger-luo commented 1 month ago

Hi, you can try the example package here: https://github.com/comonicon/Comonicon.jl/tree/main/example/Hello

and you can install the entry point by running Hello.comonicon_install()

help?> Hello.comonicon_install
  comonicon_install(;kwargs...)

  Install the CLI manually. This will use the default configuration in
  Comonicon.toml, if it exists. For more detailed reference, please refer to
  Comonicon documentation (https://comonicon.org).

you don't have to test that with the installed entry point; you can test it directly with its Julia entry point (see the sketch after the help output below)

help?> Hello.command_main
  No documentation found.

  Hello.command_main is a Function.

  # 2 methods for generic function "command_main" from Hello:
   [1] command_main(ARGS::Vector{String})
       @ ~/Code/Julia/Comonicon/src/codegen/julia.jl:90
   [2] command_main()
       @ ~/Code/Julia/Comonicon/src/codegen/julia.jl:90
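
For example, a sketch of timing the Julia entry point directly (assuming the Hello example package is the active project):

$ time julia --project -e 'using Hello; exit(Hello.command_main(["--help"]))'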

KristofferC commented 1 month ago

I get:

# 1.10
hello --help  0.37s user 0.31s system 308% cpu 0.219 total

# 1.11
hello --help  1.01s user 1.58s system 354% cpu 0.731 total

And considering that both Pkg and REPL are loaded, this does not seem very surprising.
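
One way to confirm that package loading dominates is InteractiveUtils.@time_imports (available since Julia 1.8), which prints how long each dependency takes to load; a sketch against the MWE project (output not reproduced here):

$ julia +1.11 --project -e 'using InteractiveUtils; @time_imports using CLI'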

timholy commented 1 month ago

Hmm, it was much worse for me; I wonder why. Agreed, this seems more reasonable.

Is it all TTL (time to load)? There seems to be a path forward that Jameson and I sketched out for significantly reducing TTL, but it will take some time to explore. So let me ask a heretical question: if Pkg & REPL are the two culprits and they come up a lot, how important is their exclusion from the sysimage for 1.11? Clearly we want this eventually, but until juliac is out, I'm unclear on whether the advantages of their exclusion more than compensate for the disadvantages. Something that might be worth considering is "baking them in" for 1.11 (hopefully that's just a couple of new lines to re-include them; I'm not suggesting we revert any hard work), with the expectation that they'll be excluded again in a future release where either the TTL issue has been fixed or the advantages of their exclusion clearly outweigh the disadvantages.

If "putting them back" is not just a couple of lines, then you can ignore this.

IanButterworth commented 1 month ago

@timholy what's your platform? WSL2?

KristofferC commented 1 month ago

Regarding this specific case, it seems "self-inflicted". The Hello app here depends on the full Comonicon.jl, which pulls in a lot of dependencies (Pkg, REPL) both to do argument parsing and to add a Hello.comonicon_install command inside the app itself. To me, this would be better structured as having Comonicon install the app (Comonicon.install("examples/Hello")) and having the app itself depend on a smaller package that handles only the argument parsing.

KristofferC commented 1 month ago

As of right now, just adding Pkg to the sysimage doesn't work:

LoadError("sysimg.jl", 57, LoadError("/Users/kristoffercarlsson/julia/usr/share/julia/stdlib/v1.12/Pkg/src/Pkg.jl", 3, 
  LoadError("/Users/kristoffercarlsson/julia/usr/share/julia/stdlib/v1.12/Pkg/src/precompile.jl", 101, 
    Base.IOError("could not spawn setenv(`'' a registries/Registry.tar.gz -tgzip

which I guess comes from https://github.com/JuliaLang/Pkg.jl/blob/da9f89ffc7df973d273a3952051ad656987bc64b/src/PlatformEngines.jl#L20 and from the fact that jlls don't work in the process that creates the sysimage, since __init__ is disabled in that process.

IanButterworth commented 1 month ago

Including stdlibs shouldn't be expensive, and in my testing it isn't on macOS or native Linux. And from what I hear, it isn't on native Windows either.

I suspect this is another edge case where WSL2 (which I believe Tim is using) is so slow for filesystem operations (100x slower than native Linux) that the filesystem work in our loading code is getting in the way.

We can probably replace some readdir calls in there with the much more efficient _readdirx.
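
Roughly the idea (a sketch only; _readdirx is an internal Base.Filesystem helper added for 1.11, and the exact shape of its returned entries is an assumption here):

dir = "."  # any directory

# readdir returns only names, so a follow-up isfile costs an extra stat per entry:
files = [n for n in readdir(dir) if isfile(joinpath(dir, n))]

# _readdirx returns entries that already carry the directory-entry type from the
# OS, so the predicate can usually be answered without another stat call:
entries = Base.Filesystem._readdirx(dir)
files2 = [e.name for e in entries if isfile(e)]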

IanButterworth commented 1 month ago

A profile would probably help.

timholy commented 1 month ago

In trying to install perf, I'm currently hitting https://github.com/microsoft/WSL/issues/10913. I should probably upgrade to 22.04, but that will take some time. In the meantime, given that this is WSL-specific, I agree with removing this from the milestone.

Roger-luo commented 1 month ago

I'm slowly refactoring things in Comonicon so that features like app installation are handled as a plugin. The package was written in the pre-package-image era, so everything was generated as zero-dependency code and injected into the user's code space via a macro. As @KristofferC mentioned, this is no longer necessary and can be done with a separate runtime package that supports the fancy CLI features (which is what I'm planning).

If excluding the stdlibs is the only reason for the slowdown, I'm OK with not fixing it.

topolarity commented 1 month ago

another edge case where WSL2 (which I believe Tim is using) is so slow for filesystem operations (100x slower than native Linux) that the filesystem work in our loading code is getting in the way.

I'm not sure that fully explains the slowdown here - WSL2 has very slow filesystem operations, but only when accessing files across the bridge to the native Windows FS.

When accessing files in the Linux filesystem, it should be pretty competitive. Is @timholy keeping his depot on the Windows side?

IanButterworth commented 1 month ago

Tim provided some redone timing info to me for https://github.com/JuliaLang/julia/pull/55331 and saw faster baseline times than those reported here, for unknown reasons.

(Note: my understanding is that Tim's Julia installation and depot reside within the Linux filesystem on WSL2.)

topolarity commented 1 month ago

I'm able to reproduce this on my machine, although I have to have assertions on to get numbers almost as bad as in the OP:

# julia +1.10
~/.julia/bin/cli --help  0.23s user 0.30s system 239% cpu 0.220 total
# julia +nightly
~/.julia/bin/cli --help  1.40s user 0.20s system 138% cpu 1.157 total
# local master (w/ assertions)
~/.julia/bin/cli --help  2.18s user 0.14s system 121% cpu 1.911 total

In case it's helpful, here's an strace -r -tt -T from the nightly and 1.10 runs: cli_strace_1_10.log cli_strace_nightly.log

I haven't looked closely enough at the trace yet to know what's dominating here, but we are making ~8x as many syscalls as we used to.

KristofferC commented 1 month ago

I'm not surprised there are (many) more syscalls now, because we went from opening 1 package to opening a lot of them. But looking at the logs, there are some cases like

18:16:27.787691 (+     0.000211) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/Project.toml", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=165, ...}) = 0 <0.000093>
18:16:27.787910 (+     0.000218) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/Project.toml", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=165, ...}) = 0 <0.000090>

18:16:27.790806 (+     0.000760) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/src/Logging.jl", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2404, ...}) = 0 <0.000110>
18:16:27.791096 (+     0.000289) statx(AT_FDCWD, "/home/topolarity/.julia/juliaup/julia-nightly/share/julia/stdlib/v1.12/Logging/src/Logging.jl", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=2404, ...}) = 0 <0.000116>

which are two identical stat calls on the same file at almost the same time, which is probably not needed?

topolarity commented 1 month ago

Yeah, agreed. Scrolling the log doesn't show any particular dominator, just a lot of package loading.

This is without https://github.com/JuliaLang/julia/pull/55331 so there's a chance Ian has already cleaned up some of that 🙂

topolarity commented 1 month ago

The summarized results are also interesting (strace -c):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 50.76    0.103150          14      7087           read
 13.25    0.026934          11      2326      1253 access
 13.00    0.026428          10      2588         1 lseek
  6.84    0.013900           8      1615       118 statx
  5.28    0.010729          11       973        97 openat
  4.17    0.008476           9       878           close
  1.36    0.002770           7       378           mmap
  1.21    0.002454           9       258        72 newfstatat
  0.88    0.001798           5       322           futex
  0.85    0.001722          13       132           getdents64
  0.50    0.001020           9       103           mprotect
  0.43    0.000874           9        92           brk
 ...
------ ----------- ----------- --------- --------- ------------------
100.00    0.203227          11     17065      1584 total

So we're dominated by reads, most of which come from opening all of the .so files.

But in total we're only spending 15-20% of the time doing syscalls (1.38s user 0.18s system).

Taking a perf profile, it looks like we're suffering from the usual culprits for package loading: [perf flame graph attached]

(data is in profile.linux-perf.zip; the image is from https://www.speedscope.app/)

topolarity commented 1 month ago

It's also worth investigating why @KristofferC's results seem to be much better parallelized...

For an apples-to-apples test, here's the build.jl that I had to add to the MWE in the OP:

$ cat deps/build.jl
# build.jl
using CLI; CLI.comonicon_install()

KristofferC commented 1 month ago

This is what I am running (on an M1):

~/JuliaTests/CLI
❯ time julia +1.10 --project -e 'using CLI; exit(CLI.command_main())' -- --help  1> /dev/null
julia +1.10 --project -e 'using CLI; exit(CLI.command_main())' -- --help >   0.56s user 0.33s system 360% cpu 0.248 total

~/JuliaTests/CLI
❯ time julia +1.11 --project -e 'using CLI; exit(CLI.command_main())' -- --help  1> /dev/null
julia +1.11 --project -e 'using CLI; exit(CLI.command_main())' -- --help >   1.58s user 0.82s system 203% cpu 1.181 total