JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.44k stars 5.46k forks source link

Sort allocate x2 more memory in master #47715

Closed gitboy16 closed 1 year ago

gitboy16 commented 1 year ago

Hi, It seems the sort function allocates twice more memory in master vs 1.8.3:

using BenchmarkTools
x = rand(10^8);
@btime sort($x);

In 1.8.3 I get 2 allocations 762 MiB vs 6 allocations 1.49GiB in master. Is that considered a regression and will it be addressed? I see that the time spent is improving from 10.2s to 6.2s. Thank you

PS: I am on Windows and have asked the above question on discourse but did not get any feedback.

brenhinkeller commented 1 year ago

Ah interesting, can reproduce on M1 mac 1.8.3:

julia> using BenchmarkTools

julia> x = rand(10^8);

julia> @btime sort($x);
  7.402 s (2 allocations: 762.94 MiB)

1.10.0-DEV.49:

julia> x = rand(10^8);

julia> @btime sort($x);
  1.712 s (6 allocations: 1.49 GiB)

While it allocates more, the new version is also quite a bit faster, so not entirely sure whether to tag with regression

brenhinkeller commented 1 year ago

Ah, looks like the in-place version remains non-allocating on both versions (while again being quite a bit faster on nightly) 1.8.3:

julia> @btime sort!($x);
  1.128 s (0 allocations: 0 bytes)

1.10.0-DEV.49:

julia> @btime sort!($x);
  109.795 ms (0 allocations: 0 bytes)
LilithHafner commented 1 year ago

It seems the sort function allocates twice more memory in master vs 1.8.3:

We chose to prioritize runtime over allocations but if these allocations pose a problem for your use case then we may have to reconsider. Do those allocations pose a problem in your use case?

Ah, looks like the in-place version remains non-allocating on both versions (while again being quite a bit faster on nightly)

The benchmarks for the in-place version are misleading. The warmup sorts x and all subsequent calls to sort! are with an already sorted vector, so the issorted fast path is triggered and it makes exactly 10^8-1 comparisons and zero writes or allocations.

You can get more truthful benchmarks by ensuring that x is randomized before each sort:

1.8.3

julia> @btime sort!($x) setup=(rand!($x)) evals=1;
  12.421 s (0 allocations: 0 bytes)

julia> @btime sort!(rand!($x));
  15.917 s (0 allocations: 0 bytes)

1.10.0-DEV.27

julia> @btime sort!(rand!($x));
  4.609 s (4 allocations: 762.95 MiB)

julia> @btime sort!($x) setup=(rand!($x)) evals=1;
  4.213 s (4 allocations: 762.95 MiB)
PallHaraldsson commented 1 year ago

I can confirm this is a problem, though probably acceptable (master not as scalable, I can sort almost twice larger arrays on 1.8.3, and I can allocate 2.48x larger arrays on master than I can sort). What's showed is the cumulative memory allocated, but even with it larger, it might not have been, with at any point in time not that much allocated, potentially lower than on 1.8.3. So I tried to find out, how much memory is allocated max for sort and sort! the in-place version more likely to be used.

I'm on Linux, with 32 GB RAM (the largest I could sort with master):

shell> /usr/bin/time -v ~/julia-1.10-DEV-ee0f3fc334/bin/julia -e "x = rand(43*10^7); @time sort!(x)"
 27.876492 seconds (320.23 k allocations: 3.225 GiB, 0.02% gc time, 1.89% compilation time)
[..]
    Maximum resident set size (kbytes): 6936776
[..]
    Minor (reclaiming a frame) page faults: 1713435
[..]

vs. with 1.8.3:

shell> /usr/bin/time -v julia -e "x = rand(43*10^7); @time sort!(x)"
 53.855762 seconds (75.78 k allocations: 3.971 MiB, 0.16% compilation time)
[..]
    Maximum resident set size (kbytes): 3555560

I though "Maximum resident set size" was most interesting, and something useful to track, not sure about the rest.

Master really uses 95% more memory, and that can be a problem with huge sorts. I think master is faster all the way up to its limit, is it important to be able to go higher, which 1.8 can do, with sort in Base? You can always use some other algorithm (even the older one if you can find it in some package).

The largest I could allocate without error (sometimes, seemingly right at the edge, and the edge likely moves over time), possibly a bit larger for master or I got lucky:

shell> /usr/bin/time -v ~/julia-1.10-DEV-ee0f3fc334/bin/julia -e "x = rand(103*10^7)"
Command terminated by signal 9

vs. what sometimes succeeded on 1.8.3:

shell> /usr/bin/time -v julia -e "x = rand(97*10^7)"
Command exited with non-zero status 1

Note, also different error messages when going to high. I didn't explore, but I suppose I could catch OOM on 1.8 with an exception, but not on master...

shell> /usr/bin/time -v julia -e "x = rand(84*10^7); @time sort!(x)"
109.869118 seconds (75.78 k allocations: 3.971 MiB, 0.09% compilation time)

I didn't find the exact limit to sorting on 1.8.3 (since timing is slow), but at least this failed:
shell> /usr/bin/time -v julia -e "x = rand(2*43*10^7); @time sort!(x)"
Command exited with non-zero status 1
PallHaraldsson commented 1 year ago

At best I can allocate only slightly more (i.e. 2% more succeeded) on 1.8.3 than on master (those are both amounts that sometimes succeed):

shell> /usr/bin/time -v ~/julia-1.10-DEV-ee0f3fc334/bin/julia -e "x = rand(129*10^7)"
Command terminated by signal 9

shell> /usr/bin/time -v julia -e "x = rand(132*10^7)"
Command exited with non-zero status 1

Note, RAM is a fixed amount, and I tried to not do anything else with the system, the random factor explaining why a certain amount sometimes fails isn't because of Julia (or rand, that just works on the memory after it's allocated and can't fail on its own).

Sort however isn't deterministic. It uses randomized quicksort by now (and scratch space, why you get allocations, I believe you can pass it in if you want none), so since its non-deterministic, and thus its timing, I suppose too its allocations could be.

Note, this is misleading:

x = rand(43*10^6); @btime sort!($x);
  106.843 ms (0 allocations: 0 bytes)

Because when you do your first sort! it allocates:

julia> @time sort!(x);
  2.557176 seconds (4 allocations: 328.072 MiB, 0.28% gc time)

[not because of compilation]

but if you do it again:

julia> @time sort!(x);
  0.140646 seconds

no allocations, since then the array was already sorted, and sort is much quicker (not a fast-path per se, at least with the usual quicksort I know, just the behavior of quicksort on presorted data, while here it might be use a different code-path avoiding potential allocations, likely always guaranteed to not allocate with presorted), that's what happens with @btime too, and even this is misleading, doesn't show clearly allocations sometimes happening:

julia> x = rand(43*10^6); @benchmark sort!($x)
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  105.709 ms … 181.236 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     112.430 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.412 ms ±  12.162 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▃▆█▃ ▆ ▃                                                      
  █████▄█▄█▄▇▄▄▄▄▁▄▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  106 ms           Histogram: frequency by time          181 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I guess BenchmarkTools.jl intentionally ignores the first run, as compilation often triggers allocs and GC. Maybe an issue should be filed for it. Let it do compilation, and not count its allocations (likely already works that way), but if you then get 4 allocations and not any in next 43 samples, i.e. 4/ = 0.09 per sample, not show as "allocs estimate: 0", but possibly "allocs estimate: 0+"?

PallHaraldsson commented 1 year ago

Do those allocations pose a problem in your use case?

Not mine. I was just looking into since I saw this issue, and I see you answered, but not before I posted, so it's at least partly redundant.

It's going to be obvious that sort, even sort! allocates if you check the right way, with @time (not BenchmarkTools), so I think real-time people should be aware, and cautious (or anyone who wants to avoid allocations for the right reasons, I think here might just scare people needlessly). But considering [quick]sort is usually in-place, non-allocating, it may be a bit confusing it now is, and might need to be documented if it isn't already, and which algorithm or option to use if you really need no allocations. [Otherwise] I don't have issues with your good work; don't think the code itself need to be changed.

brenhinkeller commented 1 year ago

Seems like this is intended and worth the tradeoff then, but we can always revisit if it becomes blocking for anyone. There are also packages that may be useful, including (if you really want no allocations and are willing to pay the time performance cost):

julia> using BenchmarkTools, Random

julia> x = rand(10^8);

julia> @btime sort!($x) setup=(rand!($x)) evals=1;
  1.636 s (4 allocations: 762.95 MiB)

julia> using VectorizedStatistics

julia> @btime vsort!($x, multithreaded=false) setup=(rand!($x)) evals=1;
  11.477 s (0 allocations: 0 bytes)
gitboy16 commented 1 year ago

For me it is an issue, because my system does not have a lot of memory. At least is it possible to keep the behaviour of 1.8.3 , i.e. via a flag or something? Some of you might want more speed some like me prefer a system that consume less memory especially when I do not have the lastest type of hardware some of you have....

KristofferC commented 1 year ago

What's the application where you are sorting arrays with a size of order more than half your RAM?

Did you notice this by running out of RAM on with an application on master vs 1.8?

gitboy16 commented 1 year ago

I work in finance and i have an application that computes risk measures that I translated to julia from c++. The c++ version consumes much less memory and cpu than the julia version. When i tested with master i noticed a slow down versus 1.8.3 because it was using much more memory and "starving"... after investigation i noticed that the sort function was consuming twice much memory than before...

brenhinkeller commented 1 year ago

@LilithHafner would it be worth adding a completely non-allocating version of sort! that also lets you explicitly provide and reuse all the necessary buffers?

gitboy16 commented 1 year ago

I first noticed the memory consumption in a discussion with @LilithHafner here: https://github.com/LilithHafner/InterLanguageSortingComparisons/issues/1 But at the time i did not port my code to julia. You can find the spec of my machine there. You will also see that on that machine boost block_indirect_sort is faster than julia sort however @LilithHafner does not have the same results. Hope it helps in improving julia sort. Thank you

brenhinkeller commented 1 year ago

It seems like a fully in-place version or variant of sort! would be worth having in general anyways and not hard to implement, so maybe it's safe to call this a "regression" of sorts until we can implement that

LilithHafner commented 1 year ago

@gitboy16, thank you for reporting this issue. Could you please share a MWE that demonstrates the behavior you describe? I.e. a piece of code that runs faster on 1.8.3 than on master.

StefanKarpinski commented 1 year ago

Maybe we can keep the old QuickSort type and add a StableQuickSort instead which becomes the default? Then anyone who needs the really low allocation behavior can explicitly ask for QuickSort. I don't think the new sort can properly be called quicksort anyway, it seems to have different enough properties: stable, needs a buffer.

gitboy16 commented 1 year ago

@LilithHafner , can't provide the code and it will depends on your hardware but all you need is to run close to the memory limit of your machine. With 1.8.3 it stays slightly under the limit but with master it exceeds it and everything starts to slowdown...

LilithHafner commented 1 year ago

I'll try to fix this before the 1.9 release.

In this case, we are using radix sort for 1.10 and the original QuickSort for 1.8. 1.8 also has partial quicksort which has been merged with 1.8's QuickSort into the single implementation of 1.10's stable and allocating PartialQuickSort (of which a special case is a stable and allocating QuickSort).

Do folks think we also need to restore the separate slower and non-allocating 1.8 PartialQuickSort?

Seelengrab commented 1 year ago

can't provide the code and it will depends on your hardware but all you need is to run close to the memory limit of your machine. With 1.8.3 it stays slightly under the limit but with master it exceeds it and everything starts to slowdown...

If you're trying to simulate an environment with little memory and you're on a systemd based system, you can use this when launching julia to prevent it from using too much physical memory at the same time:

systemd-run --user --scope -p MemoryMax=XXG -p MemorySwapMax=0 julia

--user runs it in the scope of the current user, --scope doesn't create a service but just a resource scope (so the thing systemd creates isn't persistently stored), MemoryMax=XXG limits the maximum useable memory at one time to XXG gebibytes (use K or M for kibi-/mebibytes respectively - those are using powers of 2, not powers of 10) and MemorySwapMax=0 prevents the process from cheating by swapping out memory :)

Quick demo:

[sukera@tempman ~]$ systemd-run --user --scope -p MemoryMax=1G -p MemorySw
apMax=0 julia -q
Running scope as unit: run-r28284510d81b4f7ab451813b057f3500.scope
julia> Sys.free_memory() |> Int
930631680

julia> (2e10 |> Int) / 10^C

julia> Sys.free_memory() / 1024 / 1024 / 1024
0.8667106628417969

julia> rand(Int(2e9))^C

julia> 2e9 / 1024 / 1024 / 1024
1.862645149230957

julia> rand(Int(2e9))
Killed  ## This is due to the OOM-killer - maybe julia should detect/prevent this itself?

As long as GC is checking the free memory instead of the free physical memory, this should work (loading GLMakie and plotting with it worked under this, at least, so I'd be surprised/expect it to be a bug if it didn't).

gitboy16 commented 1 year ago

Nice! Is systemd equivalent to launchd on macos?

Seelengrab commented 1 year ago

Probably? I think the only thing they have in common is that they're both init systems for their respective OS. I'm not personally familiar with launchd, but if it's anything like systemd, it should have similar functionality for limiting the amount of resources a program started with it can use.

gitboy16 commented 1 year ago

In master how can I use the same sorting algorithm as 1.8.3 to avoid allocating too much memory? Thanks

LilithHafner commented 1 year ago

sort!(my_vector; alg=QuickSort)