JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.73k stars 5.49k forks source link

The GC often doesn't act generational #40644

Open oxinabox opened 3 years ago

oxinabox commented 3 years ago

Julia has a generational mark and sweep GC. One of the properties of a mark and sweep GC is that it needs to go through and mark every reference object in memory to decide if it should be swept whenever the GC runs. This is very expensive if you have a lot of objects in memory, and the cost is incurred even if those objects are not being freed (or even read). Which is where the generational aspect comes in. It applies a heuristic to break the objects in memory into generations. Things that have been referenced for a long time (in some sense) are likely to continue to be referenced, and so are moved into the older generations. The generational GC should mark and sweep the young generations very often, and the older generations more rarely. This should mean if you have some large object with lots of references that are e.g. allocated at the start of your program and kept until the end of the program, it will make the GC slow at first since it is so large and takes so long to mark, but after a little while it will move into the older generations and very rarely be marked, and so cease to have an effect on the time it takes to do normal GCs.

The following examples use ZonedDateTimes from TimeZones@1.5.3 which are for now not isbits though that is being worked on (https://github.com/JuliaTime/TimeZones.jl/issues/271). You can create a similar example that uses any non-isbits objects such as mutable objects, things with abstract typed fields, or things with non-isbits fields (such as String or any Vector). This one I had handy.

You can see in this first example the generational GC working. At first, after the zdts are allocated there is a sharp jump in the time spent on GC, up to around 90%. But then after a few rounds, the zdts have moved into the older generation, and the time shifts spent on GC goes back down nicely to what it was at the start: around 20%

julia> # A function that creates a decent amount of garbage,
       little_burn() = sum(hash, ["trashfire"^min(1000, i) for i in 1:10_000])
little_burn (generic function with 1 method)
julia> @time little_burn();
  0.091142 seconds (10.00 k allocations: 82.467 MiB, 20.53% gc time)

julia> using TimeZones
julia> zdts = [now(tz"UTC") for _ in  1:100_000_000];
julia> for i in 1:10
           @time little_burn()
       end
  0.566361 seconds (10.00 k allocations: 82.467 MiB, 89.26% gc time)
  0.620691 seconds (10.00 k allocations: 82.467 MiB, 88.32% gc time)
  0.051410 seconds (10.00 k allocations: 82.467 MiB)
  0.603839 seconds (10.00 k allocations: 82.467 MiB, 90.91% gc time)
  0.060557 seconds (10.00 k allocations: 82.467 MiB, 17.61% gc time)
  0.066966 seconds (10.00 k allocations: 82.467 MiB, 18.17% gc time)
  0.058870 seconds (10.00 k allocations: 82.467 MiB, 19.43% gc time)
  0.060740 seconds (10.00 k allocations: 82.467 MiB, 15.33% gc time)
  0.064070 seconds (10.00 k allocations: 82.467 MiB, 19.75% gc time)
  0.060747 seconds (10.00 k allocations: 82.467 MiB, 21.03% gc time)
  0.061915 seconds (10.00 k allocations: 82.467 MiB, 18.41% gc time)

However, it's very easy to run a different function that does not display the generational behavior burn() is identical to little_burn but runs for 10x longer and allocates 10x as much memory. Like little_burn it never actually touches the zdts. with burn however the GC time always stays at 60-80%. This is due to it triggering a full sweep of both generations of our GC. Thus it keeps needing to mark all of the big and complex zdts which is slow.

julia> # A function that creates a decent amount of garbage,
       burn() = sum(hash, ["trashfire"^min(1000, i) for i in 1:100_000])
burn (generic function with 1 method)
julia> @time burn();
  0.913665 seconds (100.00 k allocations: 863.183 MiB, 19.98% gc time)

julia> using TimeZones
julia> zdts = [now(tz"UTC") for _ in  1:100_000_000];
julia> for i in 1:10
           @time burn()
       end
  3.127583 seconds (100.00 k allocations: 863.183 MiB, 78.35% gc time)
  2.073914 seconds (100.00 k allocations: 863.183 MiB, 64.82% gc time)
  2.355717 seconds (100.00 k allocations: 863.183 MiB, 77.97% gc time)
  1.842793 seconds (100.00 k allocations: 863.183 MiB, 66.04% gc time)
  2.335667 seconds (100.00 k allocations: 863.183 MiB, 74.24% gc time)
  1.804693 seconds (100.00 k allocations: 863.183 MiB, 66.61% gc time)
  2.344848 seconds (100.00 k allocations: 863.183 MiB, 73.77% gc time)
  2.396741 seconds (100.00 k allocations: 863.183 MiB, 74.44% gc time)
  1.883768 seconds (100.00 k allocations: 863.183 MiB, 64.91% gc time)
  2.262383 seconds (100.00 k allocations: 863.183 MiB, 77.07% gc time)
  1.835220 seconds (100.00 k allocations: 863.183 MiB, 64.84% gc time)

I have seen real code that takes tens of seconds to do a full mark and sweep. We really don't want to be running that very often. (if we are running that kind of mark and sweep too often it might even be better to OOM error)

It seems we need to do something to the GC to have it do a full mark and sweep less often. This might be tweaking the heuristics. It might be adding another generation. it might be both

JeffBezanson commented 2 years ago

I'm going to say this is fixed by #44215. Of course, please post more examples if there are still bad cases.

oscardssmith commented 2 years ago

Specifically, if you have examples, https://github.com/JuliaCI/GCBenchmarks is collecting them.

vtjnash commented 1 year ago

I was trying this again today, and my machine apparently just decided to run out of memory instead of ever running GC (over 30GB heap)

oscardssmith commented 1 year ago

What julia version?

vtjnash commented 1 year ago

For a related example, I paused the task for a bit inside GC to print the stats, and when I resumed it, it decided that it must now wait for 10GB of new allocations before being eligible for the next GC. I am a little confused how it thinks only 67MB are mapped right now also?

(lldb) p gc_heap_stats
(gc_heapstatus_t) $1 = (bytes_mapped = 67_125_248, bytes_resident = 67_125_248, heap_size = 7_192_808_358, heap_target = 17_053_423_782)

I didn't grab stats for the first run, but I noticed on later runs that after the first 5 GB are allocated, usually the next GC is scheduled for after the heap reaches 10-20 GB.

vtjnash commented 1 year ago
julia> versioninfo()
Julia Version 1.11.0-DEV.165
Commit 76a9772c13* (2023-07-24 15:47 UTC)
gbaraldi commented 1 year ago

Ooops. I'll take a look