Open ptiede opened 1 year ago
Is this reproducible on 1.8?
I am checking that now, but I also noticed another strange thing. When I just evaluate the function
function test(l, x)
    for i in 1:10_000
        arrs = map(_->Zygote.gradient(l, x), 1:1_000)
        arrs = nothing
    end
end
my memory usage increases monotonically and very quickly, and I trigger an OOM on 1.9. When I enable GC logging I typically see something like
GC: pause 2.67ms. collected 48.309920MB. incr
but as far as I can tell I never see a full sweep triggered.
On 1.8, when I run the exact same program, my memory definitely doesn't increase as fast, and the GC logging typically reports
GC: pause 10.51ms. collected 799.020600MB. incr
and every once in a while, I get a full sweep triggered
GC: pause 102.53ms. collected 1338.146312MB. full
I am not sure if what I am seeing is related to the original issue but something changed between 1.8 and 1.9 for me.
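One way to check the incremental-versus-full pattern without reading the log by eye might be to compare GC counters before and after a workload. The sketch below rests on several assumptions: `Base.gc_num()` and its `pause` and `full_sweep` fields are unexported internals whose names may differ between Julia versions, `gc_counts` and `churn` are hypothetical names, and the allocation-heavy `churn` only stands in for the Zygote gradient calls. The per-collection log lines quoted above come from `GC.enable_logging(true)`, available since Julia 1.8.

```julia
# Count GC pauses and full sweeps that happen while running `f`.
# NOTE: Base.gc_num() is internal; the field names are an assumption
# based on Julia 1.8/1.9 and may change in other versions.
function gc_counts(f)
    before = Base.gc_num()
    f()
    after = Base.gc_num()
    return (pauses = Int(after.pause - before.pause),
            full_sweeps = Int(after.full_sweep - before.full_sweep))
end

# Hypothetical stand-in workload: allocate many short-lived arrays,
# mimicking the map/discard pattern of the original loop.
function churn(n)
    for _ in 1:n
        arrs = map(_ -> rand(100), 1:1_000)
        arrs = nothing
    end
end

counts = gc_counts() do
    churn(200)
    GC.gc(false)  # request an incremental (non-full) collection
end
println("pauses = ", counts.pauses, ", full sweeps = ", counts.full_sweeps)
```

Running the same script on 1.8 and 1.9 and comparing the two counters would give a rough, log-free view of how often each version escalates to a full sweep.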
I originally posted this in the Julia Slack and was told I should open an issue about it. I am running some MCMC analysis that requires me to evaluate a function and its derivative millions of times, and I am seeing a massive performance hit (20-30x) after a few hours, even though nothing about the function has changed.
At the beginning of the julia session, I get the following benchmark results
After evaluating the same function for a few hours, I get the following:
The major difference is the GC time, which is 99% of the runtime. The function does allocate and has some type instabilities due to Zygote, but there is no internal state that I am modifying (e.g., appending to an array). Additionally, I am not out of RAM, and my RAM usage sits around 20% the entire time. I should also note that the GC time doesn't seem to increase gradually; at some point, something happens, the GC pause suddenly increases drastically, and it never gets better.
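To track when the GC share of the runtime jumps, the fraction can be sampled periodically during a long run. This is a minimal sketch, assuming Julia 1.5 or later, where `@timed` returns a named tuple with `time` and `gctime` in seconds; `gc_fraction` and `work` are hypothetical names, and `work` is only a placeholder for the real function and its gradient.

```julia
# Fraction of elapsed time spent in GC while running `f` once.
function gc_fraction(f)
    stats = @timed f()
    return stats.time > 0 ? stats.gctime / stats.time : 0.0
end

# Hypothetical allocating workload standing in for the real likelihood.
work() = sum(sum(rand(1_000)) for _ in 1:10_000)

frac = gc_fraction(work)
println("GC fraction: ", round(100 * frac; digits = 1), "%")
```

Logging this value every few thousand MCMC steps would show whether the slowdown arrives as a single step change, as described above, or creeps up gradually.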
Chatting with @oscardssmith, I found out that the GC is locked to full sweep, and every time the GC pauses, I see something like
even if I trigger it with GC.gc(false). I don't have an MWE yet since the code is pretty lengthy, and it takes a few hours on my machine to trigger it, but I'll continue trying to find something that reproduces this. If it is useful, I can set up a repo with the code I am using to trigger this, but it is pretty large.
Version Info
The project is