Closed sprig closed 3 months ago
Julia 1.10 generally does a lot better here. Can you try upgrading to 1.10.4?
Thanks for the reply!
> Julia 1.10 generally does a lot better here. Can you try upgrading to 1.10.4?
Sorry, you may have missed this, but I mentioned this failing in a 1.10.4 test environment as well; see the second-to-last quote block.
~However, I did some more testing and on Windows 11 I was not able to reproduce this with the following version:~
Edit: I had a bug in the Windows code, since I also had to use ZipFile to unpack the data on the fly. I was able to reproduce on Windows 11 with both 1.10.2 and 1.10.4. Here's the versioninfo from 1.10.2 (currently the default version on that machine). I can unpack the data ahead of time to avoid ZipFile for some tests if necessary, but given that the MWE works on Linux I've avoided this for now.
> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8 (2024-03-01 10:14 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Installed via juliaup, itself installed via the Windows app store. ~I'll check to see whether this regresses on 1.10.4.~ EDIT: Yes, see above.
Edit:
I tested a bit more on Windows and was able to reproduce with 1.10.2, 1.10.4, 1.11, as well as nightly. Here's the versioninfo from nightly:
> versioninfo()
Julia Version 1.12.0-DEV.783
Commit 07f7efd835 (2024-06-25 17:20 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LLVM: libLLVM-17.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 8 GC (on 8 virtual cores)
On Windows, I eventually start seeing a string of messages like
┌ Error: OutOfMemoryError()
└ @ Main REPL[7]:9
then
Internal error: during type inference of
string(ReadOnlyMemoryError)
Encountered unexpected error in runtime:
ReadOnlyMemoryError()
and finally julia as well as the terminal window where it was running are both killed.
As a side note, I'm not entirely sure why 10G of strings should take 90G of memory, regardless of GC.
In my experience, if you want the GC to free all memory available for freeing, `GC.gc` should be called multiple times in a loop, not just once.
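As a sketch of that pattern, requesting a few full collections back to back (the loop count of 3 is arbitrary):

```julia
# Ask for several full collections in a row; a single call does not
# always return all freeable memory.
for _ in 1:3
    GC.gc(true)   # true requests a full (non-incremental) collection
end
```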
Can you do `GC.enable_logging()`?
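For reference, a minimal sketch of turning GC logging on and off (available since Julia 1.8):

```julia
GC.enable_logging(true)   # print a message for each collection
GC.gc()                   # logs lines like "GC: pause ...ms. collected ...MB. full"
GC.enable_logging(false)  # turn logging back off
```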
@gbaraldi Did this on Windows, with 1.10.2 (as above).
Initially I got a bunch of messages like so:
This while Julia is maxing out the memory available to the system. Eventually Windows showed me this:
And shortly afterwards the julia terminal was forcibly closed.
I'll try running this again a bit later in a terminal that keeps logs of output to see whether anything more interesting is shown by the GC debug output right before julia dies.
Well, it seems that the memory it's allocating is not being freed, not that the GC isn't running. But it deserves a further look anyway.
> Well it seems that the memory it's allocating is not being freed, not that the GC isn't running. But deserves a further look anyway
Agreed. I guess it's more of a memory leak than GC not running.
I did some further testing, and even when adding `GC.gc()` to the end of the `search` function above, I still run out of memory. Additionally, I ran this in tmux so that I could get the full report; this is how the process ends (still on Windows):
GC: pause 24.19ms. collected 138.104522MB. full
GC: pause 26.80ms. collected 86.853422MB. incr
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
[... many more identical OutOfMemoryError messages ...]
julia> ┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
"Z:/data/file1345" => true┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
Internal error: encountered unexpected error in runtime:
ReadOnlyMemoryError()
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
┌ Error: OutOfMemoryError()
└ @ Main REPL[10]:9
My current suspicion is that having an excessive number of tasks fills up the memory while context switching. I will try this again with a feeder task/channel and a finite number of consumer threads.
Well, using a feeder thread with a fixed number of workers keeps the Julia process using a fixed amount of memory. Sorry for the noise.
> I guess it's more of a memory leak than GC not running.
You seem to understand the issue and the fix, but it's unclear if others will, and they might blame the GC. Can and should something be documented about this that isn't currently?
Sure, I will add additional explanation here. I've experienced uncontrollable memory growth in two situations so far:
One is when I allocated a large matrix in a tight loop. I have not created an MWE for this yet. In that case, adding `GC.gc()` to the end of the loop helped mitigate runaway memory, but eventually I was able to refactor the code to avoid the allocation altogether and instead use a preallocated buffer. I think this case was still a valid problem, since sometimes such a buffer cannot be made and one would still want the memory to be freed after it's no longer being used. However, I haven't created an actual example for this yet since I was able to avoid allocations.
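As a sketch of that kind of refactor (function names made up for illustration), replacing a per-iteration allocation with a preallocated buffer filled in place:

```julia
using Random: rand!

# Allocating version: a fresh matrix every iteration keeps the GC busy.
function sum_alloc(n, iters)
    acc = 0.0
    for _ in 1:iters
        M = rand(n, n)        # new allocation each iteration
        acc += sum(M)
    end
    acc
end

# Preallocated version: one buffer, reused across iterations.
function sum_prealloc(n, iters)
    M = Matrix{Float64}(undef, n, n)   # allocated once
    acc = 0.0
    for _ in 1:iters
        rand!(M)              # fill in place, no new allocation
        acc += sum(M)
    end
    acc
end
```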
The example above is still problematic, since it is using about 10x more memory than it should to hold a few thousand strings. My guess is that `contains` gets compiled per string to make searches efficient? Otherwise I don't understand what would cause such high memory usage.
Either way, what I think happens is that the MWE code above spawns one task per file, and they all race to load the data, with few tasks actually finishing before memory is depleted. The workaround was to use a different pattern: one task+channel for assigning work, another task+channel for consuming the results, and `Threads.nthreads()` worker tasks that loop over the work, actually load the data, and perform some work on it (search it for a string, in the example above).
Roughly, add

```julia
# Feeder: one task puts every file name on the work channel,
# which is closed automatically when the task finishes.
cin = Channel(spawn=true) do c
    foreach(fl) do fn
        put!(c, fn)
    end
end
```
change `search` to just perform the work:

```julia
function search(fn)
    s = readchomp(fn)
    res = contains(s, r"TEST")
    @info "finished"
    (res, fn)
end
```
and add some worker tasks:

```julia
function worker(cin, cout)
    foreach(cin) do fn          # iterate until cin is closed and drained
        put!(cout, search(fn))
    end
end

cout = Channel(Threads.nthreads())
workers = map(1:Threads.nthreads()) do _
    Threads.@spawn worker(cin, cout)
end
```
This way instead of several thousand tasks racing all to load data all at once, there is a fixed amount of work being done concurrently and memory stays bounded. Hopefully this is helpful for anyone who stumbles on this issue in the future.
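Putting those pieces together, here is a self-contained sketch of the bounded-concurrency pattern; the file list and the `r"TEST"` pattern are placeholders, not the exact code from the thread:

```julia
using Base.Threads

# Load one file and search it for a pattern.
function search(fn)
    s = readchomp(fn)
    (contains(s, r"TEST"), fn)
end

# Worker: consume file names until the work channel closes.
function worker(cin, cout)
    foreach(cin) do fn
        put!(cout, search(fn))
    end
end

function bounded_search(fl)
    # Feeder: push all file names, channel closes when the task finishes.
    cin = Channel{String}(spawn=true) do c
        foreach(fn -> put!(c, fn), fl)
    end
    cout = Channel{Tuple{Bool,String}}(nthreads())
    workers = map(1:nthreads()) do _
        Threads.@spawn worker(cin, cout)
    end
    # Close the result channel once every worker is done.
    Threads.@spawn begin
        foreach(wait, workers)
        close(cout)
    end
    collect(cout)
end
```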
Hi!
I'm experiencing issues in different contexts where the GC doesn't run, eventually exhausting the available memory and often having the OOM killer kill the Julia process, despite the GC having the opportunity to free much memory. I have experienced this both in numerical loops that happen to allocate memory and in simple data processing.
MWE:
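A hypothetical reconstruction of the kind of MWE described in this thread (one spawned task per file, each reading the file, searching it, and reporting over a channel); `dir` and the search pattern are placeholders, not the original code:

```julia
using Base.Threads

# One task per file: read the whole file, search it for a pattern,
# and report the result (or log any error) over a shared channel.
function search(channel, fn)
    try
        s = readchomp(fn)
        put!(channel, fn => contains(s, r"TEST"))
    catch e
        @error e
    end
end

function spawn_all(dir)
    fl = readdir(dir; join=true)
    channel = Channel{Pair{String,Bool}}(length(fl))
    for fn in fl
        Threads.@spawn search(channel, fn)
    end
    channel
end
```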
Focusing on resident memory: in a fresh Julia session, Julia is using about 200M:
After running the code above with `/data` holding roughly 25k files of roughly equal size, totaling ~10GB, the Julia process is consuming almost 90G (which seems slightly excessive even if everything was actually loaded at once). After `GC.gc()` we're back to using 300M. Finally, after `GC.gc(true)` we're up to using 400M, although virtual memory usage decreased.

If I add a `Base.Event` lock inside `search`, memory usage does not begin to increase until the `Event` is `notify`ed, so this is not due to e.g. creating excessive tasks. Alternatively, if I replace the `@spawn` loop by a simple `map` or even a `Distributed.pmap` (without using a channel to communicate back results), memory remains bounded. However, if I do the processing via `Distributed.@spawnat :any search(channel, fn)` in a loop/map, memory grows again, and again is released with an `@everywhere GC.gc()`.

A different context where I experienced this, completely unrelated to channels (although it was run in parallel using `pmap`), was in code that allocated new matrices in a loop for the minors of a larger matrix. A manual `GC.gc()` at the end of every iteration maintained constant usage instead of having it grow unbounded (until I refactored the code to avoid allocations altogether).

First experienced here:
This is via the following container running in podman on Linux with the following image: https://hub.docker.com/layers/jupyter/datascience-notebook/x86_64-julia-1.9.3/images/sha256-98c2b44b4e44e044a8670ac27b201704e5222f8a7d748eb7cfd94a2cdad52e7d
As a test environment, in a fresh Ubuntu podman container I downloaded the Julia tarball from julialang.org, unpacked it inside `/opt/`, and ran it by path:

```
root@fc72b99c6484:/opt# julia-1.10.4/bin/julia -t 8
```
On the host: