JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

mmap failure with address space quotas #10390

Closed JonathanAnderson closed 8 years ago

JonathanAnderson commented 9 years ago

When I build from 3c7136e and run julia, I get the error "could not allocate pools". If I run as a different user, Julia runs successfully.

I think there might be something specific to my user on this box, but I am happy to help identify what is happening here.

I think this is related to https://github.com/JuliaLang/julia/pull/8699

also, from the julia-users group: https://groups.google.com/forum/#!topic/julia-users/FSIC1E6aaXk

JeffBezanson commented 9 years ago

This error is from a failing mmap, where we try to allocate 8GB of virtual address space. There might be a quota on virtual memory for some users.

pao commented 9 years ago

Oops, misread the issue, sorry.

ivarne commented 9 years ago

-v: address space (kb) 8000000 (from the julia-users thread) seems to indicate that an 8GB allocation is guaranteed to cause trouble.
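To make the arithmetic concrete: ulimit -v reports its limit in units of 1024 bytes, so 8000000 is 8,192,000,000 bytes, which is less than 8 GiB (8,589,934,592 bytes). The Python sketch below checks that, assuming the 8GB reservation means 8 GiB. The helper name reservation_fits is hypothetical, and the check is only a necessary condition, because the address space limit covers every mapping in the process and not just this one.

```python
import resource

KIB = 1024
GIB = 1024 ** 3


def reservation_fits(request_bytes, limit_bytes):
    """Return True if a single mapping of request_bytes could fit under the limit.

    limit_bytes is an RLIMIT_AS soft limit, or resource.RLIM_INFINITY.
    Hypothetical helper for illustration; a real process also needs room
    for its code, stacks and other mappings.
    """
    if limit_bytes == resource.RLIM_INFINITY:
        return True
    return request_bytes <= limit_bytes


# The limit quoted from the julia-users thread, converted to bytes.
quoted_limit = 8000000 * KIB

if __name__ == "__main__":
    print(reservation_fits(8 * GIB, quoted_limit))  # False
    # The same check against the limit of the current process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    print(reservation_fits(8 * GIB, soft))
```

With the proposed 4GB reservation the same check passes for this particular limit, although the rest of the process still has to fit in what remains.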

JeffBezanson commented 9 years ago

@carnaval Could we decrease this to, say, 4GB, to make this issue less likely?

tkelman commented 9 years ago

an 8gb array is also too large for msvc to compile, fwiw

tkelman commented 9 years ago

Can you try reducing the number on https://github.com/JuliaLang/julia/blob/e1d6e56a0781e1c453d541b87ab2cfbc14c364c1/src/gc.c#L88 by a factor of 2 or 4, see if it helps? I could also make a test branch with that change to have the buildbot make test binaries if that would be easier.

carnaval commented 9 years ago

We can certainly lower this. I set it that high under the reasoning that address space was essentially free on x64. As I understand it, operations are either O(f(number of memory mappings)) or O(f(size of resident portion)) so it should not hurt performance. I didn't think of arbitrary quotas, but it's probably better to ask: does anyone know of any other drawback to allocating "unreasonable" amounts of virtual memory on a 64-bit architecture?

ScottPJones commented 9 years ago

@carnaval Yes, indeed... lots of performance issues if you have very large amounts of memory mapped... which is why people use huge page support...

carnaval commented 9 years ago

Keep in mind I'm still talking about uncommitted memory. The advantage of huge pages is reducing TLB contention as far as I know, and uncommitted memory sure won't end up in the TLB.

Generally, as far as my understanding of the kernel VM system goes, "dense" data structures (such as the page table, for which the TLB acts as a cache) are only filled with committed memory. The mapping itself stays in a "sparse" structure (like a list of mappings), so you only pay costs relative to the number of mappings. I may be wrong though, so I'll be happy to be corrected.

ScottPJones commented 9 years ago

I'm talking about memory that has actually been touched, i.e. committed. The point is that if you have an (opt-in, at least) limit in the language, instead of relying only on things like ulimit, you can (at least in my experience) better control things and keep them from getting to the point where the OS goes belly up. Say you have 60,000 processes running, which you know only need, say, 128M (unless they somehow get out of control due to some bug): having the limit protects you. You may also have different classes of processes that need more memory (say, for loading a huge XML document), so it's important to be able to give those a higher limit dynamically (based on user roles).

carnaval commented 9 years ago

That's not what my question was about though. We are already careful to decommit useless pages.

The limit is another issue, to enforce it strictly would probably require parsing /proc/self/smaps from time to time anyway to be sure some C library is not sneaking around making mappings.
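As a rough idea of what "parsing /proc/self/smaps from time to time" could look like, here is a small Python sketch. It only sums the Size: field of each mapping, which gives the total mapped address space in kB. The function names are hypothetical, and this is an illustration of the file format rather than anything Julia does.

```python
import os


def total_mapped_kib(smaps_text):
    """Sum the Size: field (in kB) of every mapping in smaps-formatted text."""
    total = 0
    for line in smaps_text.splitlines():
        # Each mapping has one "Size:  <n> kB" line; fields such as
        # KernelPageSize: do not start with "Size:" and are skipped.
        if line.startswith("Size:"):
            total += int(line.split()[1])
    return total


def current_mapped_kib(path="/proc/self/smaps"):
    """Total mapped size of the calling process, read from procfs (Linux only)."""
    with open(path) as f:
        return total_mapped_kib(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/smaps"):
        print(current_mapped_kib(), "kB mapped")
```

Because the file lists every mapping in the process, mappings made behind the runtime's back by a C library show up here as well, which is the reason for polling it.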

ScottPJones commented 9 years ago

Yes, but does the current system ever try to proactively cut down on caches, etc., so that it can free up some memory?

It doesn't really have to be done strictly, to be useful, without fancy approaches like parsing /proc/... also, for people embedding julia, couldn't things be compiled so that at least malloc/calloc/realloc end up using a julia version, that does keep track? Having some facility to try to increase stability is better than none, even if it can't handle external memory pressures.

carnaval commented 9 years ago

I'm not arguing that we should not do those things. But those are features. I was just trying to check whether someone knew of a kernel that would be slow with large mappings: that would be a regression, not a missing feature.

mauro3 commented 9 years ago

I'm running into a "could not allocate pools" issue on a new build on a new machine (0.3 works fine). (Not sure whether this warrants a new issue or not, let me know.)

It builds fine but it crashes on running the tests, the culprit is addprocs:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6683 (2015-08-12 17:53 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 103f7a3* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> addprocs(3; exeflags=`--check-bounds=yes --depwarn=error`)
could not allocate pools

However, addprocs(2; exeflags=`--check-bounds=yes --depwarn=error`) works. Also, starting more than three REPLs at once produces the error.

As far as I can tell there are no relevant ulimits:

 $ ulimit -a
-t: cpu time (seconds)         unlimited
-f: file size (blocks)         unlimited
-d: data seg size (kbytes)     unlimited
-s: stack size (kbytes)        8192
-c: core file size (blocks)    0
-m: resident set size (kbytes) unlimited
-u: processes                  63889
-n: file descriptors           1024
-l: locked-in-memory size (kb) 64
-v: address space (kb)         unlimited
-x: file locks                 unlimited
-i: pending signals            63889
-q: bytes in POSIX msg queues  819200
-e: max nice                   0
-r: max rt priority            0
-N 15:                         unlimited

On my normal machine the -l option is unlimited, but limiting it there to 64 does not reproduce this behavior.

mauro3 commented 9 years ago

The same problem arises using the Julia nightlies julia-0.4.0-24a92a9f5d-linux64.tar.gz.

Any ideas on how I could resolve this? Should I contact the admin of that machine to change some settings?

carnaval commented 9 years ago

yes, you can remove the 16* here https://github.com/JuliaLang/julia/blob/f40da0f18dbdfad24398ffb84d7fe2cdf40b5099/src/gc.c#L164 and recompile.

Maybe I should make that the default but it feels so silly to me for admins to restrict addr space, I don't get it really.

mauro3 commented 9 years ago

Yes, that works, thanks! Just to clarify, my understanding from this thread is that it is limiting the -v: address space (kb) which causes this. However, this is unlimited on my machine. So which one is the culprit?

GarrettJenkinson commented 9 years ago

> Maybe I should make that the default but it feels so silly to me for admins to restrict addr space, I don't get it really.

@carnaval I know it is a fairly common practice on HPC clusters that use Sun Grid Engine. Specifically, virtual memory (h_vmem) is limited to the same size as the physical memory requested for the job (mem_free) in order to avoid oversubscription of memory:

https://arc.liv.ac.uk/pipermail/gridengine-users/2005-April/004643.html (see discussion here)
https://jhpce.jhu.edu/2015/07/28/jhpce-cluster-change-to-set-user-memory-limit-tuesday-august-4th-from-600-pm-700-pm/ (see very last item here)

Therefore, with the settings the way they are at present, all julia jobs are forced to request at least ~8-9GB of RAM. The default for my SGE system is 5GB (so julia crashes by default), and it costs extra money to run with more RAM, while blocking other users' access to RAM that you might not otherwise need. I suspect that many julia users are in HPC environments where this is the case. For us, virtual memory is exactly as costly as real memory.

Hope this information is helpful; I know very little about GC or julia internals, but love the work you all continue to do here.

tailsnaf commented 8 years ago

This issue is preventing v0.4.0 from being used where 0.3.x worked fine. I have a RHEL 5.9 machine, 16GB x64 and many packages fail to compile with the 'could not allocate pools' error. I do not appear to have address space limits set, similar to mauro3, yet this issue is happening for me.

Recompiling from source is not an option for everyone: some users do not have full control of their operating environment or all the necessary dev tools. So if this memory allocation is not required, can the edit please be made to reduce the allocation, so that it ends up in the nightlies?

cbecker commented 8 years ago

I am also getting this error. I managed to compile julia-0.4 if I don't start Xorg, so I was happy, but then I get the error when julia tries to pre-compile a module.

My limits look right though

~  ᐅ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       31567
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 31567
-q: bytes in POSIX msg queues       819200
-e: max nice                        30
-r: max rt priority                 99
-N 15:                              unlimited

I compiled from source so I can modify the code and make it work, but this may be a no-go for normal users.

[EDIT: I have 8 gigs of ram]

rudimeier commented 8 years ago

As @carnaval said:

> yes, you can remove the 16* here https://github.com/JuliaLang/julia/blob/f40da0f18dbdfad24398ffb84d7fe2cdf40b5099/src/gc.c#L164
>
> Maybe I should make that the default but it feels so silly to me for admins to restrict addr space, I don't get it really.

Your commit is the only silly thing here. Why do you think malloc() has a size argument at all? Why should one ever malloc something less than 128G?

Leaving such a show stopper unfixed for almost two years is ridiculous. Just fix that please.

timholy commented 8 years ago

@rudimeier, Julia is open source. That means you can fix bugs yourself! Rejoice and be thankful.

rudimeier commented 8 years ago

Seems that the good practice of disabling overcommit globally was not mentioned yet.

$ cat /proc/sys/vm/overcommit_memory
0

https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

Virtual memory is not always "cheap", contrary to the comment in 7c8acce9. On systems where good admins have set overcommit_memory=0, julia would really steal 8GB for nothing. Many distros have overcommit_memory=0 by default, especially the enterprise ones.

Moreover: http://www.etalabs.net/overcommit.html "Overcommit is harmful because it encourages, and provides a wrong but plausible argument for, writing bad software."

BTW Julia would even malloc 8 GB on a machine with less physical memory available. Then there is no way to prevent a crash if the user really starts using it. The only way to prevent this is overcommit_memory=0. But then julia does not even start ... silly.

@timholy I am not a julia user. I'm the admin for users who want to use it. I will not change our memory settings and limits just to let them run broken software.

KristofferC commented 8 years ago

yes, you can remove the 16* here https://github.com/JuliaLang/julia/blob/f40da0f18dbdfad24398ffb84d7fe2cdf40b5099/src/gc.c#L164 and recompile.

nalimilan commented 8 years ago

@rudimeier Please use a more constructive tone. Julia developers didn't ask you to do anything. Nobody here forces you to use "broken" software.

If you ask kindly and with solid technical arguments, we'll be happy to make this work for your users, but complaining with rude words is more likely to be counter-productive.

rudimeier commented 8 years ago

@nalimilan Sorry if I sound rude. Others have already posted many technical arguments, and so did I in my second post. Nevertheless, the whole bug report sounds as if this will never be fixed anyway.

Julia simply locks 8GB physical memory when running on a default Linux kernel (If it runs at all.). Moreover it will not work (at least not well) on clusters or enterprise systems where you almost certainly have to deal with ulimits.

vtjnash commented 8 years ago

the default linux kernel defaults to having no issue with this (per your link above):

> 0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

and it may be slightly more efficient due to fewer syscalls (quoting carnaval above, "As I understand it, operations are either O(f(number of memory mappings)) or O(f(size of resident portion))")

> Julia simply locks 8GB physical memory

this is never actually true, even with overcommit turned off. virtual memory != physical memory.

> Moreover it will not work (at least not well) on clusters

I would have expected that a cluster that uses this flag also ensures that each program is guaranteed to be able to allocate its full allotment. Setting that value currently involves changing a constant and compiling it into gc.c. However, other than that, it shouldn't have any impact on the other users of the cluster (the scheduler should not have permitted them to allocate this memory either) or on Julia (which subdivides this pool on-demand to meet the needs of the program).

julia developers would likely accept a PR to fix this, but the usage of ulimit in this way seems to me to just create additional non-critical work, without providing any actual benefit.

(yes, I looked at the etalabs link above. However, I respectfully disagree with their premises and analogies. graceful malloc failures are good for dealing with bad requests like (int)-1, but not a solution for avoiding OOM. Even if that app is careful about handling malloc failure to revert the database state, does it plan on also handling the ENOMEM failure of read from said OOM condition? Further, why is this database not designed to be robust in the presence of unexpected termination? My impression is that this article is written from the perspective of "software written in C", without real consideration of all of the alternative approaches to language design that exist.)

StefanKarpinski commented 8 years ago

@rudimeier, what specific issue are your users encountering when they try to run Julia?

rudimeier commented 8 years ago

On Thursday 12 November 2015, Jameson Nash wrote:

> the default linux kernel defaults to having no issue with this (per your link above) [...]

Oops, my memory was wrong about the default and its meaning. Actually overcommit_memory = 2 is the case I mean.

> > Julia simply locks 8GB physical memory
>
> this is never actually true, even with overcommit turned off. virtual memory != physical memory.

In case overcommit_memory = 2, each Julia process consumes 8G of /proc/meminfo's "CommitLimit". On a machine with 16G memory and no swap you couldn't start two Julia processes.
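A small worked calculation of that claim, as a Python sketch. The formula is the one given in the kernel's overcommit-accounting document for mode 2 when overcommit_kbytes is not set: CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap. The function names, the 1 GiB of "already committed" memory, and the premise that each Julia process is charged the full 8 GiB are assumptions made for illustration.

```python
GIB_IN_KIB = 1024 * 1024


def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio=50, hugetlb_kib=0):
    """CommitLimit in kB for overcommit_memory = 2 (overcommit_ratio defaults to 50)."""
    return (ram_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_kib


def processes_that_fit(limit_kib, charge_kib, committed_kib=0):
    """How many processes charging charge_kib each still fit under the limit."""
    return max(0, (limit_kib - committed_kib) // charge_kib)


if __name__ == "__main__":
    # 16 GiB of RAM, no swap, 1 GiB assumed to be committed by other processes.
    for ratio in (50, 100):
        limit = commit_limit_kib(16 * GIB_IN_KIB, 0, overcommit_ratio=ratio)
        fit = processes_that_fit(limit, 8 * GIB_IN_KIB, committed_kib=GIB_IN_KIB)
        print(ratio, fit)
    # prints "50 0" and then "100 1"
```

On a live system the two inputs can be read from the CommitLimit and Committed_AS lines of /proc/meminfo instead of being computed.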

> > Moreover it will not work (at least not well) on clusters
>
> I would have expected that a cluster that uses this flag also ensures that each program is guaranteed to be able to allocate its full allotment.

overcommit_memory is for the whole system ... all processes. ulimit is per process. Who knows, on other non-Linux systems you might even have per-user limits, other defaults, or no overcommit support at all.

> julia developers would likely accept a PR to fix this, but the usage of ulimit in this way seems to me to just create additional non-critical work, without providing any actual benefit.

The thing is that on clusters or other multiuser systems the admin has to prevent OOM situations, just as we need to have HD quotas. No user should affect any other user's job. On Linux it's not possible to set a "resident memory limit". That's why we limit virtual memory. It's not "silly", it's just what we have to do.

On systems where a scheduler manages the count of parallel running jobs you would use ulimit to restrict each job so that the sum of all jobs can't exceed the total available memory. On interactively used multiuser machines you would use overcommit_memory = 2 so that all users together at least can't crash the whole machine but hopefully only each other's processes.

> (yes, I looked at the etalabs link above. However, I respectfully disagree with their premises and analogies. graceful malloc failures are good for dealing with bad requests like (int)-1, but not a solution for avoiding OOM. Even if that app is careful about handling malloc failure to revert the database state, does it plan on also handling the ENOMEM failure of read from said OOM condition? Further, why is this database not designed to be robust in the presence of unexpected termination? My impression is that this article is written from the perspective of "software written in C", without real consideration of all of the alternative approaches to language design that exist.)

Don't take the analogies too seriously. Just be aware that this overcommit design only works on systems where admins don't care about stability.

To me, 8GB looks randomly chosen. A hardcoded 512MB malloc would IMO be bad enough, but it shouldn't be too much overhead on any recent real-life system.

StefanKarpinski commented 8 years ago

We're all for adding options to make this work for you, but to do that we need to be able to reproduce the problem and then figure out what to do about it. Can you please post some details about how to make this happen since it doesn't happen with a vanilla Linux configuration? It also doesn't happen on OS X or Windows in the default configuration.

GarrettJenkinson commented 8 years ago

> we need to be able to reproduce the problem and then figure out what to do about it. Can you please post some details about how to make this happen since it doesn't happen with a vanilla Linux configuration? It also doesn't happen on OS X or Windows in the default configuration.

I cannot speak to others' errors here (in particular, @mauro3 said above that they still experienced this issue even though their ulimit -v was set to unlimited). But to mimic the problem that users such as myself experience on a computing cluster, you can make a bash sub-shell (so you don't have to mess with the settings of your machine) that sets a limit of 5GB and open julia in that subshell:

( ulimit -v 5000000; julia )
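The same experiment can be sketched without Julia at all. The Python snippet below starts a child process, lowers the child's address space limit (RLIMIT_AS, the limit that ulimit -v sets), and then tries one large anonymous mapping. The function name is hypothetical, and the 8 GiB request merely stands in for Julia's pool reservation; this is not how Julia itself performs the allocation.

```python
import subprocess
import sys

KIB = 1024
GIB = 1024 ** 3

# Script run in the child: lower the soft limit, then attempt the mapping.
CHILD = """
import mmap, resource, sys
limit, request = int(sys.argv[1]), int(sys.argv[2])
_, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    mmap.mmap(-1, request, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
except (OSError, MemoryError):
    sys.exit(1)
sys.exit(0)
"""


def mapping_succeeds_under_limit(limit_bytes, request_bytes):
    """Return True if a child limited to limit_bytes can map request_bytes."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(request_bytes)]
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Same limit as the subshell command above: 5000000 units of 1024 bytes.
    print(mapping_succeeds_under_limit(5000000 * KIB, 8 * GIB))
```

With that limit in place the 8 GiB request should be refused, which corresponds to the failing mmap behind the "could not allocate pools" message.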

For me, on my vanilla install of Linux Mint, this command causes the "could not allocate pools" error to arise, even though the command julia runs just fine on its own.

Hope that is helpful, sorry if it is a trivial/non-useful example. As a very happy user, thanks again for this great project.

StefanKarpinski commented 8 years ago

That's very helpful – thank you. @carnaval, what are the options here? Should we introduce an environment variable that controls how big that pool is? Can we do without it altogether?

carnaval commented 8 years ago

The intent of the code is to reserve address space (uncommitted memory). It is entirely possible that the way we do it on linux relies on overcommit being enabled and that would be a bug. If there is no way to do it without overcommit I'd be quite sad given how easy it is on windows.

That does not solve arbitrary human limits though. If ulimit -v is a common thing to do then we need to deal with it. We can make those things variable length and have a CLI limit option.

vtjnash commented 8 years ago

@carnaval on linux the overcommit is not counted against the process if it is mapped without PROT_WRITE (https://www.kernel.org/doc/Documentation/vm/overcommit-accounting). the commit charge is later adjusted when mprotect is called to make that range writable (or returns ENOMEM if that would cause an overcommit). the commit charge cannot be adjusted back down without munmapping the region.
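A minimal sketch of that reserve-then-commit pattern, written in Python with ctypes because Python's mmap module does not expose mprotect. It assumes 64-bit Linux (PROT_NONE is 0 and off_t fits in a C long); the names reserve, commit and release are hypothetical, and this is not the code in gc.c.

```python
import ctypes
import mmap
import os

PROT_NONE = 0  # assumption: Linux value; not exported by Python's mmap module

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def _check(ok):
    """Raise OSError from errno when a libc call reported failure."""
    if not ok:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reserve(size):
    """Map size bytes with no access rights: address space only."""
    addr = libc.mmap(None, size, PROT_NONE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    _check(addr is not None and addr != MAP_FAILED)
    return addr


def commit(addr, size):
    """Make part of a reservation readable and writable before first use."""
    _check(libc.mprotect(addr, size, mmap.PROT_READ | mmap.PROT_WRITE) == 0)


def release(addr, size):
    """Unmap the region, giving back both the address space and any charge."""
    _check(libc.munmap(addr, size) == 0)


if __name__ == "__main__":
    region_size = 64 * 1024 * 1024
    region = reserve(region_size)
    commit(region, mmap.PAGESIZE)               # only one page becomes writable
    ctypes.memset(region, 0x2A, mmap.PAGESIZE)  # safe: the page is committed
    print(ctypes.string_at(region, 1))          # b'*'
    release(region, region_size)
```

Touching any page of the reservation that has not been passed to commit would fault, which is the price of deferring the commit charge.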

cbecker commented 8 years ago

Unfortunately, this goes beyond clusters. I don't want to add noise to the thread, but as I commented above [https://github.com/JuliaLang/julia/issues/10390#issuecomment-149851612] I cannot run julia v0.4 on an 8GB RAM Arch-linux distro if KDE is running, for example.

That seems to be a typical scenario, and I found no way to fix it but modifying the code and re-compiling. I feel many people will find this annoying and it may stop them from trying julia after trying to install a package, causing a sudden program exit.

carnaval commented 8 years ago

#13968 should fix most of the concerns here, but I still feel bad about having useless pages count against the committed size.

@vtjnash if I read that correctly, the best we can do is to at least start with PROT_NONE pages and mprotect them on the first allocation? It's weird that there is no proper way to decommit memory, even though obviously infinite overcommit "solves" the problem. I would have thought that at least some combination of madvise & mprotect would do the job.

vtjnash commented 8 years ago

https://github.com/torvalds/linux/blob/097f70b3c4d84ffccca15195bdfde3a37c0a7c0f/mm/mmap.c#L1522-L1536
https://github.com/torvalds/linux/blob/097f70b3c4d84ffccca15195bdfde3a37c0a7c0f/mm/mprotect.c#L274-L288
https://github.com/torvalds/linux/blob/097f70b3c4d84ffccca15195bdfde3a37c0a7c0f/mm/madvise.c#L259-L288

eschnett commented 8 years ago

I confirm that this change (setting REGION_PG_COUNT to 8*4096) corrects the build problem on Stampede.
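For readers following the numbers, here is the arithmetic as a short Python sketch. It takes three figures from this thread: the region was 8GB, the fix is to "remove the 16*", and the resulting count is 8*4096. The original count of 16*8*4096 pages and the page size that falls out of it are inferences from those figures, not values read from gc.c.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumption: the original constant was 16 * 8 * 4096 pages, inferred from
# "remove the 16*" and from the reduced value of 8 * 4096.
ORIGINAL_PAGE_COUNT = 16 * 8 * 4096
REDUCED_PAGE_COUNT = 8 * 4096

# If the original region was 8 GiB, each page must have this many bytes.
implied_page_size = 8 * GIB // ORIGINAL_PAGE_COUNT

# Size of one region after the change.
reduced_region_size = REDUCED_PAGE_COUNT * implied_page_size

if __name__ == "__main__":
    print(implied_page_size)           # 16384
    print(reduced_region_size // MIB)  # 512
```

Under those assumptions the reduced region is 512 MiB, a sixteenth of the original reservation.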

I suggest making this the default and backporting it to 0.4, as this solves a show-stopping problem on some systems. (Other solutions would work as well, e.g. checking whether the memory could be allocated and backing off if it can't be allocated, or calling ulimit to find out how much to allocate, etc.) However, I think it's important that Julia works "out of the box".
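The "back off if it can't be allocated" idea could look roughly like the Python sketch below: try the large reservation first and halve the request until the mapping succeeds or a floor is reached. The function names, the halving policy and the 64 MiB floor are all assumptions for illustration, not a description of what any Julia version does.

```python
import mmap

GIB = 1024 ** 3
MIB = 1024 ** 2


def try_mmap(size):
    """Attempt one anonymous private mapping; return None if the OS refuses."""
    try:
        return mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    except (OSError, ValueError, OverflowError):
        return None


def reserve_with_backoff(initial_size, minimum_size, try_reserve=try_mmap):
    """Halve the request until try_reserve succeeds; return (region, size)."""
    if minimum_size <= 0:
        raise ValueError("minimum_size must be positive")
    size = initial_size
    while size >= minimum_size:
        region = try_reserve(size)
        if region is not None:
            return region, size
        size //= 2
    raise MemoryError("could not reserve even %d bytes" % minimum_size)


if __name__ == "__main__":
    # Simulate an environment that refuses anything above 5 GB.
    def limited(size):
        return "region" if size <= 5 * 10 ** 9 else None

    _, got = reserve_with_backoff(8 * GIB, 64 * MIB, try_reserve=limited)
    print(got // GIB)  # 4
```

Passing the reservation function as a parameter keeps the policy testable without depending on the limits of the machine running it.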

Touching pages isn't very fast on Linux anyway. I find that you can rarely achieve more than a few GByte per second. (I'm talking e.g. about calling memset the first time right after malloc; calling memset the second time is much faster.) Given this, the overhead of an mmap call every 0.5 GByte doesn't seem too important.

Regarding setting strict virtual memory limits: This does make sense on HPC systems for various reasons. Most importantly, they exist, and they won't go away -- this is part of the policy one has to accept when using large HPC systems. I'd prefer for Julia to "just work" under such circumstances.

tkelman commented 8 years ago

rebase and review https://github.com/JuliaLang/julia/pull/13968

waTeim commented 8 years ago

I am also suffering from an aspect of this problem. Not from ulimit, but rather trying to run on a system with 512M of memory and no swap. Setting overcommit_memory to 1 tends to help but not always. This is for ARM. Julia starts out with a VM size set to 620M. Though this is not the same problem as the one reported in this issue, I noted that similar ones are being closed and marked duplicate, so here I am.

awd97 commented 8 years ago

I have access to three clusters at different universities and I had to set REGION_PG_COUNT to 8*4096 to get Julia working on two of them. I'd therefore also really be grateful if this issue was resolved!

tkelman commented 8 years ago

@carnaval @yuyichao what would the performance consequences be of just lowering that default value by a factor of 2? 4? 8? Whatever brings us in the realm of "below ulimits people seem to be hitting"

yuyichao commented 8 years ago

I'll have a look at the performance impact of related GC changes this weekend.

waTeim commented 8 years ago

As for me, this is happening on ARM for unclear reasons. The GC memory space is currently not expandable, I take it. If it were, I think minimal low-end systems could afford a heap size of at least 64M without an issue, while expecting a size approaching 1G is ridiculous. Somewhere in between is the target.

Additionally, I request that this be configurable via the library (jl_init or similar), rather than only by running the julia executable.

yuyichao commented 8 years ago

The arm issue is completely different. This is only an issue for those who cannot control the virtual address limit. The amount of physical memory is irrelevant here.

waTeim commented 8 years ago

Well since the error message is the same it at least seems related. Are you saying this happens not because of the size of the allocation but the location? These aren't related? The previous discussion made it sound that people were having problems because the system prevented oversubscription which seems to indicate a problem with size. Well if that's the case, then that kind of makes sense too; it is true that there is a lack of OS support for 64-bit virtual addresses.

r-barnes commented 8 years ago

Attempting to compile on XSEDE's Comet raised this error. Removing 16* from gc.c allowed compilation to continue.

eschnett commented 8 years ago

Comet's front end has a severely restricted memory limit setting (ulimit). You can only allocate 2 GByte. The solution is to request a compute node interactively, and build there:

/share/apps/compute/interactive/qsubi.bash -p debug --nodes=1 --ntasks-per-node=24 -t 00:30:00 --export=ALL

floswald commented 8 years ago

Hi all, is this going to be backported to 0.4.x at some point? I'm stuck with this problem on a cluster. thanks!

tkelman commented 8 years ago

https://github.com/JuliaLang/julia/pull/16385 was a pretty large change, I'm not sure whether it can be easily backported. Are you building from source or using binaries? If the former, just change the number in the code and recompile. If the latter, I guess we could trigger an unofficial build with a smaller value.