Closed strogdon closed 7 years ago
Is your storage nfs or some other kind of network storage?
Also run ldd -r on one of these to see if anything is coming from outside the prefix.
There are suggestions that the storage is mounted with noexec, but then you would have way more errors. Permissions could be an issue, but that would be weird at this stage.
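One way to rule out noexec is to look up the mount options of the filesystem that holds the prefix. The following is only a minimal sketch, assuming Linux and the usual space-separated /proc/mounts layout; the helper name mount_options is made up for illustration.

```python
import os

def mount_options(path, mounts_text):
    """Return the option list of the mount point that contains `path`.

    `mounts_text` is the content of /proc/mounts: one mount per line,
    fields separated by spaces (device, mount point, fstype, options, ...).
    """
    path = os.path.abspath(path)
    best_point, best_opts = "", []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        point, opts = fields[1], fields[3].split(",")
        # Keep the longest mount point that is a parent of `path`.
        if path == point or path.startswith(point.rstrip("/") + "/"):
            if len(point) > len(best_point):
                best_point, best_opts = point, opts
    return best_opts
```

Calling mount_options("/storage/strogdon/gentoo-rap", open("/proc/mounts").read()) and checking whether "noexec" is in the result would answer the question directly.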
login is mounted via nfs but the prefix is stored on a local drive. The drive is only about 42% full - 500 GB. The more frequent failure is glpk_backend.so.
ldd -r /storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/numerical/backends/glpk_backend.so
linux-vdso.so.1 (0x00007ffe9c948000)
libglpk.so.40 => /storage/strogdon/gentoo-rap/usr/lib64/libglpk.so.40 (0x00007fa0fea36000)
libpython2.7.so.1.0 => /storage/strogdon/gentoo-rap/usr/lib64/libpython2.7.so.1.0 (0x00007fa0fe5f2000)
libc.so.6 => /storage/strogdon/gentoo-rap/lib64/libc.so.6 (0x00007fa0fe240000)
libgmp.so.10 => /storage/strogdon/gentoo-rap/usr/lib64/libgmp.so.10 (0x00007fa0fdfa2000)
libz.so.1 => /storage/strogdon/gentoo-rap/usr/lib64/libz.so.1 (0x00007fa0fdd85000)
libcolamd.so.0 => /storage/strogdon/gentoo-rap/usr/lib64/libcolamd.so.0 (0x00007fa0fdb7a000)
libamd.so.0 => /storage/strogdon/gentoo-rap/usr/lib64/libamd.so.0 (0x00007fa0fd96e000)
libm.so.6 => /storage/strogdon/gentoo-rap/lib64/libm.so.6 (0x00007fa0fd657000)
/storage/strogdon/gentoo-rap/lib64/ld-linux-x86-64.so.2 (0x00007fa0fefb6000)
libpthread.so.0 => /storage/strogdon/gentoo-rap/lib64/libpthread.so.0 (0x00007fa0fd437000)
libdl.so.2 => /storage/strogdon/gentoo-rap/lib64/libdl.so.2 (0x00007fa0fd233000)
libutil.so.1 => /storage/strogdon/gentoo-rap/lib64/libutil.so.1 (0x00007fa0fd030000)
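Checking such output by eye gets tedious when several extension modules are involved. As a rough sketch, the ldd output can be filtered for libraries resolved outside the prefix. The function name libs_outside_prefix is hypothetical, and the parsing assumes the "name => path (address)" layout shown above.

```python
def libs_outside_prefix(ldd_output, prefix):
    """Return (name, path) pairs that ldd resolved outside `prefix`,
    plus names reported as "not found" (path is None)."""
    prefix = prefix.rstrip("/") + "/"
    suspects = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if "=>" not in line:
            # linux-vdso and the bare dynamic-loader line have no "=>".
            continue
        name, _, target = line.partition("=>")
        name, target = name.strip(), target.strip()
        if target.startswith("not found"):
            suspects.append((name, None))
            continue
        # Drop the trailing load address, e.g. "(0x00007f...)".
        path = target.split(" (")[0].strip()
        if path.startswith("/") and not path.startswith(prefix):
            suspects.append((name, path))
    return suspects
```

For the output above, with prefix /storage/strogdon/gentoo-rap, this returns an empty list: every library resolves inside the prefix.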
ls -al /
total 181
drwxr-xr-x 26 root root 4096 Mar 7 2017 .
drwxr-xr-x 26 root root 4096 Mar 7 2017 ..
drwxr-xr-x 2 root root 4096 Feb 27 2017 bin
drwxr-xr-x 5 root root 1024 Apr 28 07:54 boot
drwxr-xr-x 15 root root 3320 Sep 2 21:11 dev
drwxr-xr-x 146 root root 12288 Sep 12 21:32 etc
drwxr-xr-x 28 root root 4096 Sep 13 2013 home
lrwxrwxrwx 1 root root 30 Sep 13 2013 initrd.img -> /boot/initrd.img-3.2.0-4-amd64
drwxr-xr-x 16 root root 4096 Sep 13 2013 lib
drwxr-xr-x 2 root root 4096 May 30 2016 lib64
drwxr-xr-x 3 root root 4096 Dec 14 2015 local
drwx------ 2 root root 16384 Sep 13 2013 lost+found
drwxr-xr-x 5 root root 4096 Oct 26 2016 media
drwxr-xr-x 2 root root 4096 Aug 26 2016 mnt
drwxr-xr-x 2 root root 4096 Sep 13 2013 opt
dr-xr-xr-x 209 root root 0 Aug 4 18:32 proc
drwx------ 17 root root 4096 Aug 28 13:58 root
drwxr-xr-x 23 root root 860 Aug 5 07:48 run
drwxr-xr-x 2 root root 12288 May 10 08:02 sbin
drwxr-xr-x 2 root root 4096 Jun 10 2012 selinux
drwxr-xr-x 2 root root 4096 Sep 13 2013 srv
drwxrwxrwt 17 root root 4096 Jul 4 20:14 storage
drwxr-xr-x 13 root root 0 Aug 4 18:32 sys
drwxrwxrwt 47 root root 69632 Sep 12 21:45 tmp
drwxr-xr-x 3 root root 4096 Sep 13 2013 umdweb
drwxr-xr-x 10 root root 4096 Sep 13 2013 usr
drwxr-xr-x 12 root root 4096 Sep 24 2013 var
lrwxrwxrwx 1 root root 26 Sep 13 2013 vmlinuz -> boot/vmlinuz-3.2.0-4-amd64
When doctesting there is a failure in sage/graphs/generic_graph.py at line 5009. From the sage prompt
sage: C=graphs.CubeGraph(3)
sage: C.planar_dual()
Graph on 6 vertices
No failure.
I have reinstalled 8.1.beta4 and
sage -t --long /storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/graphs/generic_graph.py
too many failed tests, not using stored timings
Running doctests with ID 2017-09-14-00-17-01-8c4cece5.
Using --optional=optional,sage
Doctesting 1 file.
sage -t --long usr/lib64/python2.7/site-packages/sage/graphs/generic_graph.py
[3155 tests, 96.43 s]
----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 97.7 seconds
cpu time: 75.0 seconds
cumulative wall time: 96.4 seconds
This doctest has numerous failures with 8.1.beta5. The only package reinstalled is 8.1.beta4. There must be something in sage that affects prefix. One more (see above failures):
sage -t --long /storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/geometry/polyhedron/base.py
too many failed tests, not using stored timings
Running doctests with ID 2017-09-14-00-12-50-64fa00d8.
Using --optional=optional,sage
Doctesting 1 file.
sage -t --long usr/lib64/python2.7/site-packages/sage/geometry/polyhedron/base.py
[935 tests, 80.43 s]
----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 80.9 seconds
cpu time: 72.2 seconds
cumulative wall time: 80.4 seconds
This ticket https://trac.sagemath.org/ticket/23748 alone is not sufficient to explain my errors but it may be an indicator of the problem. With only the patches from https://trac.sagemath.org/ticket/23748 on top of 8.1.beta4 I get
sage -t --long usr/lib64/python2.7/site-packages/sage/graphs/generic_graph.py
Process DocTestWorker-1:
Traceback (most recent call last):
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/doctest/forker.py", line 1916, in run
task(self.options, self.outtmpfile, msgpipe, self.result_queue)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/doctest/forker.py", line 2247, in __call__
result_queue.put(result, False)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/multiprocessing/queues.py", line 107, in put
self._start_thread()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/multiprocessing/queues.py", line 195, in _start_thread
self._thread.start()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/threading.py", line 736, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
Bad exit: 1
**********************************************************************
and then numerous MemoryErrors of the type
File "usr/lib64/python2.7/site-packages/sage/graphs/generic_graph.py", line 4344, in sage.graphs.generic_graph.GenericGraph.?
Failed example:
g439.show()
Exception raised:
Traceback (most recent call last):
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/doctest/forker.py", line 518, in _run
self.compile_and_execute(example, compiler, test.globs)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/doctest/forker.py", line 888, in compile_and_execute
exec(compiled, globs)
File "<doctest sage.graphs.generic_graph.GenericGraph.?[1]>", line 1, in <module>
g439.show()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/graphs/generic_graph.py", line 19340, in show
return self.graphplot(**plot_kwds).show(**kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/graphs/graph_plot.py", line 886, in show
self.plot().show(**kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/misc/decorators.py", line 483, in wrapper
return func(*args, **kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/plot/graphics.py", line 2013, in show
dm.display_immediately(self, **kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/repl/rich_output/display_manager.py", line 831, in display_immediately
plain_text, rich_output = self._rich_output_formatter(obj, rich_repr_kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/repl/rich_output/display_manager.py", line 623, in _rich_output_formatter
rich_output = self._call_rich_repr(obj, rich_repr_kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/repl/rich_output/display_manager.py", line 581, in _call_rich_repr
return obj._rich_repr_(self, **rich_repr_kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/plot/graphics.py", line 910, in _rich_repr_
self.save, kwds, file_ext, output_container)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/repl/rich_output/display_manager.py", line 711, in graphics_from_save
save_function(filename, **kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/misc/decorators.py", line 483, in wrapper
return func(*args, **kwds)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/sage/plot/graphics.py", line 3215, in save
figure.tight_layout()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/matplotlib/figure.py", line 1747, in tight_layout
renderer = get_renderer(self)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/matplotlib/tight_layout.py", line 219, in get_renderer
renderer = canvas.get_renderer()
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/matplotlib/backends/backend_agg.py", line 486, in get_renderer
self.renderer = RendererAgg(w, h, self.figure.dpi)
File "/storage/strogdon/gentoo-rap/usr/lib64/python2.7/site-packages/matplotlib/backends/backend_agg.py", line 93, in __init__
self._renderer = _RendererAgg(int(width), int(height), dpi, debug=False)
MemoryError: In RendererAgg: Out of memory
which I had seen when doctesting 8.1.beta5. When the generic_graph.py doctest does not fail, one can observe png files being created under $DOT_SAGE/temp/'computer_name'/*. With the patches from https://trac.sagemath.org/ticket/23748 nothing is created under $DOT_SAGE. As near as I can determine, https://trac.sagemath.org/ticket/23748 does not depend on anything else.
All the memory errors should come directly from that ticket. I am half expecting a slew of people eventually complaining. The other problem in generic_graph is probably a side effect. At least that ticket may showcase memory management problems in sage.
8.1.beta5 minus the patches seems to function properly here. My guess is that this issue is not prefix-specific. Prefix only exposed it.
Hum... We'll see what happens, but there might be a case for this to be reverted, at least in sage-on-gentoo. I'll do a follow-up ticket. It is possible there are interactions with other stuff like ipython 5.4.x.
Ah, I forgot about ipython 5.4.x. I wonder if Debian has seen this. Don't they use ipython 5.4?
Yup. So far I haven't heard anything from them, even though I have troubles that I think are related and that I reported. But I am not sure anyone follows upstream as closely as I do.
This may be part of the problem. In sage-runtests there is
import resource
lim, hard = resource.getrlimit(resource.RLIMIT_AS)
if lim == resource.RLIM_INFINITY or lim > memlimit:
    resource.setrlimit(resource.RLIMIT_AS, (memlimit, hard))
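The condition in that snippet can be isolated as a pure function, which makes it easier to see when the soft limit gets replaced. This is an illustrative restatement rather than Sage's code: soft_limit_to_set is a hypothetical name, and the memlimit value is the 3300 << 20 mentioned in this thread.

```python
import resource

def soft_limit_to_set(lim, memlimit, infinity=resource.RLIM_INFINITY):
    """Mirror the quoted condition: return the new soft limit, or None
    when the current soft limit `lim` is already at or below `memlimit`."""
    if lim == infinity or lim > memlimit:
        return memlimit
    return None

memlimit = 3300 << 20  # 3300 MiB in bytes, the value discussed in this thread
```

With an unlimited soft limit (reported as -1 by the resource module on Linux), the function returns memlimit, so the process is capped. With a soft limit already below memlimit it returns None and nothing changes.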
In prefix:
>>> import resource
>>> lim, hard = resource.getrlimit(resource.RLIMIT_AS)
>>> lim
-1
>>> hard
-1
On gentoo:
>>> import resource
>>> lim, hard = resource.getrlimit(resource.RLIMIT_AS)
>>> lim
3460300800
>>> hard
-1
The call resource.setrlimit(resource.RLIMIT_AS, (memlimit, hard)) changes the first component (the soft limit) of resource.getrlimit(resource.RLIMIT_AS) to memlimit. In prefix, if memlimit is anything other than -1, things get really messed up.
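The effect of lowering the soft address-space limit can be reproduced outside Sage. The sketch below assumes Linux and Python 3.7 or later. It starts a child interpreter, lowers the child's soft RLIMIT_AS, and then attempts one allocation. try_allocation and the sizes used with it are illustrative choices, not anything taken from sage-runtests.

```python
import subprocess
import sys

CHILD = """
import resource, sys
limit, alloc = int(sys.argv[1]), int(sys.argv[2])
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))  # lower the soft limit only
try:
    block = bytearray(alloc)
    print("ok")
except MemoryError:
    print("MemoryError")
"""

def try_allocation(limit_bytes, alloc_bytes):
    """Run a child interpreter with a soft RLIMIT_AS of `limit_bytes` and
    report whether allocating `alloc_bytes` succeeds there."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(alloc_bytes)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

An allocation larger than the limit should come back as "MemoryError", the same exception class as in the tracebacks above. Since each new thread also needs address space for its stack, a process close to the limit could presumably hit "can't start new thread" for the same reason.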
ulimit -a on both machines?
Prefix:
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257289
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257289
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Gentoo:
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63534
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 63534
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I cannot remember whether ulimit reports the soft or the hard limit. You may have to check /etc/security/limits.conf.
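For reference, bash's ulimit prints the soft limit unless -H is given. Both values can also be read side by side from Python, independently of the shell. A small sketch, assuming Linux; the selection of resources is arbitrary and soft_and_hard is a hypothetical helper.

```python
import resource

def soft_and_hard(names=("RLIMIT_AS", "RLIMIT_RSS", "RLIMIT_NOFILE", "RLIMIT_STACK")):
    """Return {name: (soft, hard)} for the given resource names.

    A value equal to resource.RLIM_INFINITY means "unlimited".
    """
    return {name: resource.getrlimit(getattr(resource, name)) for name in names}

for name, (soft, hard) in soft_and_hard().items():
    print(name, soft, hard)
```

Running this in both the prefix Python and the host Python gives a direct comparison that does not depend on how the shell reports limits.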
Oh the fun: because you have glibc in prefix, the prefix may have its own limits.conf.
Well, on both machines everything is commented out in /etc/security/limits.conf.
I can't find a limits.conf under $EPREFIX/etc.
OK, so the defaults. I will have to dig up some documentation.
Prefix:
cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             257289               257289               processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       257289               257289               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Gentoo:
cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63534                63534                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63534                63534                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
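Comparing two such listings by eye is error-prone, so a small parser can help. The sketch below assumes the text comes straight from /proc/self/limits, where columns are padded with runs of spaces. parse_limits is a hypothetical helper.

```python
import re

def parse_limits(text):
    """Parse /proc/<pid>/limits text into {name: (soft, hard, units)}.

    Columns in the real file are padded with runs of spaces, so splitting
    on two or more spaces keeps multi-word names such as "Max cpu time" whole.
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 3:
            continue
        name, soft, hard = cols[:3]
        units = cols[3] if len(cols) > 3 else ""
        limits[name] = (soft, hard, units)
    return limits
```

Given the parsed dictionaries a and b from the two machines, {k for k in a if a[k] != b[k]} lists the rows that differ. For the two listings in this thread, those are only "Max processes" and "Max pending signals".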
So in prefix I'm wondering if something has been built where the build did not pick up on these values?
Actually the result of lim, hard = resource.getrlimit(resource.RLIMIT_AS) you report on gentoo is suspicious. The reported value for soft (lim) is exactly what sage is trying to set the limit to (3300 << 20). If I am not mistaken we are looking at "max resident set".
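Both points in this comment can be checked quickly. The arithmetic confirms that 3300 << 20 is the 3460300800 reported earlier. As for the row, on Linux RLIMIT_AS corresponds to the "Max address space" line of /proc/self/limits, while "Max resident set" belongs to RLIMIT_RSS. The mapping below is written out by hand for illustration, not read from any API.

```python
import resource

# 3300 MiB expressed in bytes, the value quoted in the comment above.
memlimit = 3300 << 20
print(memlimit)  # 3460300800

# Which /proc/self/limits row each constant corresponds to on Linux.
PROC_LIMITS_ROW = {
    resource.RLIMIT_AS: "Max address space",
    resource.RLIMIT_RSS: "Max resident set",
}
print(PROC_LIMITS_ROW[resource.RLIMIT_AS])  # Max address space
```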
Something is suspicious. I repeated things and I got the same results on gentoo as on prefix. I must have done something wrong. But changing the first component of resource.getrlimit(resource.RLIMIT_AS) does mess things up.
I think playing with ulimit that way is effectively dangerous unless you really know what you are doing. I think I will patch. It causes two failures in rings/integer.pyx for me in pure gentoo.
I'm back in business now!
Is it all fixed now, or is there still some stuff we have to look at?
I don't know of anything else. I think it is a mystery that the memory situation was exposed in Prefix. But I believe you had it on Gentoo. Close if you wish.
I have approximately 78 such failures as: