cschwan / sage-on-gentoo

(Unofficial) Gentoo Overlay for Sage- and Sage-related ebuilds

JRE "Out of Memory Error" when running doctests #622

Closed strogdon closed 3 years ago

strogdon commented 3 years ago

I have lots of Java Runtime errors when testing sog 9.3.beta5. They were not present, as far as I know, with 9.3.beta4. I also have this with vanilla and it's been present there for some time. I hadn't noticed it because there was no apparent failure. In any event this is fairly new. The head of hs_err_pidxxxxx.log:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32744 bytes for ChunkPool::allocate
# Possible reasons:
#   The system is out of physical RAM or swap space
#   The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
#   JVM is running with Unscaled Compressed Oops mode in which the Java heap is
#     placed in the first 4GB address space. The Java Heap base address is the
#     maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
#     to set the Java Heap base and to place the Java Heap above 4GB virtual address.
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (allocation.cpp:272), pid=10038, tid=0x00007fac05265640
#
# JRE version: OpenJDK Runtime Environment (8.0_272-b10) (build 1.8.0_272-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.272-b10 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

kiwifb commented 3 years ago

The only thing I know for sure that calls java is our friend jmol. If it is new, then some new plots must be involved. I'll note that vanilla is seriously working on dumping jmol; I wouldn't even be too surprised if it became optional in 9.3.

strogdon commented 3 years ago

I know you are officially away so no need to answer. I thought of jmol and I knew it could possibly be optional, but if I recall, it's needed for the pdf docs?

kiwifb commented 3 years ago

Yes, I gave myself until tomorrow to answer :) Anyway, most 3D plots in the html and pdf docs are done through jmol, which is the default plotter. threejs has the problem that you cannot do non-interactive saving to a file. But jsmol may just bring all the functionality of jmol without java, although since it is javascript I am not sure how it works without at least a web browser in the background.

strogdon commented 3 years ago

This is preliminary, I need to run the full doctests again. There is a JRE issue somewhere in sage/categories with tp -2. I have 4 threads. There are no issues with tp -1. This is with openjdk-bin:8. I switched to icedtea-bin:8 and there are no JRE issues with tp -5.

strogdon commented 3 years ago

Corrected: above it should have been icedtea-bin:8. Doctests completed using tp -5 without a JRE issue when selecting java-vm to be icedtea-bin-8. Not sure what has changed. I wonder what other distros use?

kiwifb commented 3 years ago

I believe there is a slow move to openjdk, but Debian may still be shipping icedtea by default since they tend to stick to the binaries they build themselves.

strogdon commented 3 years ago

icedtea-bin is not the complete solution with vanilla sage, although there appear to be fewer failures than with openjdk-bin. From the failure log:

Internal exceptions (2 events):
Event: 0.072 Thread 0x00007fa65000a000 Exception <a 'java/lang/NoSuchMethodError': Method sun.misc.Unsafe.defineClass(Ljava/lang/String;[BII)Ljava/lang/Class; name or signature does not match> (0x00000000dda07cc8) thrown at [/var/tmp/portage/dev-java/icedtea-3.16.0/work/icedtea-3.16.0/openjdk/
Event: 0.072 Thread 0x00007fa65000a000 Exception <a 'java/lang/NoSuchMethodError': Method sun.misc.Unsafe.prefetchRead(Ljava/lang/Object;J)V name or signature does not match> (0x00000000dda07fb0) thrown at [/var/tmp/portage/dev-java/icedtea-3.16.0/work/icedtea-3.16.0/openjdk/hotspot/src/share/

Not sure what the following means

vm_info: OpenJDK 64-Bit Server VM (25.252-b09) for linux-amd64 JRE (1.8.0_252-b09), built on May 10 2020 20:29:23 by "portage" with gcc 9.2.0

kiwifb commented 3 years ago

Feels like a language version mismatch. That kind of error is usually thrown when the arguments of a function do not have the expected number or the expected type. Because java, like C++, is object oriented, you can have function polymorphism: the same name can refer to slightly different methods depending on the arguments and the return type. The way to figure out which method is used is to compare the so-called "signature" of the different methods with the one you are trying to call.

So, because java is based on a runtime, I think something is missing compared to the version the program was written for.
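The strings in the failure log, such as `(Ljava/lang/String;[BII)Ljava/lang/Class;`, are JVM method descriptors: the encoded form of a signature that the runtime compares when it looks up a method. As a reading aid, here is a minimal Python sketch that decodes such a descriptor into readable parameter and return types. The helper names are made up for this illustration and are not part of Sage or of any JDK tool.

```python
# Hypothetical helper (illustration only): decode a JVM method descriptor
# such as "(Ljava/lang/Object;J)V" into readable Java type names.

PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Read one type starting at desc[pos]; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    code = desc[pos]
    if code == "L":                  # object type: L<class name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    elif code in PRIMITIVES:
        name = PRIMITIVES[code]
        pos += 1
    else:
        raise ValueError(f"unexpected character {code!r} at position {pos}")
    return name + "[]" * dims, pos


def decode_descriptor(desc):
    """Return (parameter_types, return_type) for a method descriptor."""
    if not desc.startswith("("):
        raise ValueError("a method descriptor starts with '('")
    pos = 1
    params = []
    while desc[pos] != ")":
        name, pos = _read_type(desc, pos)
        params.append(name)
    ret, pos = _read_type(desc, pos + 1)
    if pos != len(desc):
        raise ValueError("trailing characters after the return type")
    return params, ret


if __name__ == "__main__":
    # The two descriptors reported in the failure log.
    for d in ("(Ljava/lang/String;[BII)Ljava/lang/Class;",
              "(Ljava/lang/Object;J)V"):
        print(d, "->", decode_descriptor(d))
```

Decoded this way, the first entry is a lookup of `defineClass(String, byte[], int, int)` returning `Class`, and the second is `prefetchRead(Object, long)` returning `void`. A `NoSuchMethodError` then means the running class has no method with exactly that name and descriptor.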

strogdon commented 3 years ago

Installed openjdk instead of openjdk-bin and I see no JRE failures when testing vanilla. I'll now try with s-o-g. It was a chore to build openjdk. There was a filesize mismatch in downloading openjdk-8.272_p10.tar.bz2.

kiwifb commented 3 years ago

That was brave to build openjdk. The fact that the errors are linked to minor version differences is also quite worrying.

strogdon commented 3 years ago

It's taken a while. Testing s-o-g with openjdk seems good. No JRE issues. The openjdk-bin above seems to have been built with gcc-9.2? That plus your comments prompted the build of openjdk. Since openjdk seems to work here for now (cross fingers) I will not try to build icedtea since it is not stable.

strogdon commented 3 years ago

When s-o-g 9.3.beta6 is available I will run doctests with system built openjdk. But on vanilla 9.3.beta6 I get

----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 7022.5 seconds
    cpu time: 24431.2 seconds
    cumulative wall time: 34315.8 seconds

which is encouraging.

kiwifb commented 3 years ago

I cannot currently build sage from sage-on-gentoo because of a sandbox violation from jmol:

F: mkdir
S: deny
P: /var/lib/portage/home/.java
A: /var/lib/portage/home/.java
R: /var/lib/portage/home/.java
C: /opt/icedtea-bin-3.16.0/bin/java -Xmx512m -Djava.awt.headless=true -jar /usr/share/sage-jmol-bin/lib/JmolData.jar -iox -g 500x500 -J set defaultdirectory "/dev/shm/portage/sci-mathematics/sage-9999/homedir/.sage/temp/localhost/6946/dir_jizvbnht/scene.spt.zip"
script SCRIPT
 -j write PNG '/dev/shm/portage/sci-mathematics/sage-9999/homedir/.sage/temp/localhost/6946/dir_jizvbnht/preview.png'

This is repeated probably as many times as there are 3D plots. It is probably a bug in the java handling mechanism or something that needs to be set.

strogdon commented 3 years ago

Which branch are you using?

kiwifb commented 3 years ago

vbraun, so I am a bit ahead. But I suspect the issue is outside sage-on-gentoo; it has actually been going on for a while, but this is the first time I have been affected as root. So I suspect this is a java configuration problem. I have seen similar issues in bugzilla; https://bugs.gentoo.org/762619 has a strikingly similar sandbox problem.

strogdon commented 3 years ago

I tried to build the vbraun branch and dev-python/cypari2-2.1.2 would not build. There are problems locating a bunch of .pxd files, for example:

from __future__ import absolute_import, division, print_function

from cysignals.signals cimport sig_on, sig_off, sig_block, sig_unblock, sig_error
^
------------------------------------------------------------

cypari2/closure.pyx:36:0: 'cysignals/signals.pxd' not found

Did you do anything special? Perhaps I should just wait.

kiwifb commented 3 years ago

I'll need a more complete log. And no, I didn't need to do anything special.

strogdon commented 3 years ago

Are you using just python3.8? The failure here is with python3.9. The build using python3.8, I suppose for building in parallel, did not complete.

kiwifb commented 3 years ago

I have built with python 3.7, 3.8 and 3.9. It shouldn't build in parallel.

strogdon commented 3 years ago

I tried separately with each python (3.8 and 3.9). It builds with python3.8 but not with python3.9. The parallel build terminated because of the python3.9 failure and the build with python3.8 was not complete. I'll send the build log which is not very long.

strogdon commented 3 years ago

I think cysignals was not built for python3.9!
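A quick way to confirm this per interpreter is to look for the file Cython complained about on that interpreter's module search path. The following is a minimal sketch, assuming that cysignals installs `cysignals/signals.pxd` next to its Python modules; the function name `find_pxd` is made up for this example. Run it once with python3.8 and once with python3.9.

```python
# Sketch: check whether this interpreter can see cysignals' .pxd files.
# Run it once per Python version, e.g. with python3.8 and with python3.9.
import os
import sys


def find_pxd(relative_path, search_path=None):
    """Return the first directory on search_path that contains relative_path.

    Defaults to sys.path; returns None when the file is not found anywhere.
    """
    for directory in (sys.path if search_path is None else search_path):
        if directory and os.path.isfile(os.path.join(directory, relative_path)):
            return directory
    return None


if __name__ == "__main__":
    target = os.path.join("cysignals", "signals.pxd")
    found = find_pxd(target)
    version = "%d.%d" % sys.version_info[:2]
    if found:
        print(f"python{version}: {target} found in {found}")
    else:
        print(f"python{version}: {target} not found; "
              "cysignals is probably not installed for this version")
```

If the file shows up under python3.8 but not under python3.9, the cypari2 failure is a missing per-version dependency rather than a problem in cypari2 itself.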

kiwifb commented 3 years ago

That would explain it. But why was it not built as a dependency? Looking at the ebuild.

kiwifb commented 3 years ago

Hum, dependencies in the cypari2 ebuilds are not correct and I am not sure why. I'll get it fixed shortly.

kiwifb commented 3 years ago

Dependency problems in cypari2 fixed. Inspecting others.

kiwifb commented 3 years ago

OK, embarrassing missing python dependencies have been fixed.

strogdon commented 3 years ago

I've been able to build the vbraun branch without issue.

strogdon commented 3 years ago

I should mention that the html-docs and pdf-docs were also built.

kiwifb commented 3 years ago

Whatever I try, I still get sandbox violations around here. I wonder if I got cruft somewhere causing this.

strogdon commented 3 years ago

I'm not using a ramdisk. I don't have enough ram. I wouldn't think that would be an issue?

kiwifb commented 3 years ago

That would be a new one. I am going to try openjdk to see if it behaves better.

strogdon commented 3 years ago

Off topic, not JRE related:

I have one doctest failure with 9.3.beta6 (vbraun branch) that was not in 9.3.beta5 that seems odd

sage -t --long --random-seed=0 usr/lib/python3.8/site-packages/sage/libs/ecl.pyx  # Timed out (and interrupt failed)

When run individually, it runs forever and just fails.

kiwifb commented 3 years ago

Definitely odd.

kiwifb commented 3 years ago

Using --verbose may help figure out where it stops.

strogdon commented 3 years ago

I think the location of the failure varies. When doctesting I see

sage -t --long --random-seed=0 usr/lib/python3.8/site-packages/sage/libs/ecl.pyx
    Timed out (and interrupt failed)
**********************************************************************
Tests run before process (pid=26592) timed out:
sage: from sage.libs.ecl import test_sigint_before_ecl_sig_on ## line 121 ##
sage: test_sigint_before_ecl_sig_on() ## line 122 ##
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 126 ##
0
sage: from sage.libs.ecl import test_ecl_options ## line 143 ##
sage: test_ecl_options() ## line 144 ##
ECL_OPT_INCREMENTAL_GC = 0
ECL_OPT_TRAP_SIGSEGV = 1
ECL_OPT_TRAP_SIGFPE = 1
ECL_OPT_TRAP_SIGINT = 1
ECL_OPT_TRAP_SIGILL = 1
ECL_OPT_TRAP_SIGBUS = 1
ECL_OPT_TRAP_SIGPIPE = 1
ECL_OPT_TRAP_INTERRUPT_SIGNAL = 1
ECL_OPT_SIGNAL_HANDLING_THREAD = 0
ECL_OPT_SIGNAL_QUEUE_SIZE = 16
ECL_OPT_BOOTED = 1
ECL_OPT_BIND_STACK_SIZE = 8192
ECL_OPT_BIND_STACK_SAFETY_AREA = 1024
ECL_OPT_FRAME_STACK_SIZE = 2048
ECL_OPT_FRAME_STACK_SAFETY_AREA = 128
ECL_OPT_LISP_STACK_SIZE = 32768
ECL_OPT_LISP_STACK_SAFETY_AREA = 128
ECL_OPT_C_STACK_SIZE = 0
ECL_OPT_C_STACK_SAFETY_AREA = 32768
ECL_OPT_HEAP_SIZE = 4294967296
ECL_OPT_HEAP_SAFETY_AREA = 1048576
ECL_OPT_THREAD_INTERRUPT_SIGNAL = 36
ECL_OPT_SET_GMP_MEMORY_FUNCTIONS = 0
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 168 ##
0
sage: from sage.libs.ecl import * ## line 226 ##
sage: init_ecl() ## line 231 ##
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 235 ##
0
sage: from sage.libs.ecl import * ## line 321 ##
sage: from cysignals.tests import interrupt_after_delay ## line 322 ##
sage: ecl_eval("(setf i 0)") ## line 323 ##
<ECL: 0>
sage: inf_loop = ecl_eval("(defun infinite() (loop (incf i)))") ## line 325 ##
sage: interrupt_after_delay(1000) ## line 326 ##
sage: inf_loop() ## line 327 ##

**********************************************************************

I don't have the result from doctesting individually, but it stopped at a different location.

strogdon commented 3 years ago

It may be an obscure failure. I just tested again and the doctest passed

sage -t --long --warn-long 132.3 --random-seed=0 usr/lib/python3.8/site-packages/sage/libs/ecl.pyx
    [204 tests, 1.74 s]
----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 1.8 seconds
    cpu time: 2.1 seconds
    cumulative wall time: 1.7 seconds

It is strange.

strogdon commented 3 years ago

Running the ecl.pyx doctest a number of times it eventually hangs. When that happens I have several sage processes

terry     7101     1  0 21:56 pts/25   00:00:00 /storage/strogdon/gentoo-rap/usr/bin/python3.8 /storage/strogdon/gentoo-rap/usr/lib/python-exec/python3.8/sage-cleaner
terry     7282  6719  1 21:57 pts/25   00:00:02 /storage/strogdon/gentoo-rap/usr/bin/python3.8 /storage/strogdon/gentoo-rap/usr/lib/python-exec/python3.8/sage-runtests --long --warn-long 132.3 --random-seed=0 usr/lib/python3.8/site-packages/sage/libs/ecl.pyx
terry     7284  7282  0 21:57 pts/25   00:00:00 [sage-cleaner] <defunct>
terry     7293  7282  1 21:57 pts/25   00:00:01 /storage/strogdon/gentoo-rap/usr/bin/python3.8 /storage/strogdon/gentoo-rap/usr/lib/python-exec/python3.8/sage-runtests --long --warn-long 132.3 --random-seed=0 usr/lib/python3.8/site-packages/sage/libs/ecl.pyx

Perhaps sage-cleaner is not doing its job.
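Since the hang only shows up after a number of runs, it can help to automate the repetition. Below is a minimal Python sketch that reruns a command until one run exceeds a timeout or exits with an error. The function name `run_until_hang` and the default values are made up for this example.

```python
# Sketch: rerun a command until one run hangs (exceeds a timeout) or fails.
import subprocess
import sys


def run_until_hang(cmd, attempts=20, timeout=120):
    """Run cmd up to `attempts` times.

    Returns (attempt_number, reason) for the first run that times out or
    exits with a non-zero status, or None if every run passed.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child on timeout; processes
            # that child started may still be around and can be inspected.
            return attempt, "timeout"
        if result.returncode != 0:
            return attempt, f"exit status {result.returncode}"
    return None
```

With the command line from the run above this would be called as `run_until_hang(["sage", "-t", "--long", "--random-seed=0", "usr/lib/python3.8/site-packages/sage/libs/ecl.pyx"])`. A result such as `(7, "timeout")` would mean the seventh run hung, which is the moment to look at the process list.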

kiwifb commented 3 years ago

What about ecl processes? There could be a lisp instance hanging for whatever reason.

strogdon commented 3 years ago

It takes a number of tries before it hangs and I don't see any ecl processes, only that the doctest appears twice as above. And when it finally exits, the output is as above.

strogdon commented 3 years ago

This has all been with the current vbraun branch. Is master in sync with 9.3.beta6?

kiwifb commented 3 years ago

Yes, master is in sync with 9.3.beta6, with the exception of the cypari2 dependency I think, which shouldn't have any impact.

kiwifb commented 3 years ago

For the record, I have now enabled py3.9 in the vbraun branch and I am able to build the documentation as a user again. Hopefully that will stick when using emerge rather than ebuild. But that definitely stinks.

strogdon commented 3 years ago

I just noticed that about an hour ago. I'm now trying to build sage using py3.9 on the master branch in prefix. The html-docs are now building.

kiwifb commented 3 years ago

Still failing as root :(

strogdon commented 3 years ago

With building openjdk?

kiwifb commented 3 years ago

No, just building sage. Still those sandbox violations:

* ACCESS DENIED:  mkdir:        /var/lib/portage/home/.java/fonts

strogdon commented 3 years ago

Not very clear was I. Was sage built with openjdk or openjdk-bin?

kiwifb commented 3 years ago

This time it was openjdk-bin.

strogdon commented 3 years ago

Sage and all docs build here with py3.9 on Prefix and on Gentoo as root. I'm using openjdk.

strogdon commented 3 years ago

If fonts need to be generated during the build of Sage where will they be located? I don't see any generated fonts here. I guess they could have been deleted.

kiwifb commented 3 years ago

In the normal process of things, HOME is set to ${PORTAGE_BUILDDIR}/homedir during a build. So it all disappears once merged. The issue I have is that java is using portage's normal home instead of following the HOME variable.
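One possible explanation, offered as an assumption rather than something confirmed in this thread: on Linux the JVM derives its `user.home` property from the account's passwd entry and not from the `HOME` environment variable. A build that only overrides `HOME` would then still see java write under the portage user's real home. The Python sketch below compares the two values and, when they differ, builds a `-Duser.home=` argument as a possible workaround. The function names are hypothetical.

```python
# Sketch: compare $HOME with the passwd home directory of the current user.
# Assumption (not confirmed in this thread): the JVM takes user.home from
# the passwd entry, so the two can differ inside a portage build.
import os
import pwd


def home_directories():
    """Return (HOME from the environment, home from the passwd database)."""
    env_home = os.environ.get("HOME")
    try:
        passwd_home = pwd.getpwuid(os.getuid()).pw_dir
    except KeyError:
        passwd_home = None   # the current uid has no passwd entry
    return env_home, passwd_home


def java_home_override(env_home, passwd_home):
    """Return extra java arguments pointing user.home at env_home, if needed."""
    if env_home and env_home != passwd_home:
        return [f"-Duser.home={env_home}"]
    return []


if __name__ == "__main__":
    env_home, passwd_home = home_directories()
    print("HOME        :", env_home)
    print("passwd home :", passwd_home)
    print("extra args  :", java_home_override(env_home, passwd_home))
```

Run inside the build environment, a difference between the two lines would be consistent with the `mkdir` denials under /var/lib/portage/home/.java. Whether the extra argument can be passed through to the jmol invocation is a separate question.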

Do you have systemd on your gentoo system? And acct-group/portage and acct-user/portage?