gap-system / gap

Main development repository for GAP - Groups, Algorithms, Programming, a System for Computational Discrete Algebra
https://www.gap-system.org
GNU General Public License v2.0
813 stars 161 forks source link

GAP on Cygwin 64-bit can die unexpectedly on some integer operations that use GMP #3434

Closed embray closed 5 years ago

embray commented 5 years ago

Observed behaviour

I first observed this behavior in the following case, with GAP built on 64-bit Cygwin using the system GMP package (not MPIR):

$ ./gap.exe -A -m 64m
 ┌───────┐   GAP 4.10dev-1885-gd0f592a of today
 │  GAP  │   https://www.gap-system.org
 └───────┘   Architecture: x86_64-unknown-cygwin-default64-kv6
 Configuration:  gmp 6.1.2, GASMAN, readline, KernelDebug
 Loading the library and packages ...
 Packages:   GAPDoc 1.6.2, PrimGrp 3.3.2, SmallGrp 1.3, TransGrp 2.0.4
 Try '??help' for help. See also '?copyright', '?cite' and '?authors'
gap> Bernoulli(1742);

$
$ echo $?
0

That is, when calling Bernoulli() GAP simply exits with exit code 0.

Expected behaviour

gap> Bernoulli(1742);
<integer 108...991 (3503 digits)>/6

Details

The exact combination of actions used to reproduce the problem are above are sufficient but not necessary to reproduce the problem. The use of flags -A and -m 64m are just so that relatively little memory is allocated to the GAP memory pool initially, and that any unnecessary packages are not loaded at startup, so that not much memory has already been allocated by the time Bernoulli is called.

The use of Bernoulli() in the example is also sufficient to reproduce the problem for two reasons: It memoizes its sub-results, so calling Bernoulli() with a large-ish argument results in eventual overflowing of the initial memory allocation. The other reason is that its computation uses large integer operations using GMP.

This bug was first introduced by #3335, which itself is a (still necessary) fix for different, more general problem. This bug is much narrower and very technical. I summarized it on the Slack channel thusly:

• The way mmap MAP_NORESERVE works on Cygwin is such that a virtual memory range is reserved by the OS, but it is not committed meaning in Windows terms that it does not reserve physical memory.

• This means that when you first try to access some page of memory in the range of a MAP_NORESERVE mmap it results in an access violation.

• Normally, Cygwin's exception handler catches these access violations, checks if it's in the range of one of its NORESERVE mmaps, and produces the required VirtualAlloc calls to mark the relevant page as committed. Execution resumes at the previously failed memory access and it works. Most of the time this works fine.

• However, it so happened that in this particular case some Bag was being allocated (in the mmap'd pool) at least parts of which (I'm not sure the details) are not accessed for the first time until somewhere in the bowels of one of GMP's assembly routines (in this case it is the first argument to mpn_divrem_1 which is the output argument, so it makes sense that parts of it might not have been accessed until then).

• Normally this should still work fine, but it happens that GMP's hand-written assembly functions do not implement the correct bits needed for Windows' exception handling to work properly. The result is that when the (expected) access violation occurs, stack unwinding does not work properly, and the Cygwin exception handler is never run. The process is terminated without any further exception handling.

Summary

The tl;dr is that GMP is the responsible party here, not GAP: It does correctly implement the bits needed for structured exception handling to work properly on Windows. I plan to make a bug report against GMP about this and will link to it here once I do.

However, the problem can be easily worked around in GAP by ensuring that all pages of memory used by a GAP Bag (either during or after a call to NewBag) are actually committed to memory before passing them to a GMP function.

@ChrisJefferson wrote:

I've thought about this some more overnight -- I'm tempted to just do it in NewBag -- maybe similar issues could arise in something else, and this way there is one place and we definitely fix the problem. Cygwin GAP is already slow for various reasons, so I don't mind too much if it gets a bit slower again

I agree with the temptation as adding it in NewBag would make this workaround fairly transparent. However, I think it could have a very noticeable performance impact, especially in contexts where many bags are being created, or when a very large bag is created, but not all of it used. This is a very narrow issue and not likely to arise in many other places: It is only when some memory is allocated to a Bag by NewBag that has not already previously been used, and that range of memory is passed as an argument to a hand-written assembly function that does not include the necessary SEH pseudo-ops. So I think a more focused approach would be better, though admittedly a wart.

fingolfin commented 5 years ago

This was resolved by PR #3435