OpenBSD support

gbluma commented 6 years ago

Just a thing to link commits to.

gbluma commented 6 years ago

What I find interesting about the OpenBSD port is that there are a lot of extra run-time checks for memory safety. Maybe it can lead to pinning down these garbage collection bugs.

Take the following backtrace from a segfault on flx_pkgconfig:

#0  0x05bd5137 in thrkill () at {standard input}:5
#1  0x05b9558b in *_libc___stack_smash_handler (func=0x38aab014 "j__udyInsWalk", damaged=5) at /usr/src/lib/libc/sys/stack_protector.c:79
#2  0x18b46be4 in j__udyInsWalk () from /root/Projects/felix/build/release/host/bin/flx_pkgconfig
#3  0x18b3dc19 in JudyIns () from /root/Projects/felix/build/release/host/bin/flx_pkgconfig

BTW, this segfault is triggered in a version of felix that works on Windows, Linux, and OSX.

skaller commented 6 years ago

You mean "appears to work" which is a different animal. Many small tests work because the GC is never triggered. You can force it to be triggered in two ways:

1. Set the environment variable FLX_FINALISE=1 and make sure to use the correct English, not American, spelling. This forces the GC to run just before the process terminates. By default it doesn't because this speeds up termination.
2. Set FLX_MIN_MEM=N to reduce the trigger point for the first GC to N megabytes. You can also set FLX_FREE_FACTOR=N.M, where N.M is a floating point number telling Felix where to set the threshold for the next GC after a collection. A value of 1.1 says to set it at 10% more than the used memory, so you get lots of GCs that way. Note, it's 10% above the memory in use after the collection, not 10% above the previous trigger.

Now run the test suite or the build process and you get lots more crashes.
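
The free-factor arithmetic described above can be sketched in C as follows. This is only an illustration of the rule as stated in this comment, not Felix's actual code; the function name next_gc_threshold is hypothetical.

```c
#include <stddef.h>

/* Hypothetical illustration, not taken from the Felix runtime.
   After a collection, the next GC trigger is the memory still in use
   multiplied by the free factor. The previous trigger plays no part. */
size_t next_gc_threshold(size_t used_bytes, double free_factor)
{
    return (size_t)((double)used_bytes * free_factor);
}
```

For instance, if 100 MB is still live after a collection and the factor is 1.1, the next collection is triggered at roughly 110 MB, wherever the previous trigger happened to be. A factor close to 1.0 therefore makes collections very frequent.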

gbluma commented 6 years ago

Of course, but I'm not talking specifically about triggered GC cleanup events here.

I mean, using both systems in the same configuration (i.e. not cleaning up garbage), one will run programs and the other will not. OpenBSD seems to have some special protection mechanisms that help diagnose memory misuse on insertion, which is where the bugs seem to be lurking.

I'm mentioning GC here because fixing these particular stack-smashing issues may help isolate why the GC is buggy later on.

skaller commented 6 years ago

Yes, but the question is: is this a bug in Judy, or is Judy OK and something else is corrupting its indices? In the latter case, the bug may be detected when Judy functions run. One may ask, why only Judy functions? The answer may be that it could be any function, however most Felix programs have a simple collection of linked heap objects, often almost none at all because the optimiser gets rid of them, whereas Judy is a digital trie with cache line sized objects and lots and lots of them appear very fast.

One may also ask, why insertion? Because every allocation causes insertions. There is no other Judy action until the GC is run. At that time it does lookups, scans, and removal of keys.

It isn't possible, from Judy's perspective, for two OSes to run the program in the same environment: Judy is managing machine addresses returned by malloc, so the shape of the Judy array tries is heavily sensitive to the actual values malloc returns. Those values also depend on the exact binary code being run, dynamic linkage, and all the other system facilities that also use memory.

The key here is that the crashes we get are highly sensitive BUT they're quite determinate for the same binary on the same OS (because the process image is identical each time). Actually even a tiny change like setting an environment variable may matter because the program is ultimately run under shell in the same process as the shell.

Actually we could check this by simply running a loop a random number of times that mallocs some random amount of memory, before doing anything else: the random has to be really random though (seeded by the date and time or something). If some runs go and some crash, that tells us something but I don't know what or how it helps. Changing the GC parameters also changes the behaviour, but again, it's not clear if Judy just bugs out earlier, or something else bugs it out.
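
A minimal sketch of that experiment in C might look like the following. The helper names perturb_heap and perturb_heap_now, and the limits of 1000 blocks and 65536 bytes, are assumptions made for illustration; nothing like this exists in Felix as far as this thread shows. The blocks are leaked on purpose so that every later malloc sees a shifted heap.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not part of Felix: make a pseudo-random number of
   allocations of pseudo-random size and deliberately leak them, so that
   later malloc calls return different addresses from run to run.
   Returns the number of blocks actually allocated. */
size_t perturb_heap(unsigned seed, size_t max_blocks, size_t max_bytes)
{
    size_t blocks, i, done = 0;

    if (max_blocks == 0 || max_bytes == 0)
        return 0;

    srand(seed);
    blocks = 1 + (size_t)rand() % max_blocks;           /* 1..max_blocks */
    for (i = 0; i < blocks; ++i) {
        size_t bytes = 1 + (size_t)rand() % max_bytes;  /* 1..max_bytes */
        volatile char *p = malloc(bytes);
        if (p == NULL)
            break;
        p[0] = 0;  /* touch the block so the allocation is not optimised away */
        ++done;
    }
    return done;
}

/* Seed from the clock, as suggested above, with assumed limits. */
size_t perturb_heap_now(void)
{
    return perturb_heap((unsigned)time(NULL), 1000, 65536);
}
```

Calling perturb_heap_now() as the very first statement of main, then comparing which runs crash, would show whether the failures track the addresses malloc hands out.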

The problem here is that it is not just the code that matters. Judy is driven by RTTI tables which are hand written for RTL objects and generated for the rest of the program by the compiler, and any error in any of them will screw up the GC. The RTTI is used on allocation to calculate how much store to allocate (n-objects x size). If it is too small an amount we get a corruption from ordinary Felix usage.
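
To make the failure mode concrete, here is a simplified sketch in C of how a wrong size in a table entry leads to an undersized allocation. The struct rtti_entry and both functions are hypothetical stand-ins, far simpler than Felix's real RTTI records.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one RTTI table entry. */
struct rtti_entry {
    const char *type_name;
    size_t      object_size;  /* bytes per object, as recorded in the table */
};

/* Bytes requested for an array of n objects: n-objects x size. */
size_t alloc_bytes(const struct rtti_entry *t, size_t n)
{
    return n * t->object_size;
}

/* Bytes by which the allocation falls short of what the program will
   actually write, given the true object size; 0 if the table is right. */
size_t alloc_shortfall(const struct rtti_entry *t, size_t n, size_t true_size)
{
    size_t needed = n * true_size;
    size_t got = alloc_bytes(t, n);
    return needed > got ? needed - got : 0;
}
```

If a table entry records 16 bytes for an object that really occupies 24, an array of 10 gets 160 bytes instead of 240, and perfectly ordinary code then writes 80 bytes past the end of the block, into whatever the allocator placed next.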

felix-lang / felix

OpenBSD support #106