felix-lang / felix

The Felix Programming Language
Other
803 stars 43 forks source link

Make Failure On flx_run #59

Closed fundamental closed 4 years ago

fundamental commented 10 years ago

'make build' fails on Slackware64 14.1 (ocaml 3.12.1, python 2.7.5, llvm/clang 3.5/3.3) I'm reasonably certain that this also occurs on the 32 bit version though it has been a while since I have checked. Prior to this error, the build appears to be moving along smoothly besides the screens of warnings about the _start_switch label being unused.

It dies after appearing to compile all of the ML files as well as parsing through all the grammar files.

*** Error inbuild/release/host/bin/flx_run': munmap_chunk(): invalid pointer: 0x0000000002001340 ***`

From the resulting coredump:

#0  0x00007fe6a1ecddd9 in raise () from /lib64/libc.so.6
#1  0x00007fe6a1ecf4e8 in abort () from /lib64/libc.so.6
#2  0x00007fe6a1f0f6e7 in __libc_message () from /lib64/libc.so.6
#3  0x00007fe6a1f15837 in malloc_printerr () from /lib64/libc.so.6
#4  0x0000000000406633 in flx::gc::collector::malloc_free::deallocate (this=<optimized out>, p=0x2001340) at build/release/share/src/gc/flx_collector.cpp:34
#5  0x0000000000406cbc in flx::gc::collector::flx_collector_t::delete_frame (this=this@entry=0x19371d0, fp=<optimized out>) at build/release/share/src/gc/flx_collector.cpp:288
#6  0x0000000000406d07 in flx::gc::collector::flx_collector_t::reap (this=this@entry=0x19371d0) at build/release/share/src/gc/flx_collector.cpp:297
#7  0x0000000000406ece in flx::gc::collector::flx_collector_t::sweep (this=this@entry=0x19371d0) at build/release/share/src/gc/flx_collector.cpp:412
#8  0x00000000004073e6 in flx::gc::collector::flx_collector_t::impl_collect (this=0x19371d0) at build/release/share/src/gc/flx_collector.cpp:558
#9  0x0000000000407dfd in flx::gc::collector::flx_ts_collector_t::v_collect (this=<optimized out>) at build/release/share/src/gc/flx_ts_collector.cpp:21
#10 0x00007fe6a1634d04 in flx::gc::generic::collector_t::collect() () from build/release/host/lib/rtl/libflx_dynamic.so
#11 0x00007fe6a1421dfe in flx::gc::generic::gc_profile_t::actually_collect() () from build/release/host/lib/rtl/libflx_gc_dynamic.so
#12 0x00007fe6a1421d8d in flx::gc::generic::gc_profile_t::maybe_collect() () from build/release/host/lib/rtl/libflx_gc_dynamic.so
#13 0x00007fe6a1421fd4 in flx::gc::generic::gc_profile_t::allocate(flx::gc::generic::gc_shape_t const*, unsigned long, bool) () from build/release/host/lib/rtl/libflx_gc_dynamic.so
#14 0x00007fe6a142234b in operator new(unsigned long, flx::gc::generic::gc_profile_t&, flx::gc::generic::gc_shape_t const&, bool) () from build/release/host/lib/rtl/libflx_gc_dynamic.so
#15 0x00007fe6a1c681ac in flxusr::flx_build_prep::_lam_36125::resume (this=0x30bf030) at /home/mark/.felix/cache/text/home/mark/felix/src/tools/flx_build_prep.cpp:797
#16 0x0000000000405499 in flx::rtl::fthread_t::run (this=0x193b5a0) at build/release/share/src/rtl/flx_rtl.cpp:118
#17 0x0000000000404b0b in flx::run::sync_sched::frun (this=this@entry=0x193b698) at build/release/share/src/rtl/flx_sync.cpp:209
#18 0x0000000000405df2 in flx::run::async_sched::prun (this=0x193b660, block_flag=block_flag@entry=flx::run::async_sched::block) at build/release/share/src/rtl/flx_async_world.cpp:137
#19 0x0000000000403145 in flx::run::flx_world::run_until_complete (this=this@entry=0x1937080) at build/release/share/src/rtl/flx_world.cpp:303
#20 0x00000000004025cf in felix_run (argc=9, argv=0x7fffeb836e48) at build/release/share/src/flx_drivers/flx_run.include:335
#21 0x00000000004026e9 in main (argc=<optimized out>, argv=<optimized out>) at build/release/share/src/flx_drivers/flx_run_main.cxx:4
skaller commented 10 years ago

I'm not sure where you're up to in the build. for fbuild phase, there's a log in

build/release/fbuild.log

Can you try again with make rebuild? Shouldn't make a difference. I've seen this error before (clearly its a case of freeing invalid memory). It happened when the GC was run during the build. Not being able to find the actual problem I think I just refactored the build so it didn't need the GC. You can also try increasing the GC threshhold with

FLX_MIN_MEM=n

with n in Megabytes. Starts at 10 according to flx --help.

This is hard to debug because I don't get this problem on my Mac (clang 3.3svn) or on Ubuntu with gcc 4.x

skaller commented 10 years ago

BTW: you need to work around the problem during the build to get a built system, since there's no way for me to debug a problem I don't have. Once we both have built systems we need to find a test case that fails.

skaller commented 10 years ago

There are several possibilities for the bug: in the GC, in the compiler (probably calculating union representations), in the Felix library (just about anywhere) or in the build application (just about anywhere). Previously what happened is it failed at exactly the same point with a clean build, but if I restarted so it skipped building things already built it failed the same way later. But the number of "indirections" involved here is enormous :-) I did some simple very heavy tests on the GC and it seemed to work (in other words I proved nothing).

fundamental commented 10 years ago

Well, it builds now. So it does look like a GC issue of some sort. with FLX_MIN_MEM=1000 I have all regression tests passing and all tutorial examples passing (one failure from the optional gmp test, but I'd assume that to be unrelated). Without the environmental variable it ends up segfaulting after the twelfth test (this point of segfault can be modulated with the environmental variable).

Peaking at valgrind I see a ton of reads of uninitialized values at the size of pointers (mostly within a judy datastructure). To me that doesn't really tell anything as I'd expect some of those depending on how the garbage collection algorithm is searching about.

I'm not seeing anything in fbuild.log that looks particularly useful at the moment. Usually there's some reasonable extra logs or details to post, but for remote debugging of a garbage collector, I'm at a bit of a loss.

skaller commented 10 years ago

Yes, but it may not be the GC itself. There are some tests of the GC, in actual GC tests it never fails, not that there are many of them. (It fails to be fast, but never breaks). You can write a program which takes a parameter and then makes a list of 4 elements, then assigns the variable to itself + itself in a loop to get exponential memory use and make plenty of garbage. Set FLX_REPORT_COLLECTIONS=1 and you will see the activations of the collector.

In the test suites "flx" is running all the tests one after the other, the tests aren't failing, it's flx that's failing. "flx", and almost all Felix code, is linked against "flx_run" or "flx_arun", and that's where the failure is. flx_run is the driver code (main event loop) written in C++ and has been extensively refactored only a few weeks ago. It too uses the GC. For example "fthreads" are holders of a "procedure stack" and both are garbage collected.

So I suspect a bug in flx_run. The top level of the code is in

src/flx_drivers/flx_run.include

and in

src/rtl
    flx_world, flx_async_world .....

The setup is a bit messy but is the way it is because we have just finished refactoring the driver code so that Felix can be embedded in an foreign event loop. However I think the bug we are seeing existed before. "debugging" the GC is more or less impossible (way way too many objects are handled by definition of "out of memory") so really peer code review is probably better.

The one thing I tried before was to start the GC with almost no memory to force an almost immediate GC. When I set it to the first or second allocation triggers the GC it fails. Give it just 1K bytes and everything works.

BTW: if you have the latest Felix it's interesting the coroutine test failing for me on OSX 10.6.8+clang3.3svn is passing for you: it works with clang -01 and gcc 4.08 on Ubuntu too.

skaller commented 10 years ago

Just to get an idea: the GC is exact on the heap and conservative on the stack. The heap allocations use Judy to associate objects with a RTTI record which uses an array of offsets to locate pointers in the object. The GC doesn't scan objects which don't have an RTTI record, i.e. it can tell the difference between a Felix pointer and a raw C pointer or integer. It can also tell if a pointer is interior to an object.

Felix code generally relies on the flxg compiler to create the RTTI objects and ensure allocations associate the right ones. However C++ library code has to do this manually, and so do any Felix library bindings to C++. So the GC can easily fail if there is a bug in the compiler, or any one of the C++ bindings in the Felix library to heap objects (eg RE2) are wrong, or, anything in the C++ run time library is wrong. And any Felix or C++ code can mess up with casting. And of course there is the usual C null pointer deref or writing in the wrong place creating a corruption.

And it's possible some untraced object is holding a pointer to a traced object which is reaped by the GC and then tries to use it: untraced objects have to add_root() to hold traced pointers. The driver does a lot of that.

skaller commented 10 years ago

Above patch may help in the test process. A similar operation added to the failing build tool may help too. Would be nice if Mark can test since I don't get this failure on OSX or Ubuntu. Use

rm -rf test tutopt tut make test

to get a complete rerun.

fundamental commented 10 years ago

The tests still fail with lower memory values and interestingly enough I see increasing memory usage from the garbage collector's report when it does manage to run multiple times during 'make test'. I'm not sure if that's to be expected or not (perhaps 30% increase from gc around test 4 to around test 8 IIRC).

Another thing that surprised me was that under some values of minimum memory, is that it appeared possible for the garbage collection to get stuck (for at least a large number of seconds compared to the typical run time under a second). I haven't quite localized which line/region it gets stuck in, but when I did dump a few quick backtraces it was all gc routines calling into judy routines.

My mid week schedule is a bit more hectic than originally expected so I don't have the backtraces/logs to post at the moment. Unless the gc() has some huge variance to it I'd expect whatever is causing the gc to take an abnormal amount of time (possibly never terminating) to be a bug of some sort related or not. The other possibility is that some datastructure has already been corrupted at these program points.

I'll hopefully update in a day or two once I can organize the information a bit.

skaller commented 10 years ago

The GC uses Judy heavily. If the initial memory is too small -- and I mean a few K .. the system is bound to fail, since some of the fundamental data structures are GC allocated. However I have seen a problem with it failing on low memory too, but only when I REALLY pushed it, in the old days (so it did an GC on the second allocation :)

The GC is a standard naive mark/sweep GC. It does precise scans on heap objects (using RTTI to find pointers), and conservative scan on the machine stack. On x86_64 it is possible, though not likely, the "Red Zone" is causing problems, if setjump() doesn't know about it. (The red zone is 256 bytes (or something) I think, PAST the end of the stack, and may be used for temporary storage)

skaller commented 10 years ago

Ah interesting ..

FLX_REPORT_COLLECTIONS=1 FLX_MIN_MEM=0.1 make rebuild

actually collected 15774 objects, still allocated: 79 roots, 91311 objects, 8934698 bytes tools/flx_gramdoc -> flx_gramdoc Actually collect actually collected 16050 objects, still allocated: 80 roots, 92338 objects, 9034970 bytes tools/flx_libindex -> flx_libindex Actually collect actually collected 16399 objects, still allocated: 81 roots, 93114 objects, 9104536 bytes tools/flx_libcontents -> flx_libcontents Actually collect

The number of roots is growing. Has to be a bug.

skaller commented 10 years ago

The GC uses a really crud tuning algorithm. Starting from X amount of memory, when that is exceeded it does a collection, then the new threshold for a collection to trigger again is set to A * F where A is the amount of allocated memory after the collection, and F is the FLX_FREE_FACTOR. The default is 1.1, so you get about 10% extra free space over your reachable allocated memory.

This algorithm is not very good. If you set it to 10.0 for example you'd think the number of collections would reduce a lot, but that doesn't happen because if you happen to, or deliberately collect, at a point where you have a lot of garbage, you will end up with only 10 times the amount of allocated memory, which would be quite small.

FLX_MIN_MEM, on the other hand, is an absolute minimum for the threshold and doesn't change (although it is the initial threshold value).

skaller commented 10 years ago

The Felix call Gc::collect() does not appear to work! So the changes I made have no impact. However I have seen this:

Processing [11/127]: churn_01.flx Processing [12/127]: closure_constructors-01.flx Processing [13/127]: constraint_propagation-01.flx Processing [14/127]: coroutines-01.flx [flx_gc:gc_profile_t] actually_collect sh: line 1: 17650 Segmentation fault "env" "DYLD_LIBRARY_PATH=build/release/host/lib/rtl:$DYLD_LIBRARY_PATH" "build/release/host/bin/flx_run" "/Users/johnskaller/.felix/cache/binary/Users/johnskaller/felix/test/regress/rt/coroutines-01.dylib" "test" > "/Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/test/regress/rt/coroutines-01.stdout" Processing [15/127]: cstruct_constructors-01.flx

and what I assumed was a fault in coroutines with no optimisation on clang appears instead to be a GC fault!

skaller commented 10 years ago

Argg .. Gc::collect() does a service call which isn't allowed in a function. Archaic .. collection can be done without a service call now. Will fix that one.

skaller commented 10 years ago

I have fixed Gc::collect(), so it now runs when doing batch processing, after every call. This may or may not hide the segfault Mark (fundamental) gets with clang on Slackware. If it does I'll try it in the build system too.

The coroutine error above is a definite bug in clang handling assembler labels. Nothing to do with the Gc as far as I can tell. It doesn't occur with static linkage, and it doesn't occur if assembler labels are disabled even with dynamic linkage. Given how clang uses SSA form and single entry blocks with phi-node exits via LLVM its hardly surprising advanced control structures don't work. LLVM is not low level enough.

fundamental commented 10 years ago

I'm seeing it lock up a lot more than the previous commit and now it looks like make test does not finish even with a gig of memory. If this is a clang specific bug and clang is known to as a whole produce code less reliably for this codebase, then why does the build selection select it when a usable copy of gcc (4.8.2) is sitting around?

Assuming I can replicate things properly I should have the gdb backtraces from when it gets stuck posted shortly.

fundamental commented 10 years ago

Ok, the backtraces that I'm observing with no environmental variables: http://pastebin.com/u6fXajcG once again I can modulate when this failure happens by changing FLX_MIN_MEM. Last time I turned on full gc debugging output I ended up with a 2.9 Gig file as a result of it getting stuck.

skaller commented 10 years ago

There are two build systems. The first one, fbuild, written in Python by Erick Tryzelaar is used to make a bootstrap version of Felix and establish the default configuration. The second one is written in Felix.

The reason for the Python build given the Felix build is clear: bootstrapping. The reason for not just using the Python build is that Erick left the project and I found it too hard to maintain. In particular there is a design bug in fbuild where it selects the compiler based on the operating system and what's available (fair enough) but the choice of switches used is made separately on the same basis, instead of based on the compiler selected. This is so deeply embedded in the code I have no idea how to fix it.

With fbuild you can choose the compiler. The available flags are described with

fbuild/fbuild-light --help

I think you can use

fbuild/fbuild-light --build-cc=gcc --build=cxx=g++

in the Makefile to hack it to use gcc. There is no Felix tool to guess the configuation at the moment, it just uses the Python/fbuild generated one.

The reason clang is preferred over gcc is simple: my platform is OSX 10.6.8 which has gcc 4.2 and whatever clang i build. Apple will not support any further upgrades to gcc and it isn't possible to build your own. So on my platform I have both compilers and I want to use clang and so it has to be preferred or i have to use the switches. One problem with fbuild is you then have to say the switches on every use or it reconfigured to the guess. That's unimportant today but before the Felix build code existed it was a serious pain for me.

I'm trying to get rid of fbuild and Python, which can be done if you can start with binaries from a tarball, or if I generate bootstrapping code from Felix to C++ and then compile the C++ as the bootstrap. In the meantime we're stuck with it to bootstrap the system.

Once you have built Felix you can reconfigure the system to use gcc and rebuild it. In fact you can set it up to cross compile to any number of platforms.

skaller commented 10 years ago

So you're saying there's an infinite loop in scan_object which happens to cause Ctrl-C to kill the GC inside JudyLPrev each time. You do have 64 bit Judy on a 64 bit machine, is that right?

If you have a recent clang it might be the same issue gcc has which Felix fixes: a heavy optimisation that breaks stuff. Felix uses --no-strict-aliasing and for gcc 4.8 it has to also turn off aggressive loop optimisations. So this might not be a bug in clang, but a failure to disable an aggressive optimisation.

Whilst scan_object is a bit tricky, the GC runs numerous times during operation on OSX with my clang and Ubuntu using gcc 4-8.2 without crashing and scan object is therefore called millions of times without appearing to fail. Proves nothing but suggests a configuration issue rather than a straight out coding error.

fundamental commented 10 years ago

It looks like there is an infinite loop involving scan_object, and the SIGINTs are from me interrupting things such that I can dump another backtrace with gdb. The only libjudy that appears on my system is the one contained within the felix build and it is a 64 bit library build. I wouldn't necessarily guarantee that there isn't any weirdness due to clang/clang's version, but it does strike me as odd that the gc appears to hang.

skaller commented 10 years ago

Note I can build a working gcc 4.8.2 based 64 bit system on Ubuntu and tarball it and i would guess that should run on any 64 bit Linux distro with the same compiler.

Otherwise there are ways to make the Felix rebuild the target "host" for gcc 4.8.2. The only problems with that are that you need a working Felix first, and if you ever need to rebuild from scratch you will have to repeat the manual build hackery. Once you have an installed working Felix it should not be necessary, "in theory" to ever run the Python build system again, but in practice a couple of things go wrong: if the build tools themselves need modification to build the latest version, then the latest version of the build tools might not build using an older version. This can happen if the C++ RTL is refactored, for example, because some operations are hard coded in the tool.

Ultimately Felix generates C++ which is the same for all platforms of the same "kind", i.e. Felix library code can "conditionally compile" based on the OS, but detailed configuration is done at the C++ level not in Felix. Therefore in theory I can generate a working system for say Linux, with a single set of C++ files and only have a couple of configuration headers you need to edit to build it all, without needing anything other than a working C++ compiler (and Ocaml of course).

I intend to make a configuration tool. In the meantime I'm afraid you have to hand edit a couple of files to set the C MACROS as required, and play a bit with the Makefile to select the right compiler etc. The tool should fix all that eventually, however it really needs to be GUI based to be usable and I think that the only sane GUI to use would be your web browser. Which means the implementation would be on the server as well which means you have to have a working Felix and flx_web web server before you can play with configuration. All very chicken and egg.

skaller commented 10 years ago

I would not be surprised that there is a bug in the GC if I'd just written it, it's complex, messy, involves nasty casts, etc. What seems surprising is that it works in other contexts. If there is a bug, why does it manifest for you but not me? If it's a coding error, I could be getting corrupted memory too, but not seeing any consequences.

Scan_object is a bit messy as follows: the scanning is done with an RTTI based scanner, so it can handle foreign data types by using a custom scanner. The default scanner uses an array of offsets in the RTTI to find pointers. Now, the standard algorithm is a recursive descent with a marker saying you've visited this object before to break recursions. Because of the abstraction involved handling each pointer one at a time is expensive, so I have used a buffer. This is particularly important scanning arrays. Instead of recursing immediately on each pointer, the algorithm stashes the pointer in a small buffer until it is full, THEN it recursively examines each pointer in the buffer.

So if you look at the scan_object code and wonder what the heck is going on, that's the answer. The code has to be decoupled due to custom scanners and the buffers are there to put the performance back. So managing the buffers adds to the complexity and there may be a bug.

[There is definitely a bug in something, continually adding roots, but that cannot be in the Gc itself, since it doesn't add roots (just manages the roots that are there). So some other RTL function is adding roots but forgetting to unroot the object. That should be easy to find by making the debug routine say what type the roots are. I think this is a new problem added during the refactoring done very recently, it's not a show stopper, just a not quite finished job]

skaller commented 10 years ago

Aha .. now i can get a segfault myself!

~/felix>FLX_REPORT_COLLECTIONS=1 FLX_MIN_MEM=0.1 make test

...

[flx_gc:gc_profile_t] actually_collect actually collected 882 objects, still allocated: 4 roots, 6853 objects, 614274 bytes [flx_gc:gc_profile_t] actually_collect actually collected 1061 objects, still allocated: 4 roots, 6792 objects, 614322 bytes [FLX_REPORT_COLLECTIONS] Collection report enabled [Gc::collect] Program requests collection [flx_gc:gc_profile_t] actually_collect actually collected 5194 objects, still allocated: 4 roots, 6786 objects, 612044 bytes [Gc::collect] Collector collected 5194 objects Processing [27/127]: forall-01.flx [flx_gc:gc_profile_t] actually_collect actually collected 1476 objects, still allocated: 4 roots, 7038 objects, 622870 bytes [flx_gc:gc_profile_t] actually_collect actually collected 848 objects, still allocated: 4 roots, 7221 objects, 639146 bytes [flx_gc:gc_profile_t] actually_collect actually collected 847 objects, still allocated: 4 roots, 7374 objects, 653222 bytes [flx_gc:gc_profile_t] actually_collect actually collected 905 objects, still allocated: 4 roots, 7469 objects, 663314 bytes [flx_gc:gc_profile_t] actually_collect actually collected 849 objects, still allocated: 4 roots, 7620 objects, 680286 bytes [flx_gc:gc_profile_t] actually_collect actually collected 930 objects, still allocated: 4 roots, 7690 objects, 687770 bytes [flx_gc:gc_profile_t] actually_collect actually collected 818 objects, still allocated: 4 roots, 7872 objects, 705646 bytes [flx_gc:gc_profile_t] actually_collect actually collected 880 objects, still allocated: 4 roots, 7992 objects, 719246 bytes [flx_gc:gc_profile_t] actually_collect actually collected 883 objects, still allocated: 4 roots, 8109 objects, 731342 bytes [flx_gc:gc_profile_t] actually_collect actually collected 1099 objects, still allocated: 4 roots, 8084 objects, 737496 bytes make: *\ [regress-check] Segmentation fault ~/felix>

When I set FLX_MIN_MEM=0.01 instead of 0.1 it doesn't crash.

fundamental commented 10 years ago

If there is a bug, why does it manifest for you but not me?

That's a question that a lot of programmers spend quite some time thinking about. With C++ the most frequent answer that I've seen is subtlety undefined/illdefined behavior.

and I just saw the most recent message... Hurray segfaults for everyone!

skaller commented 10 years ago

First time I have to agree getting a segfault is a GOOD thing :-) Now I have to make it repeatable by rm -rf test so we're always re-running all the tests from scratch.

skaller commented 10 years ago

Now, I think I might add documentation comments to the GC in an effort to find the bug and also help with peer review. Can't think of what else to do .. any suggestions welcome! Code that works "most of the time" is hard to deal with (my segfault just went away .. :)

However it's almost always a bug at a boundary condition.

skaller commented 10 years ago

So here is a potentially infinite loop (with newly added comments):

// Now, scan the temporary set in j_tmp until it is empty // When we're processing an object with scan_object // if its an actual Felix object we mark it reachable // and then scan all the pointers in it: usually those pointers // are not scanned immediately by scan object but simply put // into the set j_tmp to schedule them for scanning. // // Note: Judy1First finds the first key greater than or equal to the given one, // it returns 0 if there is no such key. Word_t toscan = 0ul; int res = Judy1First(j_tmp,&toscan,&je); // get one object scheduled for scanning while(res) { Judy1Unset(&j_tmp,toscan,&je); // remove it immediately scan_object((void*)toscan, reclimit); // scan it, it will either be marked or discarded toscan = 0ul; res = Judy1First(j_tmp,&toscan,&je); }
assert(j_tmp == 0);

What this code does is take a set of pointers in j_tmp, grab one of them, remove it from j_tmp, then scan it. The scanning process checks if it is a Felix pointer and if not does nothing.

If it is, it checks if it is already marked reachable. If so it does nothing.

If it is not marked reachable, it is marked reachable. Then all the contained pointers are examined, which may add entries to the j_tmp collection.

I claim this process must terminate. Evidence is otherwise. My proof is that the scanner never follows a pointer path more than once. So only a finite number of elements can be added to the set. Since each iteration removes one, it must terminate.

Note that scan_object CAN and will be passed the same pointer many times. Although the j_tmp array cannot hold more than one copy of a given value, it can be removed and then put back later. However this doesn't matter, because once the pointer is marked reachable no pointer contained it it will be added to the j_tmp buffer set (at least through THAT scan_object call).

fundamental commented 10 years ago

(bit of a brain dump after reading it once) I'll just restate things as I'm reading them to come up with a list of assertions that if verified would indicate something outside of the code is going wrong.

mark() all objects exist in j_shape j_shape consists of objects with LSB defining reachability

scan_object(x) all objects reachable from x are either marked in j_shape or placed in the work queue j_tmp

I would expect scan_object(x) to return immediately if it is already marked which seems to be taken care of in *ppshape&1UL)==reachable , though perhaps not I would also expect that already marked objects would not be added to the work list, though it is a somewhat redundant check. (down by if(reclimit==0)) Though there might be some larger context to it, at first glance letting the marked flag be either 0 or one via parity seems odd. So, asserts that need to likely be made are:

skaller commented 10 years ago

All heap objects exist in j_shape.

Already marked objects ARE added to the work list. The reason is that the code to calculate if an object is interior to a Felix heap object is non trivial, and constitutes the bulk of what scan_object does. Scan object then continues (if the pointer is interior and the object isn't marked reachable already) to call the scanner.

This could be refactored. The lookup to find the type of an object already exists somewhere else, so that could be done before calling scan_object, but given an object is not yet examined and has to be scanned, scan object then wouldn't do anything except call the scanner :) It's not clear how the best performance would be obtained.

The latest commit explains why the reachability flag oscillates. It saves a complete, cache killing pass over all the object just to mark them garbage, then a second pass over all the reachable objects to say they're not garbage. If we just reinterpret the "not garbage" flag to mean garbage AND we assume all the garbage found last time is deleted, then we save that initial garbage marking pass by saying all the previously found reachable objects are garbage (as well as all new ones added since). So now everything is garbage we only have to mark non-garbage. I.e. a two pass algorithm is reduced to a single pass.

So: yes, more than one instance of x can be put into j_tmp over time. However at any given time there's only one possible, that's a property of Judy1Array (it's an ordered set of pointers although the name suggests a map from pointer to bool, the false case is just "missing from set").

Already marked pointers may go into j_tmp. It's just a cache of pointers we haven't considered yet. So we don't know if they're marked yet.

get_used would require checking. In particular not the calculation as such, but the argument pass to it has to be the head (lowest address) of the heap object I think.

You can pass crap into scan object. It shouldn't matter. If that crap "happens" to point into a Felix heap object it is fine, its considered reachable, and the object may "leak", i.e. not get deleted. The big issue is if you FAIL to pass a reachable pointer in, then you might delete something live. This can happen easily if the compiler generates the wrong list of offsets for the object, or the library programmer manually associating a type with an object messes up (which I have done before). Or it can happen if I got the RTTI objects for RTL (C++ run time library) wrong.

IF an object is on the Felix heap THEN there must be an accurate associated shape and array lengths (there are two dynamic lengths for every object: the total length and the number of actually used slots. If the Judy arrays storing these lengths are missing a pointer, it means the value is 1).

Uninitialised reads can happen on the stack. But they can also happen if a Felix object doesn't initialise a pointer. The compiler tries to ensure all pointers are initialised to 0. However a wrong pointer cannot cause a crash or corruption or other fault (unless of course the client code actually dereferences it but that's nothing to do with the GC, that's a client bug).

fundamental commented 10 years ago

It does look like everything should terminate fine, but it doesn't look like it is terminating. I guess the next useful debugging info I can produce is the contents/size of jl_tmp as with respect to time.

skaller commented 10 years ago

Ouch .. that would be hard to analyse. If there's really an infinite loop, where is it?

Of course in theory here is an infinite loop: you have X allocated and reallocation factor 1.1. You want to allocate X * (0.1 + delta). You're screwed. Every attempt to do the allocation triggers a GC which then sets the threshhold to the same inadequate value.

But I doubt that's the problem. There's an obvious fix too.

skaller commented 10 years ago

Extra invariant to check, possible bug!

Felix assumes the allocator will always return a 4 byte aligned pointer. malloc() is the default allocator and should do this.

Code can and does use this property, to put flags etc in the low two bytes of some pointers. That code has to mask the bits out before dereferencing of course.

The GC doesn't know which pointers might have these bits used for such purposes and should always mask them out. It is safe to do this because heap pointers are always at least 4 byte aligned, and so there cannot be a valid pointer less then 4 bytes distant from any other.

However its is not safe to NOT do it! If the GC fails to do this, it may have a pointer to the fourth byte after a one byte value, eg a pointer to a single heap allocated char. Although the allocator will probably allocate 4 bytes here, the Felix GC will record the length as one byte and judge that the pointer with the trick bits in it is NOT interior to the object.

This may cause the object to be freed because it is considered unreachable.

One should note carefully Felix does NOT allow "one past the end" pointers. It's fine for C objects, and its fine if there is another pointer into an object, but if you have only a single pointer into a Felix heap allocated buffer if you go past the end the object is unreachable. You're not allowed to move such a pointer back.

Felix "for" loops take special care never to overrun ranges.

skaller commented 10 years ago

This works:

FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 FLX_MIN_MEM=0.01 FLX_DEBUG_DRIVER=1 build/release/host/bin/flx --t est=build/release test/regress/rt/list-01

but this segfaults:

FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 FLX_MIN_MEM=0.01 FLX_DEBUG_DRIVER=1 build/release/host/bin/flx --t est=build/release --indir=test/regress/rt/ --regex='list-01.flx'

The two commands should more or less do the same thing. This comparison isolates the bug to a few lines of code in the top level of the "flx" command. Regexps, iterators are used in the second one but not the first. The crashing command crashes before launching the program:

Processing 1 files Segmentation fault

but the correct sequence is:

Processing 1 files Processing [01/01]: list-01.flx [FLX_REPORT_COLLECTIONS] Collection report enabled

which is what happens on the same command with more memory. So its pretty tightly isolated where the crash point is (though not the cause).

skaller commented 10 years ago

Well .. another test i get a segfault on a forced rebuild (flx does more work), but it works for --static. The mystery deepens .. i have put a diagnostic for out of memory forcing a GC and can't seem to trigger it.

skaller commented 10 years ago

Ok, so i'm going to mess a bit with the triggering algorithm for the GC. Currently there's some messing about with GC_FREQ=1000, meaning a collection and threshhold reset is done every 1000 allocations. Its also done on out of physical memory (malloc returns 0), or "would be over threshhold". The latter seems wrong because there's no assurance after the GC and threshhold reset that it still won't be. I think I will make the threshhold drag a bit more. Reset at the moment is max (FLX_MIN_MEM, allocated * FREE_FACTOR) which means a program with oscillating memory requirements will do a lot of GC's when expanding and some contracting (once every 1000 allocations).

I'm also looking at adding memory profiling, that is counting allocations and deallocations per type. Some messing about will be required with plugins because they have their own private RTTI lists.

skaller commented 10 years ago

None of this will solve the bug but I'm homing in on it :)

fundamental commented 10 years ago

Yeah, even when I first saw this behavior I wasn't expecting it to be a simple fix. I'll wish you luck on tracking the issue down as my that's likely the best help that I can be at the moment.

skaller commented 10 years ago

Almost Infinite loop here:

~/felix>FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 FLX_MIN_MEM=0.1 FLX_DEBUG_DRIVER=1 build/release/host/bin/flx --te st=build/release --static --indir=test/regress/rt/ --regex='list-01.flx' ... prun: Before running: Sync state is Next fthread pos [sync] frun: entry ft=0x0, active size=1 [flx_gc:gc_profile_t] Threshhold 0 would be exceeded, collecting [flx_gc:gc_profile_t] actually_collect actually collected 0 objects, still allocated: 2 roots, 5 objects, 240 bytes [flx_gc:gc_profile_t] New Threshhold 264 [flx_gc:gc_profile_t] Threshhold 264 would be exceeded, collecting [flx_gc:gc_profile_t] actually_collect actually collected 0 objects, still allocated: 2 roots, 6 objects, 280 bytes [flx_gc:gc_profile_t] New Threshhold 308 [flx_gc:gc_profile_t] Threshhold 308 would be exceeded, collecting [flx_gc:gc_profile_t] actually_collect actually collected 0 objects, still allocated: 2 roots, 10 objects, 760 bytes [flx_gc:gc_profile_t] New Threshhold 836 [flx_gc:gc_profile_t] Threshhold 836 would be exceeded, collecting [flx_gc:gc_profile_t] actually_collect actually collected 1 objects, still allocated: 2 roots, 12 objects, 808 bytes [flx_gc:gc_profile_t] New Threshhold 888 [flx_gc:gc_profile_t] New Threshhold 888 [flx_gc:gc_profile_t] Threshhold 888 would be exceeded, collecting [flx_gc:gc_profile_t] actually_collect actually collected 0 objects, still allocated: 2 roots, 14 objects, 1260 bytes [flx_gc:gc_profile_t] New Threshhold 1386 Segmentation fault As you can see FLX_MIN_MEM=0.1 is interpreted as 0. So the collector starts with a zero threshhold. It allocates 10% more than actual use, so close to a collection every allocation. Very slow startup. And this is with my new algorithm.

It still segfaults. but not inside the collector. Here's more clue: [gc] Scan object 0x7fff5fbff6b8, reachable bit value = 1 [gc] DONE: Conservate scan of memory 0x7fff5fbff0d8->0x7fff5fbff6b8 [gc] Scanning roots [gc] Scanning root 0x100500760 [gc] Scan object 0x100500760, reachable bit value = 1 [gc] Scanning root 0x1005007d0 [gc] Scan object 0x1005007d0, reachable bit value = 1 [gc] MARKING object 0x1005007d0, shape 0x100225098, type=flx_libinst [gc] Scan object 0x100500870, reachable bit value = 1 [gc] Scan object 0x100500810, reachable bit value = 1 [gc] Scan object 0x100500710, reachable bit value = 1 [gc] MARKING object 0x100500710, shape 0x100225038, type=flx_dynlink_t [gc] Done Scanning roots [gc] Collector: Sweep, garbage bit value=0 [gc] Reachable 0x100500710=flx_dynlink_t [gc] Reachable 0x100500760=fthread_t [gc] Reachable 0x1005007d0=flx_libinst [gc] Reachable 0x100500810=init [gc] Reachable 0x100500870=thread_frame_t [gc] Reachable 0x1005009c0=char [gc] Reachable 0x1005009e0=addsymbols [gc] Garbage 0x100500a50=char [gc] Garbage 0x100500a80=_a4583t_40725 [gc] Garbage 0x100500a90=add_symbol [gc] Garbage 0x100500d40=_a10674t_40722 [gc] Garbage 0x100500d50=char [gc] Reachable 0x100500d70=char [gc] Reachable 0x100500ee0=add_symbol [gc] Sweeped 5 [gc] Free 0x100500a50 [gc] Free 0x100500a80 [gc] Free 0x100500a90 [gc] Free 0x100500d40 [gc] Free 0x100500d50 [gc] Reaped 5 objects [gc] Still allocated 9 objects occupying 752 bytes Thread 7fff70c98cc0 Restarting world [gc] FINISHED collect, thread 7fff70c98cc0 actually collected 5 objects, still allocated: 2 roots, 9 objects, 752 bytes [flx_gc:gc_profile_t] New Threshhold 827 [gc] Malloc 8 bytes, address = 0x100500d40 [gc] Allocated 0x100500d40, shape=0x100224ad8 = new _a10676t_40721 [gc] Malloc 30 bytes, address = 0x100500d50 [gc] Allocated 0x100500d50, shape=0x100224b28 = new char Segmentation fault Crashes at the same point with --static and without.

skaller commented 10 years ago

And here's the crash:

Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000006 0x0000000100137a64 in JudySLIns (PPArray=0x6, Index=<value temporarily unavailable, due to optimizations>, PJError=0x100500f40) at build/release/share/src/judy/src/JudySL/JudySL.c:408 408 if (*PPArray == (Pvoid_t)NULL) // no next JudyL array.

skaller commented 10 years ago

And this occurs in flx, in routine add_symbol, which is used to simulate dynamic linkage of a plugin when the code is static linked. That in turn uses a strdict (string to pointer associative array) implemented by a JudySLArray mapping module and symbol names to addresses: the registry array is owned by the garbage collector (it has nothing to do with GC, it was just a convenient "global" object to attach it to :)

So the bug is either in add_symbol, or in strdict (or in the JudySL scanner used by the GC).

What's more it happens I can demonstrate this:

FLX_REPORT_COLLECTIONS=1 FLX_FINALISE=1 FLX_MIN_MEM=0.1 build/release/host/bin/bootflx --test=build/release --static --indir=test/regress/rt/ --regex='list-01.flx'

This command works. The difference between flx and bootflx is that bootflx includes everything with an include directive, whereas flx uses static linked plugins. And thus it is using the dynlink, and add_symbol but bootflx is not.

So now I'm on the warpath. The problem isn't isolated to one line of code but it will be found. Look like something trivial: an extra or missing level of indirection handling JudySLArray. [Judy use is full of casts left right and centre]

skaller commented 10 years ago

Aha but no, by unspoken fear seems right. The JudySLArray here is a strdict (string dictionary). It is attached to the GC. It is not garbage collected.

And here was me thinking, well that's safe! Ah no, but it isn't: its wrong on two counts:

proc add (x:strdict) (key:string) (value: T) { var err: JError_t; var slot : && T; JudySLIns (repr x, key.cstr, &err, C_hack::cast&&word); slot <- new value; }

First, see that "slot" has "new value" put in. The value is copied onto the Felix heap, i.e. into storage allocated by the GC!

And second, key.cstr:

// Entirely unsafe because string could be an rvalue. fun _unsafe_cstr: string -> +char = "((char*)$1.c_str())" is atom;

// partially unsafe because the string could be modified. fun stl_begin: &string -> +char = "($1->c_str())" is atom; fun stl_end: &string -> +char = "($1->c_str()+$1->size())" is atom;

// this operation is entirely safe because the char array // provided is copied (and garbage collected) fun cstr (x:string) : +char => (varray[char] x).stl_begin;

so the string is ALSO garbage collected. Now that one seems safe but unnecessary because it is only used whilst Judy is creating the key and is on the stack so it is reachable. Actually an unsafe_cstr would be fine if the argument string were on the stack, since the GC wouldn't know about it. If the argument is on the heap, its safe too, because the GC cannot start until the current thread lets it. If we ever forcibly stop a thread this wouldn't work (can be done on Linux with a signal).

So its the first case I suspect. Bug candidate 1. Its not an error in any of the Judy or Strdict code. but it is a lack of indirection. The actual module_registry is just a void * in the GC object. This gets cast to a JudySLArray.

fun get_module_registry: 1 -> registry_t = "&(PTF gcp->collector->module_registry)";

That's the address of the slot in the GC. So all the typing is right, but the GC is not tracing that pointer! So the addresses stored in the array will be reaped by the GC. CRASH. Especially as what is stored is a pointer to a code address. Segfault!

Why doesn't this happen normally, or only late in a run of flx? Because the code addresses are normally only used once at startup. So even if the GC screws them it will be after their use. but this isn't so if one is looping around, continually recreating the frames for static link plugins... which happens in flx with a regexp argument! And in the build system.

It all fits. The solution is to add the thing as a root, but to do that we need a stable heap allocated address, so we actually do need to add one level of indirection somehow. Either that, or rewrite the code to use a pure Judy Array instead of a strdict. Sounds like work..

The only problem adding the root is that now we have to actually create the dictionary in Felix code.

skaller commented 10 years ago

Well I think i have fixed this, but now I get a crash at a different location,

Processing 1 files

Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: 13 at address: 0x0000000000000000 0x00000001001347fc in flx::gc::generic::gc_profile_t::allocate (this=0xd000100506470, shape=0x10021a910, count=1, allow_gc=true) at build/release/share/src/gc/flx_gc.cpp:90 90 if (amt + collector->get_allocation_amt() > threshhold && allow_collection_anywhere && allow_gc) (gdb) bt

0 0x00000001001347fc in flx::gc::generic::gc_profile_t::allocate (this=0xd000100506470, shape=0x10021a910, count=1, allow_gc=true) at build/release/share/src/gc/flx_gc.cpp:90

1 0x0000000100134d7f in operator new (amt=160, gcp=@0xd000100506470, shape=@0x10021a910, allow_gc=true) at build/release/share/src/gc/flx_gc.cpp:213

Die: DWTAG (abbrev = 66, offset = 10236) has children: FALSE attributes: DW_AT_type (DW_FORM_ref4) constant ref: 10249 (adjusted) Die: DWTAG (abbrev = 66, offset = 10236) has children: FALSE attributes: DW_AT_type (DW_FORM_ref4) constant ref: 10249 (adjusted) Dwarf Error: Cannot find type of die [in module /Users/johnskaller/felix/build/release/host/bin/flx]

skaller commented 10 years ago

Linux build crashes at exactly the same point.

skaller commented 10 years ago

Cause of problem isolated and possibly the bug is found! (Still rebuilding to check).

The problem is like this: when you have an abstract type, it comes with its own constructors in C++ bindings, and generally whilst a value can be copied on the heap, the object itself can't be scanned for interior pointers because Felix knows nothing about this.

Judy arrays are a good example.

To fix this, you can provide a custom scanner. Judy array types have these.

Typically, however, these objects are immobile. Judy arrays, RE2 objects, etc, don't come with copy constructors. Felix does NOT allow such types in general use.

So what you do is use a pointer as the abstract type and make the C++ constructor heap allocate the immobile object.

However there's a special case: an abstract type which is a pointer to the heap. Indeed, that special case is required for the previously mentioned immobile types generally, because, for example, the finaliser (destructor) of a Judy Array or RE2 object is not the same as the (trivial) finaliser for a pointer.

That special case can be indicated with the property _gc_pointer. It says "this abstract type is actually a pointer, possibly to a Felix heap allocated objects". This tells the garbage collector to treat any object containing such a types the same way as a pointer: trace it.

Now, if you heap allocate a _gc_pointer you have an ordinary Felix pointer to a _gc_pointer. Which is what I was doing with the symbol dictionary. In that case the _gc_pointer itself needs a shape, with a single pointer at offset 0.

And there is a bug. The compiler wasn't putting the offset array into the shape object, so the pointer wasn't traced, so what it pointed at could become unreachable.

If you run code with FLX_MIN_MEM=0.0 then at the start, every allocation will trigger a garbage collection which immediately reaped the symbol dictionary. This made the pointer to it dangle. This happened even though that pointer was a root, because it just pointed at another pointer, which it turn pointed at yet another pointer, and the on in the middle wasn't being traced so the one on the end got deleted.

This would screw up ALL heap allocations of _gc_pointers, not just this one, that would include RE2 regexps.

So basically: Strdict[address] works fine but a pointer to it doesn't, because it is actually a _gc_pointer to a pointer value which is the JudySLArray.

Of course it remains to be seen if this fixed my code. This bug did NOT cause the original fault leading to this PR. That was cause by mismanagement of the library/instance objects. The fix to that is incomplete and incorrect too, but statically linked plugins also require the symbol table to be traced and the fix that was correct, but the Felix code wasn't compiled to C++ correctly. Hopefully now that is fixed too :)

skaller commented 10 years ago

Ok, the latest commit should work better. I have it running some programs with FLX_MIN_MEM=0.1.

It still fails with FLX_MIN_MEM=0.0 which means there are still some issues to resolve: running with zero initial memory is a hard test and should uncover memory management issues. So I'm persuing that.

AFAIK the GC itself works correctly: I found a fault in one of the shape tables used to tell the collector where to find pointers in an object, which is fixed, and some of the bindings, which weren't adjusted for the new way to handle dynamic linkage. There's still an issue with that (the DLL representative really needs to be made a root). It's all about ensuring live object remain reachable via pointers the GC can trace, in Felix code, in bindings to C++ in the Felix library, and in the C++ in the run time library. As well as across DLL boundaries etc.

If all the paths are not properly represented, objects which are reachable can be freed. Debugging faults in this area is very hard.

skaller commented 10 years ago

[master 377ed65] Get the dynlink module to respect the GC properly.

Seems to fix this fully. In particular

FLX_MIN_MEM=0.0 make test

runs on my box without segfault (except coroutines-01, which is a bug in clang). Mark, perhaps you can try, close bug if ok for you?

Meanwhile I'll build on Ubuntu and see if the server stops hogging the CPU.

fundamental commented 10 years ago

Nope, it's still broken. On a clean copy checked out from master I saw the same segfault on make build. I then gave it enough memory to avoid gc during build and now I'm staring at an open instance of gdb which shows that it's hanging around the same section of garbage collection when running the tests.

skaller commented 10 years ago

For me this segfaults:

FLX_MIN_MEM=0 build/release/host/bin/flx --test=build/release --usage=prototype --expect --nonstop --indir=test/regress/rt --regex='.*.flx' test

but if I replace "flx" with "bootflx" it doesn't. Its the same program, except that bootflx includes all the toolchains textually, whereas dflx loads them dynamically, and flx is dflx plus a preloading thunk.

Everything SHOULD work with zero initial memory. You have an infinite loop now in the GC?

fundamental commented 10 years ago

Right now without the environmental variable GC goes into a multi hour loop. Specifying enough memory seems to work and specifying FLX_MIN_MEM=0 works better than it used to, but it still hangs at the 39th test.

skaller commented 10 years ago

what if you use bootflx instead of flx? [eg cp build/release/host/bin/bootflx build/release/host/bin/flx; make test]

fundamental commented 10 years ago

hm, that's interesting. With the copied version it consistently dies on list-02.flx with a "double free or corruption". Running that file on its own I can manage to get it to complete with FLX_MIN_MEM=135 which is the minimum needed to avoid gc. FLX_REPORT_COLLECTIONS also indicates that at smaller sizes the garbage collection runs without a problem several times at the smaller MIN_MEM amounts, but it will eventually crash.