dylan-lang / opendylan

Open Dylan compiler and IDE
http://opendylan.org/

Build failures on Ubuntu 13.10 with Boehm #715

Open gareth-rees opened 10 years ago

gareth-rees commented 10 years ago

I tried bootstrapping Open Dylan on Ubuntu 13.10 on i386 by following the instructions in the README. My configuration command was:

./configure --with-gc=boehm

It completed stages one and two of the bootstrapping, then stage 3 got as far as:

Building coff-builder...

where it got stuck. After 40 minutes, top showed dylan-compiler still using 97% CPU, so I killed the build and re-ran it. This second build got a bit further but then came to a halt with:

fdmake: build product opendylan/Bootstrap.3/databases/dfmc-llvm-back-end.ddb missing
fdmake: build product opendylan/Bootstrap.3/lib/libdfmc-llvm-back-end.so missing

I guessed that I might be missing a dependency, so I installed the clang-3.4 package and tried again. This time it ran to completion.

(It seems that this was just coincidence: see discussion below.)

housel commented 10 years ago

There's no good reason why compiling dfmc-llvm-back-end should require clang-3.4 to be installed. Could you supply the opendylan/Bootstrap.3/build/logs/compile-dfmc-llvm-back-end.txt and opendylan/Bootstrap.3/build/dfmc-llvm-back-end/build.log files for the failing build?

gareth-rees commented 10 years ago

I can't send you the logs for the original build that failed, I'm afraid: the successful bootstrap that I ran afterwards seems to have overwritten them.

So I uninstalled clang-3.4 and re-ran the bootstrap from the beginning. It got through stages 1 and 2, and then in stage 3 it got as far as

Building dfmc-definitions...

and then it stopped. After 40 minutes top showed that dylan-compiler was still using 90% CPU so I killed the build and re-started it. (Something similar happened the first time too: I've updated my original report to mention it, just in case it's significant.) On the second attempt it got a bit further but then came to a halt with:

Building harp-x86... 0 W, 0 SW, 0 E (2.487 seconds)
fdmake: build product opendylan/Bootstrap.3/databases/harp-x86.ddb missing
fdmake: build product opendylan/Bootstrap.3/lib/libharp-x86.so missing
fdmake: compile failed (139), see opendylan/Bootstrap.3/build/logs/compile-harp-x86.txt

Here's the contents of opendylan/Bootstrap.3/build/logs/compile-harp-x86.txt:

Welcome to Open Dylan, Version 2014.1pre.

For documentation on Open Dylan, see http://opendylan.org/documentation/.
See http://opendylan.org/documentation/getting-started-cli/ for an introduction to the command line tools.

Type "help" for more information.
Opened project harp-x86 (/home/gdr/info.ravenbrook.com/project/mps/master/tool/.test/boehm/opendylan/sources/harp/x86/harp-x86.lid)
Loading namespace for library harp-x86
Number of libraries to compile: 31
Library "dylan" is up to date.
Library "common-dylan" is up to date.
Library "io" is up to date.
Library "system" is up to date.
Library "collections" is up to date.
Library "source-records" is up to date.
Library "walker" is up to date.
Library "dfmc-mangling" is up to date.
Library "variable-search" is up to date.
Library "dood" is up to date.
Library "parser-run-time" is up to date.
Library "jam" is up to date.
Library "file-source-records" is up to date.
Library "release-info" is up to date.
Library "build-system" is up to date.
Library "ppml" is up to date.
Library "dfmc-common" is up to date.
Library "dfmc-back-end-protocol" is up to date.
Library "harp-cg-back-end" is up to date.
Library "generic-arithmetic" is up to date.
Library "big-integers" is up to date.
Library "harp" is up to date.
Library "binary-manager" is up to date.
Library "binary-builder" is up to date.
Library "binary-outputter" is up to date.
Library "gnu-as-outputter" is up to date.
Library "coff-manager" is up to date.
Library "coff-builder" is up to date.
Library "harp-coff" is up to date.
Library "harp-native" is up to date.
Updating definitions for library harp-x86
Updating definitions for harp-x86
  Reading and installing: module.dylan

Let me know if there are any other logs you would like to see: I'll preserve the state of this failed build for a bit just in case.

waywardmonkeys commented 10 years ago

This sounds like the build is randomly failing. If you get the situation where it has gone into an infinite loop again, then a backtrace from the debugger would be useful.

What are your platform details? What version of Boehm GC?

waywardmonkeys commented 10 years ago

(And, as housel said, nothing requires Clang / LLVM to be installed on Linux.)

gareth-rees commented 10 years ago

Sure, I understand that I was wrong about the missing dependency: it was just a coincidence that the build failed without Clang installed and then succeeded with Clang installed. I've updated the issue title accordingly.

Platform details: Ubuntu 13.10 32-bit, downloaded from here, running on VMWare Fusion. Boehm GC package version is libgc-dev 1:7.2d-5ubuntu1. (It's what you get if you select the "libgc-dev" package in Ubuntu Software Center.)

I'll start another build and if it gets stuck in the same way, I'll attach the debugger and send you a backtrace.

waywardmonkeys commented 10 years ago

You've run into problems multiple times ... that makes this interesting.

gareth-rees commented 10 years ago

OK, this time it got stuck in stage 3 at:

Building dfmc-modeling...

I attached gdb to the runaway process, and here's the backtrace

#0  0x4073904a in GC_clear_fl_marks () from /usr/lib/i386-linux-gnu/libgc.so.1
#1  0x4073917a in GC_finish_collection ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#2  0x40739785 in GC_try_to_collect_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#3  0x4073a052 in GC_collect_or_expand ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#4  0x4073a1b4 in GC_allocobj () from /usr/lib/i386-linux-gnu/libgc.so.1
#5  0x4073eecf in GC_generic_malloc_inner ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#6  0x4073fde3 in GC_generic_malloc_many ()
   from /usr/lib/i386-linux-gnu/libgc.so.1
#7  0x40748dc7 in GC_malloc () from /usr/lib/i386-linux-gnu/libgc.so.1
#8  0x401c637b in MMAllocateObject (gc_teb=<optimized out>, 
    wrapper=0x410e72c8 <KLsimple_classified_variable_name_fragmentGVdfmc_readerW>, size=20)
    at opendylan/sources/lib/run-time/boehm-collector.c:124
#9  primitive_alloc_s (size=20, 
    wrapper=0x410e72c8 <KLsimple_classified_variable_name_fragmentGVdfmc_readerW>, no_to_fill=4, fill=0x401e6074 <KPunboundVKi>)
    at opendylan/sources/lib/run-time/boehm-collector.c:350
#10 0x4109bf37 in Kmake_identifierVdfmc_readerMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-reader.so
#11 0x4109e551 in Kget_tokenVdfmc_readerMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-reader.so
#12 0x410abb40 in KlexF194I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-reader.so
#13 0x41a628be in Krun_parserVparser_run_timeI ()
   from opendylan/Bootstrap.2/bin/../lib/libparser-run-time.so
#14 0x410ab9b7 in Kread_top_level_fragmentVdfmc_readerMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-reader.so
#15 0x40e464d0 in K168I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#16 0x4087900d in Kdo_with_open_source_fileYfile_source_records_implementationVfile_source_recordsI ()
   from opendylan/Bootstrap.2/bin/../lib/libfile-source-records.so
#17 0x40879ca5 in Kcall_with_source_record_input_streamVsource_recordsMfile_source_recordsM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libfile-source-records.so
#18 0x40e463e4 in K165I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#19 0x40e4619c in Kcompute_source_record_top_level_formsVdfmc_managementMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#20 0xbf9820a4 in ?? ()
#21 0x40e45a41 in Kupdate_compilation_record_definitionsVdfmc_managementI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#22 0x40e4500a in Kensure_library_definitions_installedVdfmc_managementI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#23 0x40e43982 in Kcompute_library_definitionsVdfmc_managementMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#24 0x401e662c in KLkeyword_methodGVKe ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#25 0x401e60ac in KLbooleanGVKd ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#26 0x0000000c in ?? ()
#27 0xbf98224c in ?? ()
#28 0x404344c4 in Kdo_with_profilingYcommon_dylan_internalsVcommon_dylanI ()
   from opendylan/Bootstrap.2/bin/../lib/libcommon-dylan.so
#29 0x40e43608 in Kdo_timing_compilation_phaseVdfmc_managementI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#30 0x40e48e1e in K328I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#31 0x40e3fdf4 in Kdo_with_stage_progressVdfmc_managementI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#32 0x40e48d73 in K322I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#33 0x40dda3e6 in K588I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-namespace.so
#34 0x40dda4d9 in Kdo_with_library_descriptionVdfmc_namespaceI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-namespace.so
#35 0x40dd6246 in Kdo_with_library_contextVdfmc_namespaceMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-namespace.so
#36 0x40e48ac3 in K317I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#37 0x40cd47dd in Kdo_with_program_conditionsVdfmc_conditionsI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-conditions.so
#38 0x40e48960 in Kparse_project_sourcesVdfmc_managementMM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#39 0x409ef391 in K29I ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#40 0x40e404c4 in Kdo_with_library_progressVdfmc_managementI ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-management.so
#41 0x409ef279 in Kparse_and_compileYprojects_implementationVprojectsI ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#42 0x409f1138 in K183I ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#43 0x409f0a0c in Kupdate_librariesVprojectsMM1I ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#44 0x4096a980 in K519I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-environment-projects.so
#45 0x409ebbc1 in Kdo_with_dynamic_environmentYprojects_implementationVprojectsI ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#46 0x4096a46f in K516I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-environment-projects.so
#47 0x409e811a in Kdo_with_used_project_cacheYprojects_implementationVprojectsI
    ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#48 0x4096a06e in K513I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-environment-projects.so
#49 0x409ecddd in KPwith_compiler_lockYprojects_implementationVprojectsI ()
   from opendylan/Bootstrap.2/bin/../lib/libprojects.so
#50 0xbf982904 in ?? ()
#51 0x40969e96 in Kbuild_projectVenvironment_protocolsMdfmc_environment_projectsM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdfmc-environment-projects.so
#52 0x4016c2c5 in Khandle_missed_dispatchVKgI ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#53 0x401ccd3e in general_engine_node_n_optionals ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#54 0x406ffeba in Kdo_execute_commandVcommandsMenvironment_commandsM12I ()
   from opendylan/Bootstrap.2/bin/../lib/libenvironment-commands.so
#55 0x4016c2c5 in Khandle_missed_dispatchVKgI ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#56 0x401ccc9f in general_engine_node_n ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#57 0x40576950 in K165 ()
   from opendylan/Bootstrap.2/bin/../lib/libcommands.so
#58 0xbf982b8c in ?? ()
#59 0x4002f892 in KrunF217I ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#60 0x4002f6e9 in Kexecute_main_commandYconsole_environmentVdylan_compilerMM0I
    ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#61 0x4002f373 in Kdo_execute_commandVcommandsMdylan_compilerM0I ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#62 0x4016c2c5 in Khandle_missed_dispatchVKgI ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#63 0x401ccc9f in general_engine_node_n ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#64 0x40576950 in K165 ()
   from opendylan/Bootstrap.2/bin/../lib/libcommands.so
#65 0xbf982cbc in ?? ()
#66 0x4016c2c5 in Khandle_missed_dispatchVKgI ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#67 0x401ccc9f in general_engine_node_n ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan.so
#68 0x40576918 in K159 ()
   from opendylan/Bootstrap.2/bin/../lib/libcommands.so
#69 0xbf982d00 in ?? ()
#70 0x400316c7 in KmainYconsole_environmentVdylan_compilerI ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#71 0x4003185c in _Init_dylan_compiler__X_start_for_user_0 ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#72 0x40031866 in _Init_dylan_compiler__X_start_for_user ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#73 0x4002ef96 in _Init_dylan_compiler__local_ ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#74 0x4002ee5b in call_init_dylan ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#75 0x401c76af in dylan_init_thread (rReturn=0xbf982d90, 
    f=0x4002ee4e <call_init_dylan>, p=0x0, s=0)
    at opendylan/sources/lib/run-time/exceptions.c:46
#76 0x4002ee99 in dylan_initialize ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#77 0x4002eed5 in DylanSOEntry ()
   from opendylan/Bootstrap.2/bin/../lib/libdylan-compiler.so
#78 0x0804867c in main ()

I'll keep the VM state with the debugger connected to the runaway process: if there's anything you'd like me to investigate from here, let me know.

waywardmonkeys commented 10 years ago

If you can 'cont' and break in a couple of times, it would be useful to know:

gareth-rees commented 10 years ago

I tried this a few times: the stack is always the same. It looks as though the loop in GC_clear_fl_marks is not finishing. From reading the code it looks as though this will happen whenever there is a cycle in the graph q → obj_link(q). I found a thread on the Boehm-gc mailing list suggesting that this can be caused by a double free.

(I wonder why Boehm GC doesn't detect the cycle here and assert instead of going into an infinite loop? It would be easy to do Floyd cycle detection.)
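For reference, here is a minimal sketch of the kind of check being suggested, using Floyd's tortoise-and-hare on a singly linked list. The `cell` type and `free_list_has_cycle` are hypothetical stand-ins for illustration only; they are not the Boehm GC's own data structures or API, where the link is read through `obj_link(q)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a free-list cell: the first word of each
 * free object holds the link to the next one (the role obj_link(q)
 * plays in the discussion above). This is NOT the Boehm GC's type. */
typedef struct cell {
    struct cell *next;
} cell;

/* Floyd's tortoise-and-hare: returns true if following ->next from
 * head ever revisits a cell, false if the list ends in NULL. */
static bool free_list_has_cycle(const cell *head)
{
    const cell *slow = head;
    const cell *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;        /* advances one link per iteration */
        fast = fast->next->next;  /* advances two links per iteration */
        if (slow == fast)
            return true;          /* the pointers can only meet inside a cycle */
    }
    return false;                 /* fast reached the end: no cycle */
}
```

The check needs no extra memory, but it roughly doubles the pointer chasing per traversal, which is the cost at issue in the replies that follow.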

brucehoult commented 10 years ago

The GC keeps a list of free objects for each object size. When a new object is wanted, the first object in the appropriate free list is unlinked and returned. If the free list is empty then an unused memory block (normally a VM page on the CPU being used, i.e. probably 4 KB) is found or created, subdivided into equal-sized objects, and linked together as the new free list.

In normal use a free list will always consist of a simple linear list of objects from a single VM page, ordered by ascending memory address.
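A toy sketch of that scheme, assuming a 4 KB block and a free list threaded through the first word of each free object. `BLOCK_BYTES`, `cell`, `build_free_list` and `alloc_from` are names made up for this illustration; the real collector's block headers, size classes and locking are all left out.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES 4096  /* assumed block size, matching the 4 KB guess above */

/* Hypothetical free-list cell: a free object's first word is its link. */
typedef struct cell {
    struct cell *next;
} cell;

/* Subdivide one block into equal-sized objects and link them in
 * ascending address order. Returns the head of the new free list.
 * obj_bytes must hold a link and be pointer-aligned. */
static cell *build_free_list(void *block, size_t obj_bytes)
{
    assert(obj_bytes >= sizeof(cell) && obj_bytes % sizeof(void *) == 0);

    size_t count = BLOCK_BYTES / obj_bytes;
    char *base = block;

    for (size_t i = 0; i < count; i++) {
        cell *c = (cell *)(base + i * obj_bytes);
        /* the last object terminates the list */
        c->next = (i + 1 < count) ? (cell *)(base + (i + 1) * obj_bytes) : NULL;
    }
    return count ? (cell *)base : NULL;
}

/* Allocation: unlink and return the first free object, or NULL if
 * the list is empty (where a real allocator would fetch a new block). */
static void *alloc_from(cell **free_list)
{
    cell *c = *free_list;
    if (c != NULL)
        *free_list = c->next;
    return c;
}
```

With 16-byte objects this yields 256 cells per block, and the first call to `alloc_from` hands back the lowest address in the block.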

This code is extremely simple, robust, and well tested. It does not need any cycle detection slowing it down.

Therefore, cycles can only be created by unsafe code memory-stomping the data structure.

If you have a memory stomper then absolutely anything could happen and a loop in GC_clear_fl_marks() is probably one of the more benign cases.

The exception to this is GC_free(). GC_free() adds the object to the start of the free list. When you next allocate an object of that size (of any class) you WILL get back the object most recently GC_free()'d.

GC_free() is never necessary. It is purely a performance hack. If you use it you turn your safe language (e.g. Dylan) into an unsafe language no better than C.

If you call GC_free() on an object that you still have another reference to then very bad things can happen. You will probably very quickly end up with two objects in your program that share the same memory. They might be of different classes. Assignments to fields of one object (e.g. the initializer of the new object) will change fields in the other object. All kinds of invariants that user code or your safe-language compiler rely on can be violated.

A second call to GC_free() causing a loop in the free list is once again probably one of the most benign things that can happen. The program does not continue. You don't get wrong results later.
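To make the double-free case concrete, here is a small sketch of a LIFO free list with an unchecked push. `toy_free` and `walk_bounded` are hypothetical helpers written for this illustration, not Boehm GC functions; the only behaviour assumed from the description above is that freeing puts the object at the start of the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list cell: a free object's first word is its link. */
typedef struct cell {
    struct cell *next;
} cell;

/* Toy explicit free: push the object on the front of the free list.
 * Like the behaviour described above, it does not check whether the
 * object is already on the list. */
static void toy_free(cell **free_list, void *obj)
{
    cell *c = obj;
    c->next = *free_list;
    *free_list = c;
}

/* Walk the list for at most limit steps. Returns the number of cells
 * visited, which equals limit if the walk did not reach NULL. */
static size_t walk_bounded(const cell *head, size_t limit)
{
    size_t steps = 0;
    while (head != NULL && steps < limit) {
        head = head->next;
        steps++;
    }
    return steps;
}
```

Freeing an object that is already at the head of the list makes its link point at itself, so any unbounded walk over the list, like the one in the backtrace, never terminates. If other objects were freed in between, the second free produces a longer cycle instead of a self-loop.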

The program stopping with an error message would be slightly more helpful, but is it worth slowing down GC_clear_fl_marks() for normal code, just to compensate for incorrect use of an inherently unsafe performance hack, when the other likely results from calling GC_free() on an object you're still using are so much worse? I don't think so.

If there should be a check for free list loops anywhere, it should be in the only non-stomper place that can create them: GC_free() itself.

Stepping through the entire free list checking that the object you're freeing isn't already on the free list (or checking for a preexisting loop) is almost certainly more expensive than simply not using GC_free() and allowing the object to be GC'd in the normal way.

Don't use GC_free() unless you are really really really sure that you are using it properly.

And even then, you are probably better off using your own explicit free list for objects of that class. You'll still get incorrect results if you make a mistake, but they'll be a lot less wrong :)

waywardmonkeys commented 10 years ago

Using current master, you can do a bootstrap with --with-gc=malloc. This should let you run the compiler under valgrind and get at least some information back.

On Mac OS X, this doesn't report anything serious, but you could give it a try on x86-linux.

One issue is that the way that command line args are picked up on Mac OS X doesn't work well with how valgrind invokes dylan-compiler. If x86-linux has the same issue, then just try this command line:

valgrind path/to/malloc-using/dylan-compiler -build

And it'll try to build the entire compiler. This will be slow and take a LOT of memory, but it should help pinpoint the cause of this error if there's a memory smashing bug and it might help with tracking down if we're calling GC_FREE multiple times.

waywardmonkeys commented 10 years ago

@gareth-rees, did you get a chance to try this with --with-gc=malloc?

gareth-rees commented 10 years ago

Not yet, but it's on my to-do list.

waywardmonkeys commented 10 years ago

Okay. @fracek and I are working on a much more extensive set of changes that are intended to start getting the C back-end and run-time working with both Boehm and MPS. We still have some fun crashes in that, though, without even trying MPS yet. :)

waywardmonkeys commented 10 years ago

This should be fixed on master now.

MMAllocMisc was implemented with GC_MALLOC_ATOMIC and MMFreeMisc was implemented with GC_FREE. I believe that there was a chance that MMFreeMisc could run after the underlying memory had actually been recovered by the GC. This would result in GC_FREE being called on an invalid pointer (since the GC had already recovered the memory).

If that is the case, then this has been fixed by my change to have MMAllocMisc use GC_MALLOC_UNCOLLECTABLE in 5151f476d0aed617556c577c9d2429f26813ae70.

brucehoult commented 10 years ago

This seems like a most startling change!

Changing from GC_MALLOC_ATOMIC to GC_MALLOC_UNCOLLECTABLE implies that both attributes were chosen incorrectly before!

Previously the object could be collected if nothing else referred to it, but it was not itself scanned for pointers to other objects. Now, it can't be garbage collected, but needs to be scanned for pointers?

Are you sure it shouldn't be GC_MALLOC_ATOMIC_UNCOLLECTABLE? (i.e. semantically identical to malloc(), except implemented in the GC)

I don't know what the intended use is, so I can't say for sure.

Another question raised is this: if you had a pointer to the object in order to call GC_FREE() on it, then how/why could the GC have reclaimed it already?

This whole thing smacks of premature optimisation.

GC_FREE() is seldom a performance gain and carries HUGE risks. GC_MALLOC_UNCOLLECTABLE() is never a performance gain compared to simply making sure that the place you keep the pointer to it (which you must be doing if you're able to call GC_FREE() later) is somewhere that the GC will scan.

Virtually all the time, using simple GC_MALLOC() and allowing the GC to do the job it was designed to do is both the safest and the fastest thing to do.

GC_MALLOC_ATOMIC() can be a valuable performance tool when you know for sure something is both large and can't contain pointers e.g. character strings, sound or image data, but using it when you shouldn't is a risk.

waywardmonkeys commented 10 years ago

Relax @brucehoult!

I think that this fixes the crash that he was seeing, but I'm not sure. I know that correcting the implementation of MMAllocMisc on our merge-runtimes branch in this way fixed some of our crashes.

We're trying to make the characteristics under Boehm work the same as under MPS for some core functionality. This is part of being able to run with either Boehm or MPS with either back-end & run-time. With MPS, MMAllocMisc is implemented using a manually managed storage pool and the result must be manually freed with MMFreeMisc. Whether or not this pool is scanned for pointers remains a subject of some confusion for me so far. Given that we manually manage it in the MPS code, we want to manually manage it in the Boehm code and the best way to do that is via GC_MALLOC_UNCOLLECTABLE and GC_FREE.

The way that MMAllocMisc is often used is like this:

define macro with-storage
  { with-storage (?:name, ?size:expression) ?:body end }
  => { begin
         let ?name = primitive-wrap-machine-word(integer-as-raw(0));
         block ()
           ?name := primitive-wrap-machine-word
                      (primitive-cast-pointer-as-raw
                         (%call-c-function ("MMAllocMisc")
                            (nbytes :: <raw-c-unsigned-long>) => (p :: <raw-c-pointer>)
                            (integer-as-raw(?size))
                          end));
           if (primitive-machine-word-equal?
                 (primitive-unwrap-machine-word(?name), integer-as-raw(0)))
             error("unable to allocate %d bytes of storage", ?size);
           end;
           ?body
         cleanup
           if (primitive-machine-word-not-equal?
                 (primitive-unwrap-machine-word(?name), integer-as-raw(0)))
             %call-c-function ("MMFreeMisc")
               (p :: <raw-c-pointer>, nbytes :: <raw-c-unsigned-long>) => (void :: <raw-c-void>)
                 (primitive-cast-raw-as-pointer(primitive-unwrap-machine-word(?name)),
                  integer-as-raw(?size))
             end;
             #f
           end
         end
       end }
end macro with-storage;

It may well be that storing the pointer in a <machine-word> is enough to mask things such that the GC didn't see it.

I'll be interested in seeing if @gareth-rees is able to build or not with the current master and his configuration.