CTSRD-CHERI / cheribsd

FreeBSD adapted for CHERI-RISC-V and Arm Morello.
http://cheribsd.org

High build parallelism results in failure #1524

Open nwf opened 2 years ago

nwf commented 2 years ago

Smells like a missing dependency or .WAIT somewhere:

ld.lld: error: cannot open crtendS.o: No such file or directory
clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
--- libgssapi_ntlm.so.10.full ---             
*** Failed target: libgssapi_ntlm.so.10.full                                
*** Failed commands:                                
        @${ECHO} building shared library ${SHLIB_NAME}                                                                                                        
        => @true building shared library libgssapi_ntlm.so.10
        @rm -f ${SHLIB_NAME} ${SHLIB_LINK}                          
        => @rm -f libgssapi_ntlm.so.10 libgssapi_ntlm.so
        ${_LD:N${CCACHE_BIN}} ${LDFLAGS} ${SSP_CFLAGS} ${SOLINKOPTS}  -o ${.TARGET} -Wl,-soname,${SONAME} ${SOBJS} ${LDADD}
        => /cheri/out/mainline/sdk/bin/ccache-clang -target riscv64-unknown-freebsd14.0 --sysroot=/cheri/build/cornucopia-modernize/cheribsd-riscv64-purecap-build/cheri/source/cornucopia-modernize/cheribsd/riscv.riscv64c/tmp -B/cheri/build/cornucopia-modernize/cheribsd-riscv64-purecap-build/cheri/source/cornucopia-modernize/cheribsd/riscv.riscv64c/tmp/usr/bin -Wl,-Bsymbolic -Wl,--no-undefined -march=rv64imafdcxcheri -mabi=l64pc128d -Wl,-zrelro   --ld-path=/cheri/out/mainline/sdk/bin/ld.lld  -shared -Wl,-x -Wl,--fatal-warnings -Wl,--warn-shared-textrel  -o libgssapi_ntlm.so.10.full -Wl,-soname,libgssapi_ntlm.so.10 accept_sec_context.pico acquire_cred.pico add_cred.pico canonicalize_name.pico compare_name.pico context_time.pico creds.pico crypto.pico delete_sec_context.pico display_name.pico display_status.pico duplicate_name.pico export_name.pico export_sec_context.pico external.pico import_name.pico import_sec_context.pico indicate_mechs.pico init_sec_context.pico inquire_context.pico inquire_cred_by_mech.pico inquire_mechs_for_name.pico inquire_names_for_mech.pico inquire_sec_context_by_oid.pico iter_cred.pico kdc.pico prefix.pico process_context_token.pico release_cred.pico release_name.pico gss_oid.pico  -lcrypto  -lgssapi  -lkrb5  -lheimntlm  -lroken

nwf commented 1 year ago

Here's another variant of the same thing:

ld.lld: error: cannot open crtbeginS.o: No such file or directory
ld.lld: error: unable to find library -lgcc                            
ld.lld: error: unable to find library -lgcc 
clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
--- libprivateheimipcs.so.11.full ---           
*** Failed target: libprivateheimipcs.so.11.full                 
*** Failed commands:                                    
        @${ECHO} building shared library ${SHLIB_NAME}                 
        => @true building shared library libprivateheimipcs.so.11
        @rm -f ${SHLIB_NAME} ${SHLIB_LINK}                             
        => @rm -f libprivateheimipcs.so.11 libprivateheimipcs.so
        ${_LD:N${CCACHE_BIN}} ${LDFLAGS} ${SSP_CFLAGS} ${SOLINKOPTS}  -o ${.TARGET} -Wl,-soname,${SONAME} ${SOBJS} ${LDADD}
        => /cheri/out/mainline/sdk/bin/ccache-clang -target riscv64-unknown-freebsd14.0 --sysroot=/cheri/build/mainline/cheribsd-riscv64-purecap-build/cheri/source/mainline/cheribsd/riscv.riscv64c/tmp -B/cheri/build/mainline/cheribsd-riscv64-purecap-build/cheri/source/mainline/cheribsd/riscv.riscv64c/tmp/usr/bin  -march=rv64imafdcxcheri -mabi=l64pc128d -Wl,-zrelro   --ld-path=/cheri/out/mainline/sdk/bin/ld.lld  -shared -Wl,-x -Wl,--fatal-warnings -Wl,--warn-shared-textrel  -o libprivateheimipcs.so.11.full -Wl,-soname,libprivateheimipcs.so.11 server.pico common.pico  -lheimbase  -lroken  -lpthread
*** [libprivateheimipcs.so.11.full] Error code 1
jrtc27 commented 1 year ago

I don't see how there could possibly be a race: crtbeginS.o comes from lib/csu, which is in _startup_libs, and libgcc comes from lib/libcompiler_rt, which is in _prereq_libs. Are you sure this isn't a ccache issue, given that you've hacked your local environment up to use it, it shows in both error reports, and you're the only one to have seen this?

nwf commented 1 year ago

I'm not sure that this isn't a ccache issue, but AFAIK if ccache is invoking ld it's because it hasn't done the caching thing. FWIW, I suspect I'm also the only one building with -j160 and it's also possibly interesting that both reports are from the Kerberos-related part of the tree (in _prebuild_libs from the looks of it)?

ETA: so far, every time this has happened, it's sufficed to just restart the build, suggesting that whatever is going on is a function of transitory state.

jrtc27 commented 1 year ago

Those are in _prebuild_libs, which come strictly after _startup_libs and _prereq_libs; see the libraries target in Makefile.inc1 which is very definitely not parallelised. The error is likely not here but something went wrong with ccache earlier such that it didn't produce crtbeginS.o.

nwf commented 1 year ago

Hm. I hit this again,

ld.lld: error: cannot open crtendS.o: No such file or directory
ld.lld: error: cannot open crtn.o: No such file or directory
clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
--- libgssapi_krb5.so.10.full ---
*** Failed target: libgssapi_krb5.so.10.full
*** Failed commands:
        @${ECHO} building shared library ${SHLIB_NAME}
        => @true building shared library libgssapi_krb5.so.10
        @rm -f ${SHLIB_NAME} ${SHLIB_LINK}
        => @rm -f libgssapi_krb5.so.10 libgssapi_krb5.so
        ${_LD:N${CCACHE_BIN}} ${LDFLAGS} ${SSP_CFLAGS} ${SOLINKOPTS}  -o ${.TARGET} -Wl,-soname,${SONAME} ${SOBJS} ${LDADD}
        => /cheri/out/mainline/sdk/bin/ccache-clang -target riscv64-unknown-freebsd14.0 --sysroot=/cheri/build/cornucopia-modernize/cheribsd-riscv64-build/cheri/source/cornucopia-modernize/cheribsd/riscv.riscv64/tmp -B/cheri/build/cornucopia-modernize/cheribsd-riscv64-build/cheri/source/cornucopia-modernize/cheribsd/riscv.riscv64/tmp/usr/bin -Wl,-Bsymbolic -Wl,--no-undefined -march=rv64imafdc -mabi=lp64d -Wl,-zrelro   --ld-path=/cheri/out/mainline/sdk/bin/ld.lld -fstack-protector-strong -shared -Wl,-x -Wl,--fatal-warnings -Wl,--warn-shared-textrel  -o libgssapi_krb5.so.10.full -Wl,-soname,libgssapi_krb5.so.10 8003.pico accept_sec_context.pico acquire_cred.pico add_cred.pico address_to_krb5addr.pico aeap.pico arcfour.pico authorize_localname.pico canonicalize_name.pico ccache_name.pico cfx.pico compare_name.pico compat.pico context_time.pico copy_ccache.pico creds.pico decapsulate.pico delete_sec_context.pico display_name.pico display_status.pico duplicate_name.pico encapsulate.pico export_name.pico export_sec_context.pico external.pico get_mic.pico gkrb5_err.pico import_name.pico import_sec_context.pico indicate_mechs.pico init.pico init_sec_context.pico inquire_context.pico inquire_cred.pico inquire_cred_by_mech.pico inquire_cred_by_oid.pico inquire_mechs_for_name.pico inquire_names_for_mech.pico inquire_sec_context_by_oid.pico pname_to_uid.pico prefix.pico prf.pico process_context_token.pico release_buffer.pico release_cred.pico release_name.pico sequence.pico set_cred_option.pico set_sec_context_option.pico store_cred.pico ticket_flags.pico unwrap.pico verify_mic.pico wrap.pico gss_krb5.pico gss_oid.pico  -lgssapi  -lkrb5  -lcrypto  -lroken  -lasn1  -lcom_err
*** [libgssapi_krb5.so.10.full] Error code 1

and it's still in scrollback so I can look further down as bmake bails out. Of course there's the path to building this target, with a little bit intermixed

bmake[5]: stopped in /cheri/source/cornucopia-modernize/cheribsd/kerberos5/lib/libgssapi_krb5
1 error
bmake[5]: stopped in /cheri/source/cornucopia-modernize/cheribsd/kerberos5/lib/libgssapi_krb5
--- all_subdir_kerberos5/lib/libgssapi_krb5 ---
bmake[4]: stopped in /cheri/source/cornucopia-modernize/cheribsd/kerberos5/lib
--- realinstall_subdir_lib/libngatm ---
bmake[4]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib
--- kerberos5/lib__L ---
bmake[3]: stopped in /cheri/source/cornucopia-modernize/cheribsd

as expected, and a bunch of things like

--- realinstall_subdir_lib/geom/raid ---
bmake[5]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib/geom
--- realinstall_subdir_lib/liblzma ---
bmake[4]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib
--- realinstall_subdir_lib/libdevctl ---
bmake[4]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib

but also, suspiciously,

--- realinstall_subdir_lib/csu/riscv ---
bmake[5]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib/csu
--- realinstall_subdir_lib/csu ---
bmake[4]: stopped in /cheri/source/cornucopia-modernize/cheribsd/lib

The end of the bmake spew, FWIW, is

--- lib__L ---
bmake[3]: stopped in /cheri/source/cornucopia-modernize/cheribsd
--- libraries ---
bmake[2]: stopped in /cheri/source/cornucopia-modernize/cheribsd
Command exited with non-zero status 2
95.72user 24.83system 0:47.74elapsed 252%CPU (0avgtext+0avgdata 221888maxresident)k
3572154inputs+1826372outputs (19263major+616271minor)pagefaults 0swaps
--- _libraries ---
bmake[1]: stopped in /cheri/source/cornucopia-modernize/cheribsd
--- buildworld ---

(ETA: formatting) (ETA2: kerberos5/lib__L)

jrtc27 commented 1 year ago

Hm, _generic_libs= ${_cddl_lib} gnu/lib ${_kerberos5_lib} lib ${_secure_lib} and lib/Makefile also builds csu etc; the latter has .WAITs in it but the former doesn't stop kerberos libs being rebuilt. I guess https://github.com/CTSRD-CHERI/cheribsd/commit/67a7d46cb7294c24c18d5a093196bc455fb50abf is rearing its head again, just with inputs that get statically linked, not just shared libraries :(

nwf commented 1 year ago

As a very crude workaround, then, would

_generic_libs=  ${_cddl_lib} gnu/lib ${_kerberos5_lib} .WAIT lib .WAIT ${_secure_lib}

possibly do the right thing, minimizing the concurrent excitement around lib's descents into subdirs? (Dare I ask why lib/Makefile is descending into csu at all given the special handling in Makefile.inc1?)

brooksdavis commented 1 year ago

That does seem like it should work. It might be that a better answer is for lib/Makefile to be informed which stage it's in to avoid reinstalling.

It also does seem that if we had renameat2() with RENAME_EXCHANGE in FreeBSD then we could alter install(1) to not be subject to races on reinstall.

nwf commented 1 year ago

Why RENAME_EXCHANGE?

brooksdavis commented 1 year ago

Why RENAME_EXCHANGE?

If you install the file to a tmp file, then use RENAME_EXCHANGE to do a swap and delete the old one, all openers will always find a file, and a complete one at that.

nwf commented 1 year ago

Isn't that true of rename proper (he asks, knowing that POSIX is a beast quick to anger)? At least my reading of

If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other threads throughout the renaming operation and refer either to the file referred to by new or old before the operation began.

(emphasis mine) from https://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html suggests that renaming over an existing file doesn't have a hole where the target path doesn't exist?
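
The write-to-temporary-then-rename() pattern that this reading of POSIX permits can be sketched in C as below. This is a minimal illustration, not the xinstall.c implementation: `atomic_replace` is a hypothetical helper name, and the `.XXXXXX` temporary suffix and the fixed path buffer size are assumptions made for the example. It relies only on the guarantee quoted above, namely that a concurrent opener of the destination path sees either the old file or the complete new one.

```c
/* Sketch only: hypothetical helper, not taken from xinstall.c. */
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Replace dst with len bytes from data without ever leaving dst missing or
 * partially written. Returns 0 on success, -1 on failure with errno set.
 */
int
atomic_replace(const char *dst, const void *data, size_t len, mode_t mode)
{
	char tmp[1024];		/* assumed large enough for the example */
	const char *p = data;
	int fd, n, saved;

	/*
	 * The temporary lives next to dst so that both names are on the same
	 * file system; rename() cannot replace atomically across file systems.
	 */
	n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", dst);
	if (n < 0 || (size_t)n >= sizeof(tmp)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	if ((fd = mkstemp(tmp)) == -1)
		return (-1);

	/* Write the whole payload to the temporary, retrying short writes. */
	while (len > 0) {
		ssize_t w = write(fd, p, len);
		if (w == -1) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += w;
		len -= (size_t)w;
	}

	/* mkstemp() creates the file 0600; set the final mode before exposing it. */
	if (fchmod(fd, mode) == -1 || fsync(fd) == -1)
		goto fail;
	if (close(fd) == -1) {
		fd = -1;
		goto fail;
	}
	fd = -1;

	/* The only step that touches dst: old or new, never absent. */
	if (rename(tmp, dst) == -1)
		goto fail;
	return (0);

fail:
	saved = errno;
	if (fd != -1)
		close(fd);
	unlink(tmp);
	errno = saved;
	return (-1);
}
```

A caller would use it as `atomic_replace("libfoo.a", buf, buflen, 0444)`, where the file name is purely illustrative. The contrast is with unlinking the target and then writing the new file in place, which leaves a window in which a concurrent linker or exec finds no file or an incomplete one.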

brooksdavis commented 1 year ago

Indeed, you are right. I wonder if we just need to use install -S more aggressively. This might be a bit too much of a hammer, but I think this should work:

diff --git a/share/mk/bsd.lib.mk b/share/mk/bsd.lib.mk
index c98dec9045bf..970052164550 100644
--- a/share/mk/bsd.lib.mk
+++ b/share/mk/bsd.lib.mk
@@ -449,6 +449,7 @@ SHLINSTALLFLAGS+= -fschg
 # time round, but for now using -S ensures the install is atomic and thus we
 # never see a broken intermediate state, so use it even for NO_ROOT builds.
 .if !defined(NO_SAFE_LIBINSTALL) #&& !defined(NO_ROOT)
+INSTALLFLAGS+= -S
 SHLINSTALLFLAGS+= -S
 SHLINSTALLSYMLINKFLAGS+= -S
 .endif
nwf commented 1 year ago

GitHub needs a 🔨 reaction. :)

nwf commented 1 year ago

Oo, another instance of the same underlying cause cropping up elsewhere? I just saw

===> usr.sbin/nmtree (obj,all,install)
/cheri/source/mainline/cheribsd/tools/install.sh: line 85: /cheri/build/mainline/cheribsd-riscv64-hybrid-build/cheri/source/mainline/cheribsd/riscv.riscv64/tmp/legacy/usr/sbin/install: Permission denied
/cheri/source/mainline/cheribsd/tools/install.sh: line 85: exec: /cheri/build/mainline/cheribsd-riscv64-hybrid-build/cheri/source/mainline/cheribsd/riscv.riscv64/tmp/legacy/usr/sbin/install: cannot execute: Permission denied
--- installdirs-NLSDIR ---
*** Failed target: installdirs-NLSDIR
*** Failed commands:
        @${ECHO} installing DIRS ${_alldirs_${:UNLSDIR}}
        => @true installing DIRS NLSDIR
        ${INSTALL} ${${:UNLSDIR}TAG_ARGS} -d -m ${${:UNLSDIR}_MODE} -o ${${:UNLSDIR}_OWN}  -g ${${:UNLSDIR}_GRP} ${${:UNLSDIR}_FLAG} ${DESTDIR}${${:UNLSDIR}}
        => sh /cheri/source/mainline/cheribsd/tools/install.sh -T package=utilities -d -m 0755 -o root  -g wheel  /cheri/build/mainline/cheribsd-riscv64-hybrid-build/cheri/source/mainline/cheribsd/riscv.riscv64/tmp/legacy/usr/share/nls
*** [installdirs-NLSDIR] Error code 126
bmake[3]: stopped in /cheri/source/mainline/cheribsd/usr.bin/sort
--- realinstall_subdir_usr.sbin/zic/zic ---
bmake[3]: stopped in /cheri/source/mainline/cheribsd/usr.sbin/zic
--- _bootstrap-tools-usr.bin/grep ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-sbin/md5 ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- realinstall_subdir_usr.sbin/zic/zdump ---
bmake[3]: stopped in /cheri/source/mainline/cheribsd/usr.sbin/zic
--- _bootstrap-tools-usr.sbin/zic ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
1 error
bmake[3]: stopped in /cheri/source/mainline/cheribsd/usr.bin/sort
--- _bootstrap-tools-usr.bin/sort ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-usr.bin/mandoc ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-tools/build/bootstrap-m4 ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-usr.sbin/nmtree ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-libexec/flua ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
--- _bootstrap-tools-lib/libzstd ---
bmake[2]: stopped in /cheri/source/mainline/cheribsd
Command exited with non-zero status 2
216.25user 72.05system 0:35.58elapsed 810%CPU (0avgtext+0avgdata 223168maxresident)k
1911606inputs+169967outputs (97091major+1158097minor)pagefaults 0swaps
--- _bootstrap-tools ---
bmake[1]: stopped in /cheri/source/mainline/cheribsd
--- buildworld ---

with

/cheri/source/mainline/cheribsd/usr.bin/xinstall/xinstall.c:813:60: warning: unused parameter 'fset' [-Wunused-parameter]
install(const char *from_name, const char *to_name, u_long fset, u_int flags)
                                                           ^
/cheri/source/mainline/cheribsd/usr.bin/xinstall/xinstall.c:1237:59: warning: unused parameter 'sbp' [-Wunused-parameter]

right before it. Looks like we're clobbering the install program in parallel with someone trying to install. This is without the -S suggestion above; the failing build was part of a larger script that had been running for a long while.

brooksdavis commented 1 year ago

Hmm, so my suggestion above won't fix the install issue because it will only affect non-shared libraries (shared libraries already use -S in CheriBSD). It looks like the easy fix would be to define PRECIOUSPROG and NO_FSCHG in usr.bin/xinstall/Makefile.

brooksdavis commented 1 year ago

I wondered a bit if we should just always be using -S. I applied a patch to xinstall.c to just enable it all the time and, while the difference is measurable, it's pretty small. Here's ministat output for wall clock time doing cheribuild cheribsd-morello-purecap where make is invoked with -j40. This is on a previously built tree, so we're mostly (pointlessly) reinstalling things:

x default.stats
+ safecopy.stats
+------------------------------------------------------------------------------+
|   x   xx               x      +              x             + +   +          +|
||_______M_________A_________________|     |________________A__M_____________| |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        143.22       144.579       143.394      143.6904    0.55837649
+   5       144.112       145.534       145.071      144.9786    0.52705484
Difference at 95.0% confidence
        1.2882 +/- 0.791849
        0.896511% +/- 0.553698%
        (Student's t, pooled s = 0.542942)

Both user and sys time have no significant difference. I need to do a bit more testing, but I'm tempted to just make -S a no-op.
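
As a cross-check, the absolute interval in the ministat output can be re-derived from the two summary rows. The working below assumes an equal-variance (pooled) two-sample Student's t with n = 5 per set and the two-sided 95% critical value for 8 degrees of freedom, approximately 2.306; that critical value is a standard table figure and does not appear in the output itself.

```latex
\begin{align*}
s_p &= \sqrt{\frac{s_x^2 + s_+^2}{2}}
     = \sqrt{\frac{0.55837649^2 + 0.52705484^2}{2}} \approx 0.5429 \\
\Delta &= \bar{t}_+ - \bar{t}_x = 144.9786 - 143.6904 = 1.2882 \\
\text{half-width} &= t_{0.975,\,8}\; s_p \sqrt{\tfrac{1}{5} + \tfrac{1}{5}}
     \approx 2.306 \times 0.5429 \times 0.6325 \approx 0.7918 \\
\frac{\Delta}{\bar{t}_x} &= \frac{1.2882}{143.6904} \approx 0.90\%
\end{align*}
```

The half-width (about 0.79 s) is smaller than the 1.29 s difference, so the interval excludes zero. That is why the slowdown counts as measurable even though it is under 1% of wall clock time.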

brooksdavis commented 1 year ago

Since @dch asked, the patch is:

diff --git a/usr.bin/xinstall/xinstall.c b/usr.bin/xinstall/xinstall.c
index 05b1444506db..43b11d1627db 100644
--- a/usr.bin/xinstall/xinstall.c
+++ b/usr.bin/xinstall/xinstall.c
@@ -121,7 +121,8 @@ extern char **environ;
 static gid_t gid;
 static uid_t uid;
 static int dobackup, docompare, dodir, dolink, dopreserve, dostrip, dounpriv,
-    safecopy, verbose;
+    verbose;
+static int safecopy = 1;
 static int haveopt_f, haveopt_g, haveopt_m, haveopt_o;
 static mode_t mode = S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;
 static FILE *metafp;