Closed by jameshcorbett 4 months ago
Is it possible to get that dumped core by any chance? I can think of a way an unsatisfiable job might get a different return code, but not the other.
I think the flux-core spec file has some code, possibly commented out, that will dump the output files from failed tests. Or you can temporarily cat *.output in the specfile. If the tests only fail in the buildfarm but succeed on actual TOSS 4 systems, then you could consider disabling them for this release.
Got it. There was some logic in the sched specfile to dump the output of failed tests, but it looks like it's broken. I'll see what I can do...
This logic:
ctest -V %{?_smp_mflags} ||
(echo done testing && ls -l &&
 # quote the -name patterns so the shell can't glob-expand them before find sees them
 find . -name '*.output' &&
 find . -name '*.output' | xargs -t -I{} cat {} &&
 find . -name '*.log' | xargs -t -I{} cat {} &&
 echo -e "thread apply all bt\nquit" >cmds &&
 # dump a backtrace from any core files left behind
 for core in $(find . -name '*core*'); do
   echo found corefile $core &&
   gdb --command=cmds /usr/libexec/flux/cmd/flux-broker $core;
 done && exit 1)
isn't giving me anything; it only finds /builddir/build/BUILD/flux-sched-0.36.0/./t/data/resource/jgfs/hwloc_4core.json
Dang. The tests should all dump verbose output if ctest is run with -V or --output-on-failure. Not sure about cores; is it using coredumpctl?
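For reference, the alternative invocation in the specfile would look like this (same arguments as the existing logic above, just swapping the verbosity flag):

ctest --output-on-failure %{?_smp_mflags}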
I'm not familiar with coredumpctl, but running coredumpctl list in the specfile gave me:
coredumpctl list
Failed to acquire bus: No such file or directory
No journal files were found.
No coredumps found.
However, looking in the logs I found something I had overlooked earlier:
28/94 Testing: t1020-qmanager-feasibility.t
28/94 Test: t1020-qmanager-feasibility.t
Command: "/builddir/build/BUILD/flux-sched-0.36.0/t/scripts/maybe-installtest" "/builddir/build/BUILD/flux-sched-0.36.0/t/t1020-qmanager-feasibility.t" "-v"
Directory: /builddir/build/BUILD/flux-sched-0.36.0/t
"t1020-qmanager-feasibility.t" start time: Jul 04 17:41 PDT
Output:
----------------------------------------------------------
expecting success:
load_test_resources ${excl_1N1B}
{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-15"}}],"starttime":0.0,"expiration":0.0,"nodelist":["cab1234"]}}
ok 1 - feasibility: loading test resources works
ok 1 - feasibility: loading test resources works
feasibility: loading test resources works
expecting success:
flux module load sched-fluxion-resource prune-filters=ALL:core \
subsystems=containment policy=low &&
load_qmanager
ok 2 - feasibility: loading resource and qmanager modules works
ok 2 - feasibility: loading resource and qmanager modules works
feasibility: loading resource and qmanager modules works
expecting success:
flux run -n 999 --dry-run hostname | \
flux job-validator --jobspec-only --plugins=feasibility \
| jq -e ".errnum != 0" &&
flux run -n 999 --dry-run hostname | \
flux job-validator --jobspec-only \
--plugins=feasibility \
--feasibility-service=sched-fluxion-resource.satisfiability \
| jq -e ".errnum != 0"
true
true
ok 3 - feasibility: --plugins=feasibility works
ok 3 - feasibility: --plugins=feasibility works
feasibility: --plugins=feasibility works
expecting success:
flux module reload job-ingest validator-plugins=feasibility
ok 4 - feasibility: loading job-ingest with feasibilty works
ok 4 - feasibility: loading job-ingest with feasibilty works
feasibility: loading job-ingest with feasibilty works
expecting success:
test_must_fail flux submit -n 170 hostname 2>err1 &&
grep "Unsatisfiable request" err1 &&
test_must_fail flux submit -N 2 -n2 hostname 2>err2 &&
grep "Unsatisfiable request" err2 &&
test_must_fail flux submit -g 1 hostname 2>err3 &&
grep -i "Unsatisfiable request" err3
[Errno 19] Unsatisfiable request
[Errno 19] Unsatisfiable request
flux-start: 0 (pid 3650164) Aborted
not ok 5 - feasibility: unsatisfiable jobs are rejected
not ok 5 - feasibility: unsatisfiable jobs are rejected
#
# test_must_fail flux submit -n 170 hostname 2>err1 &&
# grep "Unsatisfiable request" err1 &&
# test_must_fail flux submit -N 2 -n2 hostname 2>err2 &&
# grep "Unsatisfiable request" err2 &&
# test_must_fail flux submit -g 1 hostname 2>err3 &&
# grep -i "Unsatisfiable request" err3
#
expecting success:
jobid=$(flux submit -n 8 hostname) &&
flux job wait-event -t 10 ${jobid} start &&
flux job wait-event -t 10 ${jobid} finish &&
flux job wait-event -t 10 ${jobid} release &&
flux job wait-event -t 10 ${jobid} clean
flux-submit: ERROR: Unable to connect to Flux: broker socket /tmp/flux-WCxcEG/local-0 was not found
not ok 6 - feasibility: satisfiable jobs are accepted
not ok 6 - feasibility: satisfiable jobs are accepted
#
# jobid=$(flux submit -n 8 hostname) &&
# flux job wait-event -t 10 ${jobid} start &&
# flux job wait-event -t 10 ${jobid} finish &&
# flux job wait-event -t 10 ${jobid} release &&
# flux job wait-event -t 10 ${jobid} clean
#
expecting success:
flux module reload job-ingest validator-plugins=feasibility,jobspec
flux-module: flux_open: No such file or directory
not ok 7 - feasibility: load job-ingest with two validators
not ok 7 - feasibility: load job-ingest with two validators
#
# flux module reload job-ingest validator-plugins=feasibility,jobspec
#
expecting success:
test_must_fail flux submit -n 170 hostname 2>err4 &&
grep "Unsatisfiable request" err4 &&
test_must_fail flux submit -N 2 -n2 hostname 2>err5 &&
grep "Unsatisfiable request" err5 &&
test_must_fail flux submit -g 1 hostname 2>err6 &&
grep -i "Unsatisfiable request" err6
not ok 8 - feasibility: unsatisfiable jobs are rejected
not ok 8 - feasibility: unsatisfiable jobs are rejected
#
# test_must_fail flux submit -n 170 hostname 2>err4 &&
# grep "Unsatisfiable request" err4 &&
# test_must_fail flux submit -N 2 -n2 hostname 2>err5 &&
# grep "Unsatisfiable request" err5 &&
# test_must_fail flux submit -g 1 hostname 2>err6 &&
# grep -i "Unsatisfiable request" err6
#
expecting success:
jobid=$(flux submit -n 8 hostname) &&
flux job wait-event -t 10 ${jobid} start &&
flux job wait-event -t 10 ${jobid} finish &&
flux job wait-event -t 10 ${jobid} release &&
flux job wait-event -t 10 ${jobid} clean
flux-submit: ERROR: Unable to connect to Flux: broker socket /tmp/flux-WCxcEG/local-0 was not found
not ok 9 - feasibility: satisfiable jobs are accepted
not ok 9 - feasibility: satisfiable jobs are accepted
#
# jobid=$(flux submit -n 8 hostname) &&
# flux job wait-event -t 10 ${jobid} start &&
# flux job wait-event -t 10 ${jobid} finish &&
# flux job wait-event -t 10 ${jobid} release &&
# flux job wait-event -t 10 ${jobid} clean
#
expecting success:
cleanup_active_jobs
flux-queue: ERROR: Unable to connect to Flux: broker socket /tmp/flux-WCxcEG/local-0 was not found
not ok 10 - feasibility: cleanup active jobs
not ok 10 - feasibility: cleanup active jobs
#
# cleanup_active_jobs
#
expecting success:
remove_qmanager &&
remove_resource
flux-module: flux_open: No such file or directory
not ok 11 - feasibility: removing resource and qmanager modules
not ok 11 - feasibility: removing resource and qmanager modules
#
# remove_qmanager &&
# remove_resource
#
# failed 7 among 11 test(s)
1..11
# failed 7 among 11 test(s)
1..11
<end of output>
Test time = 7.21 sec
----------------------------------------------------------
Test Failed.
"t1020-qmanager-feasibility.t" end time: Jul 04 17:41 PDT
"t1020-qmanager-feasibility.t" time elapsed: 00:00:07
---------------------------------------------------------
cat ./t/t1020-qmanager-feasibility.broker.log
Jul 05 00:41:31.563654 UTC broker.debug[0]: insmod connector-local
Jul 05 00:41:31.563860 UTC broker.info[0]: start: none->join 1.43049ms
Jul 05 00:41:31.563934 UTC broker.info[0]: parent-none: join->init 0.055589ms
Jul 05 00:41:31.564583 UTC connector-local.debug[0]: allow-guest-user=false
Jul 05 00:41:31.564593 UTC connector-local.debug[0]: allow-root-owner=false
Jul 05 00:41:31.601250 UTC broker.debug[0]: insmod content
Jul 05 00:41:31.621869 UTC broker.debug[0]: insmod barrier
Jul 05 00:41:31.685188 UTC broker.debug[0]: insmod content-sqlite
Jul 05 00:41:31.687204 UTC content-sqlite.debug[0]: /tmp/flux-WCxcEG/content.sqlite (0 objects) journal_mode=OFF synchronous=OFF
Jul 05 00:41:31.687476 UTC content.debug[0]: content backing store: enabled content-sqlite
Jul 05 00:41:31.703794 UTC broker.debug[0]: insmod kvs
Jul 05 00:41:31.723703 UTC broker.debug[0]: insmod kvs-watch
Jul 05 00:41:31.776695 UTC broker.debug[0]: insmod resource
Jul 05 00:41:31.873651 UTC broker.debug[0]: insmod cron
Jul 05 00:41:31.873950 UTC cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
Jul 05 00:41:31.896608 UTC broker.debug[0]: insmod job-manager
Jul 05 00:41:31.904948 UTC job-manager.debug[0]: jobtap plugin .history registered method job-manager.history.get
Jul 05 00:41:31.904985 UTC job-manager.debug[0]: jobtap plugin .post-event registered method job-manager.post-event.post
Jul 05 00:41:31.905055 UTC job-manager.debug[0]: housekeeping is not configured
Jul 05 00:41:31.905634 UTC job-manager.info[0]: restart: 0 jobs
Jul 05 00:41:31.905648 UTC job-manager.info[0]: restart: 0 running jobs
Jul 05 00:41:31.905830 UTC job-manager.info[0]: restart: checkpoint.job-manager not found
Jul 05 00:41:31.905852 UTC job-manager.debug[0]: restart: max_jobid=f1
Jul 05 00:41:31.926970 UTC broker.debug[0]: insmod job-info
Jul 05 00:41:31.946050 UTC broker.debug[0]: insmod job-list
Jul 05 00:41:32.002345 UTC broker.debug[0]: insmod job-ingest
Jul 05 00:41:32.002877 UTC job-ingest.debug[0]: configuring validator with plugins=(null), args=(null) (enabled)
Jul 05 00:41:32.003173 UTC job-ingest.debug[0]: fluid ts=1ms
Jul 05 00:41:32.017969 UTC broker.debug[0]: insmod job-exec
Jul 05 00:41:32.036132 UTC broker.debug[0]: insmod heartbeat
Jul 05 00:41:32.040963 UTC broker.info[0]: rc1.0: running /etc/flux/rc1.d/02-cron
Jul 05 00:41:32.246169 UTC broker.info[0]: rc1.0: /etc/flux/rc1 Exited (rc=0) 0.7s
Jul 05 00:41:32.246276 UTC broker.info[0]: rc1-success: init->quorum 0.682327s
Jul 05 00:41:32.346802 UTC broker.info[0]: online: builder1-x86.buildfarm.internal (ranks 0)
Jul 05 00:41:32.346837 UTC broker.info[0]: quorum-full: quorum->run 0.100546s
Jul 05 00:41:32.711420 UTC broker.debug[0]: insmod sched-fluxion-resource
Jul 05 00:41:32.711671 UTC sched-fluxion-resource.info[0]: version 0.36.0
Jul 05 00:41:32.711747 UTC sched-fluxion-resource.debug[0]: mod_main: resource module starting
Jul 05 00:41:32.712263 UTC sched-fluxion-resource.debug[0]: resource graph datastore loaded with rv1exec reader
Jul 05 00:41:32.712279 UTC sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from core's resource.acquire
Jul 05 00:41:32.712333 UTC sched-fluxion-resource.debug[0]: resource status changed (rankset=[all] status=DOWN)
Jul 05 00:41:32.712342 UTC sched-fluxion-resource.debug[0]: resource status changed (rankset=[0] status=UP)
Jul 05 00:41:32.712347 UTC sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded
Jul 05 00:41:32.729465 UTC broker.debug[0]: insmod sched-fluxion-qmanager
Jul 05 00:41:32.729840 UTC sched-fluxion-qmanager.info[0]: version 0.36.0
Jul 05 00:41:32.729972 UTC sched-fluxion-qmanager.debug[0]: service_register
Jul 05 00:41:32.730021 UTC sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
Jul 05 00:41:32.730037 UTC sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
Jul 05 00:41:32.730045 UTC sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
Jul 05 00:41:32.730270 UTC sched-fluxion-qmanager.debug[0]: handshaking with sched-fluxion-resource completed
Jul 05 00:41:32.730378 UTC job-manager.debug[0]: scheduler: hello
Jul 05 00:41:32.730554 UTC job-manager.debug[0]: scheduler: ready unlimited
Jul 05 00:41:32.730660 UTC sched-fluxion-qmanager.debug[0]: handshaking with job-manager completed
Jul 05 00:41:33.214612 UTC broker.debug[0]: rmmod job-ingest
Jul 05 00:41:33.214914 UTC broker.debug[0]: module job-ingest exited
Jul 05 00:41:33.216003 UTC broker.debug[0]: insmod job-ingest
Jul 05 00:41:33.216306 UTC job-ingest.debug[0]: configuring validator with plugins=feasibility, args=(null) (enabled)
Jul 05 00:41:33.216506 UTC job-ingest.debug[0]: fluid ts=1ms
The tests run clean on Ruby, and it's the x86 build that's failing, so maybe I could just disable the tests in the specfile for now?
That is distressing... you can, but I would really like to know what's going on. Clearly it aborted, but I don't see anything indicating why. Especially since it didn't happen on ruby.
It happens on a request for a single GPU that doesn't specify anything else... Maybe there's a corner case in the new pre-check code that gets tickled by that, but only in some specific build mode? The buildfarm uses the RHEL hardening flags; it could be one of those catching a mistake.
@grondo, any tips for grabbing core files from the buildfarm?
In the past I had to add gdb as a BuildRequires, find the path to the core, then run a backtrace with a gdb script. Sorry, I'm out of town and can't assist until Sunday, probably.
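Something along these lines, for example (just a sketch: the find pattern and broker path are borrowed from the specfile logic above, and gdb's --batch/-ex flags stand in for the command-file approach):

# run a backtrace against every core file left in the build tree
for core in $(find . -name 'core*'); do
    echo "found corefile $core"
    gdb --batch -ex 'thread apply all bt' \
        /usr/libexec/flux/cmd/flux-broker "$core"
done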
Another idea is to install valgrind and run the failing test with FLUX_TEST_VALGRIND=t, which may give a hint?
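For example (a hypothetical invocation; the test script and -v flag are taken from the failing ctest command in the log above):

# with valgrind added as a BuildRequires, rerun just the failing test under it
(cd t && FLUX_TEST_VALGRIND=t ./t1020-qmanager-feasibility.t -v)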
That might help, or use the asan cmake preset. I'm going to try and repro by adding the RHEL hardening flags @jameshcorbett sent me the last time we were tracking an RPM issue, here for future record:
CFLAGS='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
export CFLAGS
CXXFLAGS='-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
export CXXFLAGS
Good news: with the hardening flags I'm getting the same two failures on ruby. Here's how I'm building; looking into it now:
module load gcc/12
cmake -B build \
  -DCMAKE_CXX_FLAGS="-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection" \
  -DCMAKE_C_FLAGS="-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection"
Oh great, that makes things easier.
Ok, this is a hardening flag catching something that's probably essentially always OK but isn't actually allowed. The by_subplan pruning method unconditionally does &(aggs[0]), where aggs is a vector that might be empty. The flags turned on bounds checking for cases like this, and if the vector is empty there is no element zero, so the bounds check aborted the process. The correct way to write this is aggs.data(), which just returns a null pointer in that case rather than dereferencing an invalid element inside a method and taking a reference to it afterward. I'm doing a quick grep around the code to see if this antipattern is hiding anywhere else, but the fix is trivial.
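A minimal repro of the pattern (a sketch, not the actual flux-sched code; aggs here is a stand-in for the real aggregate vector):

cat > repro.cpp <<'EOF'
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> aggs;     // empty, as in the failing case
    int *p = &(aggs[0]);       // out of bounds: aborts under _GLIBCXX_ASSERTIONS
    // int *p = aggs.data();   // safe: data() is null for an empty vector
    std::printf("p = %p\n", (void *)p);
    return 0;
}
EOF
g++ -D_GLIBCXX_ASSERTIONS -O2 repro.cpp -o repro && ./repro

With the &(aggs[0]) line this aborts at the operator[] bounds check; with the data() line it prints a null pointer and exits normally.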
As expected, with the patch from #1239 the build succeeds.
v0.36.0 failed in the buildfarm:
Same failures occurred when I re-ran the build.
For t1020, all cases from "unsatisfiable jobs are rejected" (test case 5) onwards failed. For t3034, I only see one test-case failure.
Pointers to debug?