flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

Add smart pointer support and misc. cleanup #537

Closed dongahn closed 4 years ago

dongahn commented 4 years ago

This PR adds smart pointer support using std::shared_ptr to:

This PR also adds a few other cleanups:

dongahn commented 4 years ago

@SteVwonder and @milroy: I added you as the reviewers for this.

Although this PR spans multiple files and many lines, a review should be relatively painlessly. This essentially replaces raw pointers to shared_ptr objects: mostly drop-in replacements. But note that there are a few places like "destroy" logic for resource contexts within resource_match.cpp and resource-query.cpp, which would need some good review, though.

Others are style changes, log message changes and applying a bit better practice of using explicit namespace paths within std.

dongahn commented 4 years ago

Two Travis test instances failed. Let me look at them before your reviews.

dongahn commented 4 years ago
qmanager: loading resource and qmanager modules works

expecting success:
    jobid=$(flux job submit basic.json) &&
    flux job wait-event -t 2 ${jobid} start &&
    flux job wait-event -t 2 ${jobid} finish &&
    flux job wait-event -t 2 ${jobid} release &&
    flux job wait-event -t 2 ${jobid} clean

flux-job: flux_open: Connection reset by peer
not ok 4 - qmanager: basic job runs in simulated mode

expecting success:
    jobid=$(flux jobspec srun -t 1 hostname | \
        exec_test | flux job submit) &&
    flux job wait-event -vt 2.5 ${jobid} start &&
    flux job cancel ${jobid} &&
    flux job wait-event -t 2.5 ${jobid} exception &&
    flux job wait-event -t 2.5 ${jobid} finish | grep status=15 &&
    flux job wait-event -t 2.5 ${jobid} release &&
    flux job wait-event -t 2.5 ${jobid} clean &&
    exec_eventlog $jobid | grep "complete" | grep "\"status\":15"

flux-job: flux_open: No such file or directory
not ok 5 - qmanager: canceling job during execution works

flux-job: flux_open: No such file or directory

This probably means the flux instance crashed...?

dongahn commented 4 years ago

$ src/test/backtrace-all.sh

This doesn't report anything though...

grondo commented 4 years ago

Yeah, unfortunately it has been a long standing problem that Travis can't generate corefiles in docker images.

dongahn commented 4 years ago

@grondo: ah...

Thankfully I just reproduced this on my docker using clang-6.0!

grondo commented 4 years ago

Whew, otherwise the approach is typically a series of printf debugging via pushes to get a Travis build. Not fun!

codecov-io commented 4 years ago

Codecov Report

Merging #537 into master will decrease coverage by 0.44%. The diff coverage is 68.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
- Coverage   76.03%   75.59%   -0.45%     
==========================================
  Files          69       69              
  Lines        6481     6515      +34     
==========================================
- Hits         4928     4925       -3     
- Misses       1553     1590      +37
Impacted Files Coverage Δ
resource/policies/dfu_match_policy_factory.hpp 100% <ø> (ø) :arrow_up:
resource/readers/resource_reader_hwloc.hpp 100% <ø> (ø) :arrow_up:
resource/readers/resource_reader_grug.hpp 100% <ø> (ø) :arrow_up:
resource/writers/match_writers.hpp 100% <ø> (ø) :arrow_up:
resource/utilities/command.hpp 100% <ø> (ø) :arrow_up:
resource/libjobspec/jobspec.hpp 100% <ø> (ø) :arrow_up:
resource/store/resource_graph_store.hpp 100% <ø> (ø) :arrow_up:
resource/readers/resource_reader_jgf.hpp 100% <ø> (ø) :arrow_up:
resource/readers/resource_reader_base.hpp 100% <ø> (ø) :arrow_up:
resource/traversers/dfu_impl.hpp 100% <ø> (ø) :arrow_up:
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8342b7d...fd9207a. Read the comment docs.

dongahn commented 4 years ago

Ugg... Docker hub service availability issue?

Building image bionic for user travis 2000 group=20000.50s
482Sending build context to Docker daemon   2.56kB
483Step 1/9 : FROM fluxrm/flux-core:bionic
484received unexpected HTTP status: 503 Service Unavailable
485docker-run-checks.sh: docker build failed
grondo commented 4 years ago

Ugg... Docker hub service availability issue?

Looks that way. Just try restarting the builder...

dongahn commented 4 years ago

@grondo: OK. will restart.

FYI -- Seems the issue was that shared_ptr C++ smart pointers don't directly translate to C-style void *-based callbacks.

You can't directly pass a shared_ptr object to void * as if it is a raw pointer.

You can take the address of a shared_ptr object and pass this raw pointer into void * so that you can later retrieve and cast it back to share_ptr on each callback invocation. But the caveat is, you have to make sure that this shared_ptr object must be valid all the time. I was passing a temporary shared_ptr object into flux_aux_set which then led to an undefined behavior.

I moved flux_aux_set from getctx to mod_main and passed its resource context shared_ptr into this API. This should be ok as each object in mod_main's stack frame should be valid as far as the resource module is active.

We'll see if Travis likes it this time.

dongahn commented 4 years ago

This should be good to review now. Code coverage failures should be ignored. I glanced at them and the coverage slightly decreased because I add more lines in the exceptional conditions as part of code cleanup and use of C++ exceptions...

dongahn commented 4 years ago

FYI -- Seems the issue was that shared_ptr C++ smart pointers don't directly translate to C-style void *-based callbacks.

@trws and I discussed this a bit after our meeting. This is a generic problem for C++ code interfacing with C callbacks in flux-core. I will create at a ticket so that @trws can propose something to solve this generally. The current code should serve its purpose in the meantime.

SteVwonder commented 4 years ago

@trws and I discussed this a bit after our meeting. This is a generic problem for C++ code interfacing with C callbacks in flux-core. I will create at a ticket so that @trws can propose something to solve this generally. The current code should serve its purpose in the meantime.

@dongahn : at the end of #538, you mentioned that you would consider @trws' suggestion with a fresh set of eyes. Do you plan to add that suggestion to this PR? Or in a separate PR? Just wondering if I should review this as-is or wait for more updates.

dongahn commented 4 years ago

@SteVwonder: Thank. I thought about this and it would be best to not to add shared pointer for the context piece. Because the changes will be substantial, it would be best to close this and do another PR.

@milroy: I will look at your review and address that in a new PR as well.

dongahn commented 4 years ago

@SteVwonder: Thank. I thought about this and it would be best to not to add shared pointer for the context piece. Because the changes will be substantial, it would be best to close this and do another PR.

Unfortunately, backing this up will be too disruptive for this PR and my other PR (#543). My proposal is to update this PR with minor changes requested by @milroy and merge it in so that we can quickly get to #541

C/C++ compatibility will be handled in a more robust way with @trws' PR #542.

dongahn commented 4 years ago

@milroy: I add two additional commits that address your review points. If you are okay with it and @SteVwonder has no other comments, this can go in after squashing these new commits. Let me know.

milroy commented 4 years ago

@milroy: I add two additional commits that address your review points. If you are okay with it and @SteVwonder has no other comments, this can go in after squashing these new commits. Let me know.

@dongahn I've looked over your responses and this PR is ready to merge as far as I'm concerned.

dongahn commented 4 years ago

@dongahn I've looked over your responses and this PR is ready to merge as far as I'm concerned.

Sounds good. I will squash those two latest commits and I think this can go in.

dongahn commented 4 years ago

I will squash those two latest commits and I think this can go in.

@milroy: those commits are squashed. Once Travis CI is happy, this can be merged.

SteVwonder commented 4 years ago

Merging since @milroy and I have both reviewed and approved (annoyingly I cannot click the “approve” button on github mobile).

Thanks @dongahn!

dongahn commented 4 years ago

Thank you!

SteVwonder commented 4 years ago

I lied. Apparently I cannot merge a PR on mobile when there is a failing test (coveralls, which we said was ok).

If no one else beats me to it, I’ll click the button tomorrow.

SteVwonder commented 4 years ago

Thanks @trws and @dongahn!