cornelisnetworks / opa-psm2

Other
36 stars 29 forks source link

Fix uninitialized cuda_hostbuf_used in ips_proto_mq_{send,isend} #37

Closed hahnjo closed 4 years ago

hahnjo commented 5 years ago

The value is read in ips_proto_mq_eager_complete to determine if prefetched buffers from sendreq_prefetch must be reclaimed. If cuda_hostbuf_used is uninitialized, the value might be different from 0 which leads to crashes if sendreq_prefetch hasn't been set up correctly.

From what I can see there are 4 possible call stacks that lead to that function: 1&2) Via ips_ptl_mq_eager which is called from both ips_protomq{send,isend} if len <= mq->hfi_thresh_rv. Note that this doesn't use prefetch buffers, so I think it's correct to set cuda_hostbuf_used = 0. 3) In ips_proto_mq_isend if len <= flow->frag_size. Again this doesn't use prefetch buffers, so the default for the request in ips_proto_mq_isend can be cuda_hostbuf_used = 0. 4) ips_proto_mq_push_rts_data which is involved if len > mq->hfi_thresh_rv. In that case ips_ptl_mq_rndv correctly sets cuda_hostbuf_used = {0,1}.

hahnjo commented 5 years ago

Note that I'm not sure if ips_proto_mq_eager_complete is ever called from ips_ptl_mq_rndv / ips_proto_mq_push_rts_data: I've tried to understand the code structure and trigger that code path, but haven't been successful. If that can never happen, the better solution would be to remove the PSM_CUDA from ips_proto_mq_eager_complete which just avoids the problem.

mwheinz commented 4 years ago

Sorry for ignoring this for so long; I was on sabbatical and didn't see it till today. We are reviewing your change and will look at pulling it ASAP.

mwheinz commented 4 years ago

Will be included in the next PSM2 release.

hahnjo commented 4 years ago

As far as I can see, this fix is still not in master. Do you actually for people debugging and fixing your stuff?!?

mwheinz commented 4 years ago

We haven't actually done another release since I said we would include it in the next release...

hahnjo commented 4 years ago

We haven't actually done another release since I said we would include it in the next release...

So what's a "release"? As a contributor I only see two commits in master that claim to sync with a release.

mwheinz commented 4 years ago

I see 48 releases on https://github.com/intel/opa-psm2/releases

hahnjo commented 4 years ago

I see 48 releases on https://github.com/intel/opa-psm2/releases

And https://github.com/intel/opa-psm2/releases/tag/IFS_RELEASE_10_10_1_0_36 is from 11 days ago and doesn't have the fix.

mwheinz commented 4 years ago

Hahnjo, this isn't the best place for this conversation. Please email me @ michael.william.heinz@intel.com so we can talk this through.