Fix uninitialized cuda_hostbuf_used in ips_proto_mq_{send,isend}

cornelisnetworks / opa-psm2

Other

36 stars 29 forks source link

Fix uninitialized cuda_hostbuf_used in ips_proto_mq_{send,isend} #37

Closed hahnjo closed 4 years ago

hahnjo commented 5 years ago

The value is read in ips_proto_mq_eager_complete to determine if prefetched buffers from sendreq_prefetch must be reclaimed. If cuda_hostbuf_used is uninitialized, the value might be different from 0 which leads to crashes if sendreq_prefetch hasn't been set up correctly.

From what I can see there are 4 possible call stacks that lead to that function: 1&2) Via ips_ptl_mq_eager which is called from both ips_protomq{send,isend} if len <= mq->hfi_thresh_rv. Note that this doesn't use prefetch buffers, so I think it's correct to set cuda_hostbuf_used = 0. 3) In ips_proto_mq_isend if len <= flow->frag_size. Again this doesn't use prefetch buffers, so the default for the request in ips_proto_mq_isend can be cuda_hostbuf_used = 0. 4) ips_proto_mq_push_rts_data which is involved if len > mq->hfi_thresh_rv. In that case ips_ptl_mq_rndv correctly sets cuda_hostbuf_used = {0,1}.

hahnjo commented 5 years ago

Note that I'm not sure if ips_proto_mq_eager_complete is ever called from ips_ptl_mq_rndv / ips_proto_mq_push_rts_data: I've tried to understand the code structure and trigger that code path, but haven't been successful. If that can never happen, the better solution would be to remove the PSM_CUDA from ips_proto_mq_eager_complete which just avoids the problem.