Closed: yzh119 closed this 1 month ago
We observed that the multiple `cudaMemcpyAsync` calls still incur non-negligible overhead in the `BeginForward` functions. This PR accelerates `BeginForward` by issuing a `cudaMemcpyAsync` that copies all of the page data structures.

cc @tqchen
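
For illustration only, here is a minimal host-side sketch of the idea of replacing several small copies with one copy of a packed buffer. It is not the code from this PR. The names `PackedLayout`, `align_up`, and `pack_page_tables`, the three array names, and the 16-byte alignment are all hypothetical. The sketch stops at building the packed staging buffer; the device transfer is only described in a comment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout: byte offsets of each array inside one packed buffer.
struct PackedLayout {
  std::size_t indptr_off;
  std::size_t indices_off;
  std::size_t last_len_off;
  std::size_t total_bytes;
};

// Round n up to a multiple of `align` (align must be a power of two).
inline std::size_t align_up(std::size_t n, std::size_t align = 16) {
  return (n + align - 1) & ~(align - 1);
}

// Pack three host arrays back to back into `staging` and return their offsets.
// The array names are assumptions chosen for this example.
inline PackedLayout pack_page_tables(const std::vector<int32_t>& indptr,
                                     const std::vector<int32_t>& indices,
                                     const std::vector<int32_t>& last_page_len,
                                     std::vector<unsigned char>& staging) {
  const std::size_t elem = sizeof(int32_t);
  PackedLayout l{};
  l.indptr_off = 0;
  l.indices_off = align_up(l.indptr_off + indptr.size() * elem);
  l.last_len_off = align_up(l.indices_off + indices.size() * elem);
  l.total_bytes = align_up(l.last_len_off + last_page_len.size() * elem);

  staging.assign(l.total_bytes, 0);

  // Copy one array into the staging buffer at the given byte offset.
  auto put = [&](std::size_t off, const std::vector<int32_t>& v) {
    if (!v.empty()) {
      std::memcpy(staging.data() + off, v.data(), v.size() * elem);
    }
  };
  put(l.indptr_off, indptr);
  put(l.indices_off, indices);
  put(l.last_len_off, last_page_len);

  // With CUDA, a single call such as
  //   cudaMemcpyAsync(d_buf, staging.data(), l.total_bytes,
  //                   cudaMemcpyHostToDevice, stream);
  // could then transfer everything at once (d_buf and stream are assumed
  // to exist); each device array would start at d_buf + its offset.
  return l;
}
```

The offsets are aligned so that each sub-array starts on a 16-byte boundary inside the packed buffer. That alignment is a choice made for this sketch, not something stated in the PR.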