Closed jjsjann123 closed 1 year ago
Review comments have been addressed. I'm running CI locally and will merge the PR afterwards.
I am seeing failing CI jobs, but I don't think they are related to this change. I'm merging this one.
@zasdfgbnm @jacobhinkle could we revisit this approach now that we have allocation domains?
Warning: this is the csarofeen/pytorch repo.
Added a new Python API `fd.ops.add_output(tensor, stride_order)`, where `stride_order` means that output axis `i` is the `stride_order[i]`th fastest dimension.

E.g., if we want to specify the output to be in channels-last format, we should specify `fd.ops.add_output(tensor_view, [0, 3, 1, 2])`, where a given output with shape `[N, C, H, W]` will have stride `[H*W*C, 1, W*C, C]`.
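To make the stride arithmetic concrete, here is a minimal pure-Python sketch that derives the strides from a shape and a `stride_order`. It assumes the convention implied by the channels-last example: `stride_order[i]` is the position of axis `i` when axes are listed from slowest-varying (0) to fastest-varying (last). The helper name `strides_from_stride_order` is hypothetical and is not part of nvfuser; it only illustrates the mapping and does not reflect how the PR implements it.

```python
def strides_from_stride_order(shape, stride_order):
    """Hypothetical helper (not an nvfuser API).

    Assumed convention, inferred from the channels-last example:
    stride_order[i] is the position of logical axis i in memory order,
    counted from slowest-varying (0) to fastest-varying (last).
    """
    rank = len(shape)
    if sorted(stride_order) != list(range(rank)):
        raise ValueError("stride_order must be a permutation of range(len(shape))")

    # Invert the permutation: axis_at[p] is the logical axis stored at position p.
    axis_at = [0] * rank
    for axis, position in enumerate(stride_order):
        axis_at[position] = axis

    # Walk from the fastest position to the slowest, accumulating extents.
    strides = [0] * rank
    running = 1
    for position in reversed(range(rank)):
        axis = axis_at[position]
        strides[axis] = running
        running *= shape[axis]
    return strides


# Channels-last example with N=2, C=3, H=4, W=5.
N, C, H, W = 2, 3, 4, 5
channels_last = strides_from_stride_order([N, C, H, W], [0, 3, 1, 2])
print(channels_last)
```

With these sizes the result equals `[H*W*C, 1, W*C, C]`, and the identity order `[0, 1, 2, 3]` gives the usual contiguous strides `[C*H*W, H*W, W, 1]`.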
Implementation details: this is currently done in a naive way. Since nvfuser doesn't support user-specified stride order yet, we fake it by: