csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Output stride order #2548

Closed jjsjann123 closed 1 year ago

jjsjann123 commented 1 year ago

Added a new Python API, fd.ops.add_output(tensor, stride_order), where stride_order[i] gives the position of output axis i in the memory layout. Going by the example below, position 0 is the outermost (largest-stride) dimension and the last position is the innermost (stride 1) one.

For example, to get an output in channels-last format, we specify fd.ops.add_output(tensor_view, [0, 3, 1, 2]), so that an output with shape [N, C, H, W] has strides [H*W*C, 1, W*C, C].
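
To make the mapping from stride_order to strides concrete, here is a minimal pure-Python sketch. It is not nvfuser code: the helper name strides_from_stride_order is hypothetical. It assumes the reading implied by the channels-last example above, namely that stride_order[i] is the memory position of axis i counted from the outermost dimension.

```python
def strides_from_stride_order(shape, stride_order):
    """Return strides (in elements) for `shape`.

    Assumption: stride_order[i] is the memory position of logical axis i,
    where 0 is the outermost dimension and len(shape) - 1 the innermost.
    """
    rank = len(shape)
    if sorted(stride_order) != list(range(rank)):
        raise ValueError("stride_order must be a permutation of range(len(shape))")

    # axis_at[p] is the logical axis stored at memory position p.
    axis_at = [0] * rank
    for axis, pos in enumerate(stride_order):
        axis_at[pos] = axis

    # Walk from the innermost position outwards, accumulating extents.
    strides = [0] * rank
    running = 1
    for pos in reversed(range(rank)):
        axis = axis_at[pos]
        strides[axis] = running
        running *= shape[axis]
    return strides


N, C, H, W = 2, 3, 4, 5
channels_last = strides_from_stride_order([N, C, H, W], [0, 3, 1, 2])
row_major = strides_from_stride_order([N, C, H, W], [0, 1, 2, 3])
```

With these sizes, channels_last equals [H*W*C, 1, W*C, C], matching the strides quoted above, and the identity order gives the usual contiguous strides.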

Implementation details: this is currently done in a naive way. Since nvfuser doesn't support user-specified stride order yet, we fake it by:

  1. adding a permute op on the outputs inside the generated kernel, so that the output is stored in the desired memory layout;
  2. permuting the corresponding output again after the kernel has executed, to undo the permutation done inside the kernel. This gives us the semantically correct output in the desired memory layout.
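
The two steps above can be sketched at the level of shape and stride metadata only. This is an illustrative model, not the nvfuser implementation: the helpers contiguous_strides, permute_view and emulate_stride_order are hypothetical names. It again assumes stride_order[i] is the memory position of axis i, counted from the outermost dimension, as in the channels-last example.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides, running = [], 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return strides[::-1]


def permute_view(shape, strides, dims):
    """Reorder the axes of a view without moving data, like a tensor permute."""
    return [shape[d] for d in dims], [strides[d] for d in dims]


def emulate_stride_order(shape, stride_order):
    rank = len(shape)
    # Step 1 (inside the kernel): permute so that memory position p holds the
    # logical axis whose stride_order is p, then store the result contiguously.
    to_memory = [stride_order.index(p) for p in range(rank)]
    kernel_shape = [shape[axis] for axis in to_memory]
    kernel_strides = contiguous_strides(kernel_shape)
    # Step 2 (after the kernel): undo the permutation on the view only.
    # Logical axis i sits at memory position stride_order[i].
    return permute_view(kernel_shape, kernel_strides, stride_order)


out_shape, out_strides = emulate_stride_order([2, 3, 4, 5], [0, 3, 1, 2])
```

In this model the kernel writes an [N, H, W, C] contiguous buffer, and the second permute yields a view with the logical shape [N, C, H, W] and channels-last strides, without copying any data.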

jjsjann123 commented 1 year ago

Review comments have been addressed. I'm running CI locally and will merge the PR afterwards.

jjsjann123 commented 1 year ago

I am seeing CI failures, but I don't think they are related to this change. I'm merging this one.

csarofeen commented 9 months ago

@zasdfgbnm @jacobhinkle could we revisit this approach now that we have allocation domains?

csarofeen commented 9 months ago

Warning: this is the csarofeen/pytorch repo.