Output stride order

jjsjann123 commented 1 year ago

Added a new Python API, fd.ops.add_output(tensor, stride_order), where stride_order means that output axis i is the stride_order[i]-th fastest dimension.

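For example, to request an output in channels-last format, call fd.ops.add_output(tensor_view, [0, 3, 1, 2]). An output with shape [N, C, H, W] will then have stride [H*W*C, 1, W*C, C]. In this example the axis with the largest stride_order value (C, with 3) gets stride 1 and the axis with value 0 (N) gets the largest stride.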
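To make the example concrete, here is a minimal sketch in plain Python that derives the strides from a shape and a stride_order. It does not call nvfuser; the helper name strides_from_stride_order is hypothetical, and the convention it uses (a larger stride_order value means a faster-varying axis) is an assumption inferred from the [0, 3, 1, 2] example above rather than taken from the implementation.

```python
def strides_from_stride_order(shape, stride_order):
    """Return element strides for `shape` under the given `stride_order`.

    Assumed convention (inferred from the channels-last example):
    stride_order[i] is the position of axis i in memory, counted from the
    outermost (slowest) dimension, so the largest value gets stride 1.
    """
    rank = len(shape)
    if sorted(stride_order) != list(range(rank)):
        raise ValueError("stride_order must be a permutation of range(rank)")

    strides = [0] * rank
    running = 1
    # Visit axes from fastest-varying to slowest-varying, accumulating extents.
    for axis in sorted(range(rank), key=lambda a: stride_order[a], reverse=True):
        strides[axis] = running
        running *= shape[axis]
    return strides


# Shape [N, C, H, W] = [2, 3, 4, 5] with the channels-last order from the text.
channels_last = strides_from_stride_order([2, 3, 4, 5], [0, 3, 1, 2])
print(channels_last)  # [H*W*C, 1, W*C, C] -> [60, 1, 15, 3]
```

With the identity order [0, 1, 2, 3], the same helper returns the usual contiguous strides [C*H*W, H*W, W, 1].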

Implementation details: this is currently done in a naive way. Since nvfuser doesn't support user-specified stride order yet, we fake it in two steps:

1. Add a permute op on the outputs inside the generated kernel, so that each output is stored in the requested memory layout.
2. After the kernel has executed, permute the corresponding output again to undo the permutation done inside the kernel. This gives us the semantically correct output in the desired memory layout.
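The two steps above can be sketched at the level of shapes and strides, without nvfuser or a GPU. The code below is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, and it assumes the same stride_order convention as the channels-last example (value 0 is the outermost axis in memory).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed buffer."""
    strides, running = [], 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return strides[::-1]


def emulate_stride_order(shape, stride_order):
    """Mimic the permute-then-unpermute trick on shape/stride metadata."""
    rank = len(shape)

    # Step 1 (inside the kernel): permute the logical axes into memory order
    # and store the result contiguously.
    to_memory = sorted(range(rank), key=lambda a: stride_order[a])
    stored_shape = [shape[a] for a in to_memory]
    stored_strides = contiguous_strides(stored_shape)

    # Step 2 (after the kernel): undo the permutation as a view. Logical axis
    # i sits at position stride_order[i] of the stored buffer, so only the
    # shape and strides are reordered; no data moves.
    out_shape = [stored_shape[p] for p in stride_order]
    out_strides = [stored_strides[p] for p in stride_order]
    return stored_shape, out_shape, out_strides


stored, logical, strides = emulate_stride_order([2, 3, 4, 5], [0, 3, 1, 2])
print(stored)   # buffer written as [N, H, W, C] -> [2, 4, 5, 3]
print(logical)  # caller still sees [N, C, H, W] -> [2, 3, 4, 5]
print(strides)  # [60, 1, 15, 3]
```

The caller sees the original logical shape, while the strides describe the permuted storage, which is the behavior the description attributes to the workaround.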

jjsjann123 commented 1 year ago

Review comments have been addressed. I'm running CI locally and will merge the PR afterwards.

jjsjann123 commented 1 year ago

I am seeing CI failures, but I don't think they are related to this change. I'm merging this one.

csarofeen commented 9 months ago

@zasdfgbnm @jacobhinkle could we revisit this approach now that we have allocation domains?

csarofeen commented 9 months ago

Warning: this is the csarofeen/pytorch repo.

csarofeen / pytorch

Output stride order #2548