GetLayout() ambiguous while parsing tensor layout if a dimension is 1

carlushuang commented 3 years ago

As in https://github.com/ROCmSoftwarePlatform/MIOpen/pull/765, function TensorDescriptor::GetLayout() will produce wrong result if a single dimension is 1.

e.g. if a depthwise weight length is [63, 1, 3, 3], and we want to calculate the stride of NHWC layout, which is [9, 1, 3, 1]. The output of GetLayout() will be NHCW which is not wanted. This is still true in 3d weight [32,1,3,5,20], the stride of NDHWC will be [300,1,100,20,1] , which will produce wrong layout string as NDHCW.

~However, under this case, the stride of either NDHCW or NDHWC layout is indeed [300,1,100,20,1], the C and W dimension seems exchangeable. So in such case (C=1) if we only look at the stride, it will have ambiguous meaning, not only one layout is valid.~

stride of [9, 1, 3, 1] and length of [63, 1, 3, 3] should be NHWC layout not NHCW layout. And again, if we have multiple dimension of 1, will lead to ambiguous layout result

However the ambiguous case would be very common if in 1x1 filter case.

e.g. wei length: [32, 3, 1, 1]

we want NHWC -> stride [3, 1, 3, 3] -> valid layout could be NHWC/NWHC (2 ambiguous)
we want CHWN -> stride [32, 32, 32, 1] -> valid layout would be CHWN/CWHN (2 ambiguous)

wei length: [32, 1, 1, 1] (depth wise 1x1 filter)

we want NHWC -> stride [1, 1, 1, 1] -> valid layout would be NCHW/NCWH/NHWC/NHCW... (6 ambiguous)
we want CNHW -> stride [1, 32, 1, 1] -> valid layout would be CNHW/CNWH (2 ambiguous)

cc @jerryyin @pfultz2

jerryyin commented 3 years ago

It's unfortunate that we didn't realize this in #343.

Honestly I don't see a way out when we have (any) one dimension as 1. When that case happens, to derive layout from strides is not possible, because at least two strides will be the same. It looks to me the only way out is to revert #343 to adopt a new API for accepting string as layout, and store it as-is. Let me know what your thoughts are.

pfultz2 commented 3 years ago

Instead, we could have a function which checks if it is in the layout(ie IsLayout("NCHW", "NHWC")) rather or we could return a vector of different layout strings, and then pick the best one for the solver.

Do we have different solvers for different layouts?

It looks to me the only way out is to revert #343 to adopt a new API for accepting string as layout, and store it as-is. Let me know what your thoughts are.

I dont think we should make our API ambiguous, or put more burden on the user to figure out which layout the tensor is. MIOpen knows which kernels it can run with which layouts so it has the best insights to figure out which combinations of kernels matches with which layout.(User-level applications do not have that insight).

OneDNN, for example, only takes layout using strides, so it possible to pick the best algorithm based on strides-only.

jerryyin commented 3 years ago

we could have a function which checks if it is in the layout(ie IsLayout("NCHW", "NHWC"))

This is a good idea for Carlus example, but what about the case such as C=H=W=1, then strides for both NCHW and NHWC = [1, 1, 1, 1] (same situation for input, filter and output, so it is hard to tell.). At this case, how do we differentiate between the two?

Do we have different solvers for different layouts?

We are about to. Going forward, we might have solver only support NCHW, only support NHWC or support both.

With igemm kernels, we have the capability to support mixed layout as well: for example, input NCHW filter CHWN output NHWC

I dont think we should make our API ambiguous, or put more burden on the user to figure out which layout the tensor is.

I agree, I think it is more elegant to put layout information in strides too. But I just don't see how to derive the layout correctly for certain extreme cases.

pfultz2 commented 3 years ago

At this case, how do we differentiate between the two?

We can treat it as whichever layout we want as long it will pick the fastest kernel for it.

We are about to. Going forward, we might have solver only support NCHW, only support NHWC or support both.

But if the input tensor can be interpreted as either NCHW or NHWC, then how do we pick the fastest algorithm? We would need this as a some kind of tuning parameter if its in the same solver. As seperate solvers, the find command will figure out which is faster, but that may lead to too many solvers.

With igemm kernels, we have the capability to support mixed layout as well: for example, input NCHW filter CHWN output NHWC

But will igemm treat NCHW as NHWC when its faster? Does igemm take the layout as strides? It should. If the output tensor is different it shouldn't change the algorithm, its just a matter of computing the strides(because writes are never "cached" like inputs are).

Because even if the user passes the tensor intentionally as NCHW, if its possible to run it as NHWC(and its faster) then we should interpret it as NHWC instead. Adding a layout string to the tensor would prevent this kind of optimization.

I agree, I think it is more elegant to put layout information in strides too. But I just don't see how to derive the layout correctly for certain extreme cases.

Well all of our kernels that will handle multiple layouts at once should be built generically so it can handle different strides without needing to know if its NHWC or NCHW.

jerryyin commented 3 years ago

We can treat it as whichever layout we want as long it will pick the fastest kernel for it. Well all of our kernels that will handle multiple layouts at once should be built generically so it can handle different strides without needing to know if its NHWC or NCHW.

Thanks for reminding me on this. When a stride lead to arbitrary layouts, it does indeed not matter which of the layout it get interpreted as. Implicit gemm algorithm does rely on stride on do its coordinate transforms. It should make no difference which layout we treated it as.

But if the input tensor can be interpreted as either NCHW or NHWC, then how do we pick the fastest algorithm?

This is a good one. Though I don't see much problem configs that can actually fall into this category. I guess we can default to NCHW for now because we have tuning parameters for it. In the future, when we do tuning based on a specific set of layout, the runtime information should exist on both applicable layout.

its just a matter of computing the strides

I think so, but just double ping @asroy here.

atamazov commented 3 years ago

I anticipated that problem with 1's.

jerryyin commented 3 years ago

So according to discussion above, I propose the following fix:

ComputeLayout() and GetLayout() both return a vector of strings. Each string represent a potential layout for the data
When problem config needs to serialize the input/output/weight layout, prioritize in the follow sequence:
- Highest priority: If any tensor has only one layout candidate, use that layout candidate eg: Input is NCHW, weight is NCHW or NHWC; then use NCHW for weight.
- NCHW (because we have largest number of solvers)
- NHWC (because we are about to add support for solvers to be able to handle this)
- The first entry, if not any of the above

If we are on the same page with this, I will proceed with this fix. @carlushuang, please let me know if this solution can help fix the issue you see.

carlushuang commented 3 years ago

Hi @jerryyin @pfultz2 I just realize that my analysis is not correct. At least if one dimension is 1, should have no ambiguous of GetLayout. I updated the description of this issue. If multiple dimension is 1 then would have ambiguous layout.

I guess still the problem is just GetLayout() will not compute correctly if one dimension is 1.

jerryyin commented 3 years ago

@carlushuang

if a depthwise weight length is [63, 1, 3, 3], and we want to calculate the stride of NHWC layout, which is [9, 1, 3, 1]. The output of GetLayout() will be NHCW which is not wanted.

This is a bit confusing and I'm not sure everyone get the same message. Let me rephrase your example to make sure everyone is in the same page:

If a depthwise weight length is [63=N, 1=C, 3=H, 3=W], we know the following:

for NHWC layout, the strides for each dimension: N: 9, C: 1, H: 3, W: 1
for NHCW layout, the strides for each dimension: N: 9, C: 3, H: 3, W: 1 This indicate that the two layout will yield on a different access pattern if we are to grab the data accordingly. Note that C dimension has different strides if data is interpreted as different layout.

Now, given a set of strides without telling its layout: [9, 1, 3, 1], the derived potential layout can be either: NHWC or NHCW. In either case, the strides for C and W are both 1. If we do grab data according to NHCW layout as an example, and treat C's stride as 1, we will end up grabbing the wrong values because C's stride should be 3. This is one example where just using the stride information give us wrong results. Returning a potential list of layout as candidates does not fix it.

Considering the layout for filter may run into it most. We can ignore the filter layout and just use input and output's layout as filter's (only when input and output have same layout in the potential layout list). It is less likely for input/output dimension to be 1, and even less likely/impossible for both input and output layout to derive the wrong result.

Let me give an example, for a conv of: NCHW (64 x 3 x 32 x 32) x NCHW (32 x 3 x 1 x 1) = NCHW (64 x 32 x 32 x 32)

1x1 filter strides is common and can yield multiple interpretations
However, input and output are non-ambiguous. By using the same layout as input and output, we can derive the right layout for filter as well.

carlushuang commented 3 years ago

If a depthwise weight length is [63=N, 1=C, 3=H, 3=W], we know the following: for NHWC layout, the strides for each dimension: N: 9, C: 1, H: 3, W: 1

@jerryyin yes, your explain is more clear. and this case it should return NHWC but know it return NHCW I'm looking at the code of GetLayout() to see why it will return a NHCW layout information. At least single dimension is 1 should be OK.

My first glace is that it only deal with strides member. if we parse [9, 1, 3, 1] into sort_permutation(), it will return result [0, 2, 1, 3] (aka NHCW) which is not correct. std::sort may not suitable to deal with 2 element with the same value.


    template <class Vector, class Op>
    static inline std::vector<int64_t> sort_permutation(const Vector& data, Op op)
    {
        std::vector<std::int64_t> result(data.size());
        std::iota(result.begin(), result.end(), 0);
        std::sort(
            result.begin(), result.end(), [&](auto x, auto y) { return op(data[x], data[y]); });
        return result;
    }

    std::string GetLayout(std::string labels) const
    {
        if(labels.size() != strides.size())
        {
            MIOPEN_THROW(
                "Invalid labels size. Layout labels size must be equavalent to stride size");
        }
        // Copy construct the result string from labels. This allocates the space at one go
        // and is faster than calling push_back in transform.
        auto result = labels;
        auto p      = sort_permutation(strides, std::greater<>{});
        std::transform(p.begin(), p.end(), result.begin(), [&](auto i) { return labels[i]; });
        return result;
    }
 private:
    std::vector<std::size_t> lens;
    std::vector<std::size_t> strides;

carlushuang commented 3 years ago

With an observation that there would be 2 item in stride vector with the same value ( e.g. [63, 1, 3, 3] weight will have a NHWC stride: [9, 1, 3, 1]), we are likely to swap the index of same stride value after std::sort() to keep layout correct.

So an ugly but workable WA of sort_permutation() for a single dimension is 1 case could be:

    template <class Vector, class Op>
    static inline std::vector<int64_t> sort_permutation(const Vector& lens, const Vector& strides, Op op)
    {
        std::vector<std::int64_t> result(strides.size());
        std::iota(result.begin(), result.end(), 0);
        std::sort(
            result.begin(), result.end(), [&](auto x, auto y) { return op(strides[x], strides[y]); });

        auto p = std::find(lens.begin(), lens.end(), 1);    // find first occurance of dim 1, aka p
        if(p != lens.end()){
            size_t index_p = p - lens.begin();
            for(auto q =  strides.begin(); q != strides.end(); q ++){
                size_t index_q = q - strides.begin();
                if(*q == strides[index_p] && index_q != index_p){
                    // now we found the stride with the same value as dim 1, aka q
                    // hence conditionally swap p & q in their *result* array
                    auto rp = std::find(result.begin(), result.end(), index_p);
                    auto rq = std::find(result.begin(), result.end(), index_q);
                    if(rp < rq)
                        std::iter_swap( rp, rq);
                    break;
                }
            }
        }
        return result;
    }

This can fix the problem in https://github.com/ROCmSoftwarePlatform/MIOpen/pull/765 And for 1x1 filter, the H and W will keep unchanged, as long as we never have layout like ?WH?/??WH Howevery like previous said, if there are 2 and more dimensions is 1, it have no 1<->1 mapping of stride<->layout, hence I think unless provide more information, the sort_permute() approach is not workable. Need your feedback @jerryyin @pfultz2 @atamazov

jerryyin commented 3 years ago

@carlushuang Thanks for brainstorming the issue here. I think we are facing a dilemma in which:

If we use new APIs to pass additional layout information, we can clear this issue once and for all.
- However, this is behavior change to client
- This is partly redundant information compared to what strides can provide
Instead if we use the existing API to resolve the issues we see, we may have to use one assumption or another.
- The assumption we make now may jeopardize the long term goal of being able to execute truly flexible layouts.

I have scheduled a meeting including @whchung and @pfultz2 and hopefully we can arrive at a consensus in that meeting. I don't think the time of meeting is friendly for you to attend, but will include you anyways.

atamazov commented 3 years ago

Public API change doesn't seem acceptable, but I could be wrong.

However we can change only internal APIs to ones that do not allow ambiguities. When ambiguous tensor descriptors arrive from the public API, we can resolve them to internal, unambiguous values. Such a removal of ambiguities will most likely be based on some heuristic.

I think at this point the heuristic should prefer NCHW, then NHWC (in that order), and then don't care.

Note that in case of convolutions, the heuristics need to "know" all three tensor descriptors. Otherwise we can run into cases where different layouts unnecessarily mixed, like x.NCHW, y.NHWC, w.NHWC.

jerryyin commented 3 years ago

WIP today.

@carlushuang clarification:

wei length: [32, 3, 1, 1] stride [3, 1, 3, 3] -> valid layout could be NHWC/NWHC (2 ambiguous)

The actual ambiguous situations include: Permutationof{NHW} + C = 6 cases in total: NHWC, NWHC, WNHC, WHNC, HNWC, HWNC. The ones you have NHWC/NWHC are the ones that are coherent with the lens.

@atamazov clarification:

I gave #765 a review and find out that some use case that @carlushuang have in the PR does not have the knowledge of all three tensors. For example, in the ctest, to construct the output tensor from input and filter tensor, you at most have the layout of 2 tensors, not all 3 (In the actual condition he only used the input layout). Given this situation, I think making the heuristic derive the layout from single tensor will make the use case most flexible. I'm half way on the following algorithm:

Do a depth first search to generate all possible layout
Examine every possible layout by checking if dimension generated strides can match the source stride. This way we can eliminate the "wrong" ambiguous results. Refer to the example above, only 2 out of 6 are valid cases.
Pick "NCHW" if it exists in possible results. Otherwise, pick "NHWC". Or else, pick the first possible layout. In the above example, NHWC will be picked because it has higher priority over the 2 valid cases.

carlushuang commented 3 years ago

get_output_tensor @jerryyin indeed in ctest, if not specify output layout by --out_layout, then just use the same as input layout, other wise just use the layout from --out_layout. And yes the algorithm you described should workable for #765

jerryyin commented 3 years ago

@atamazov @pfultz2 It turns out that the best place for me to derive the layout is in tensor descriptor construction time. Then the tensor descriptor can cache the result for later use.

I know that Paul has suggested we donot store layout in tensor descriptor itself but deriving the layout every time in the get function can be very expensive when it is an ambiguous layout. Therefore the need to cache it.

Please do a quick review on this solution. Since @carlushuang's PR will be based on mine, I want to avoid last minute surprises.

atamazov commented 3 years ago

@jerryyin What will happen if input tensor and weighs have both NCHW and NHWC among valid layouts, but output tensor has only NHWC?

I recommend computing layouts during construction of ProblemDescription (and storing them in it). All three tensors are available and some additional heuristic may be implemented (so the case above could be resolved to all-NHWC instead of NCHW-NCHW-NHWC, if necessary).

jerryyin commented 3 years ago

@atamazov From a stride stand of point:

Stride of NCHW: [CHW, HW, W, 1]
Stride of NHWC: [HWC, 1, WC, C]

The two can only be equivalent when H==W==C==1. It is not possible for input and weight to be both NCHW and NHWC unless they are both [N, 1, 1, 1]. On the other hand, if they are [N, 1, 1, 1], they will be treated as NCHW in the current heuristic, and we are good.

I agree that addressing it in problem description is simpler overall. But the problem with knowing all three tensors to calculate the result being that we have to design more than one heuristic. And when the derived layout are different between different heuristic, we have a problem again:

What if we only know about layout of one tensor (which is ambiguous) and use it to derive the other layout. @carlushuang has this use case. In ctest, we don't have all tensors available.
What if two ...

An equivalent question would be to @carlushuang: If everyone else has a strong enough point of view not to store layout in the tensor itself, but to decide it in problem description. Would you be able to deterministically determine the ambiguous layout in your ctest implementation?

carlushuang commented 3 years ago

@jerryyin the ctest use the same approach as other places of MIOpen to get the layout information, currently is just GetLayout() function, this is I think a internal API implementation. So if ctest can't determine ambiguous layout, so do other places in MIOpen would fail to determine.

jerryyin commented 3 years ago

Summary of the ticket

Between me, @pfultz2 and @atamazov, we worked through a number of angles to address this problem. This issue is now pending approval with https://github.com/ROCmSoftwarePlatform/MIOpen/pull/765

Derive ambiguous layout for single tensor

We decided to address this with sorting the pair of strides and lens, implemented in commit 3ead3a3.

Before this we have a few premature attempts including 2b80369, and a subroutine from @pfultz2 to calculate minimal adjacent distance. The discussion make us realize the need to include lens to figure our correct ambiguous layouts.

Derive ambiguous layout for multiple tensors

We ended deciding to implement d19f0b5, originally suggested by @pfultz2 and @atamazov. The heuristic set all tensors to coherent layout if it find out any.

jerryyin commented 3 years ago

Closing the ticket as #765 merged.

atamazov commented 3 years ago

This ticket has a lot of reasoning/rationale related to design that may be handy in the future. Let's keep it open, as a part of our internal documentation.

ROCm / MIOpen