
[RFC] Data-flow Analysis Functionality on TVM IR #4468

Closed. DKXXXL closed this 2 years ago.

DKXXXL commented 4 years ago

Problem

When developing program passes on TVM IR (the one that was once Halide IR), it is common to need all sorts of information that requires program analysis, for example live variable analysis for dead code elimination. This need becomes urgent when TVM has to issue intrinsics directly and the subsequent processing stages (for example LLVM) cannot analyze the program because of these intrinsics.

Consideration

We first start with a concrete example -- Live Variable Analysis.

Live Variable Analysis

I will take live variable analysis as an example. I will define a functor that takes an AST and the set of variables live after the AST is executed, and outputs the set of variables live before the AST is executed.

// VarSet: the live-variable set type (written simply as 'set' in the original sketch)
using VarSet = std::set<const Variable*>;

class LiveAnalysisFunctor : public ExprFunctor<VarSet(const Expr&, const VarSet&)>,
                            public StmtFunctor<VarSet(const Stmt&, const VarSet&)>;

This functor can easily be defined by induction on the AST, demonstrated in pseudo-code:

-- the function below takes an AST and the post-live-variables
LiveAnalysisFunctor (Block{stmtX,stmtY}) postAlive = 
     let intermediateAlive = LiveAnalysisFunctor stmtY postAlive
     in LiveAnalysisFunctor stmtX intermediateAlive

In other words, we propagate the live variables up from the end to the beginning, following the execution order of TVM IR.

When encountering an if-statement, we do similarly:

LiveAnalysisFunctor (IfStmt{cond,branchThen, branchElse}) postAlive = 
     let beforeThen = LiveAnalysisFunctor branchThen postAlive
          beforeElse   = LiveAnalysisFunctor branchElse postAlive
          afterConditional = Union(beforeThen, beforeElse)
     in LiveAnalysisFunctor cond afterConditional
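
For concreteness, the two cases above could look roughly like the following C++ sketch. This assumes the HalideIR node layout of the time (Block with first/rest, IfThenElse with condition/then_case/else_case), the VarSet alias introduced above, and a hypothetical Union helper; it is a sketch, not a definitive implementation.

VarSet VisitStmt_(const Block* op, const VarSet& postAlive) final {
  // Backward propagation: analyze the second statement first.
  VarSet intermediateAlive = this->VisitStmt(op->rest, postAlive);
  return this->VisitStmt(op->first, intermediateAlive);
}

VarSet VisitStmt_(const IfThenElse* op, const VarSet& postAlive) final {
  // Both branches see the same post-state; their pre-states are joined.
  // (Handling of a missing else branch is omitted for brevity.)
  VarSet beforeThen = this->VisitStmt(op->then_case, postAlive);
  VarSet beforeElse = this->VisitStmt(op->else_case, postAlive);
  VarSet afterConditional = Union(beforeThen, beforeElse);
  // Variables read by the condition become live.
  return this->VisitExpr(op->condition, afterConditional);
}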

The for-statement is the only case that requires fixpoint computation.
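
A hedged sketch of that fixpoint, under the same assumptions as above (a For node carrying min/extent/body). The iteration terminates because the live sets only grow within a finite variable universe.

VarSet VisitStmt_(const For* op, const VarSet& postAlive) final {
  VarSet cur = postAlive;
  while (true) {
    // One more backward pass through the body, joined with the state after
    // the loop (the body may execute zero times).
    VarSet next = Union(this->VisitStmt(op->body, cur), postAlive);
    if (next == cur) break;  // fixpoint reached
    cur = next;
  }
  // Finally account for the bound expressions; killing the loop variable
  // itself is omitted here for brevity.
  return this->VisitExpr(op->min, this->VisitExpr(op->extent, cur));
}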

Also, we know that at the end of the program the set of live variables is empty.

Several trivial facts (relevant only to live variable analysis) are worth noting:

Memoization

You can see that, in the current setup, to obtain the live-variable information at an arbitrary point we need to compute from the very end back to that point (and if that point is inside a for-loop, we need to make sure the fixpoint is reached). It is therefore suggested that we compute only once and save the result. Fortunately, the above functor is pure, so we can cache it in a standard dictionary/map data structure, with the pair of an AST and the post-live-variable set as key, and the live-variable set before the AST as value.

This memoization strategy should apply to any pure IRFunctor:

template <typename Functor>
class MemorizedVersion : public Functor { ... };
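
As a toy illustration of the caching idea (and not the interface proposed below), a sketch assuming the wrapped functor is pure, exposes a Run method, and has an ordered Input type:

#include <map>

// Toy sketch: memoize a pure functor whose call is Output Run(const Input&).
template <typename Functor, typename Input, typename Output>
class Memoized : public Functor {
 public:
  Output Run(const Input& in) {
    auto it = cache_.find(in);
    if (it != cache_.end()) return it->second;  // hit: reuse the saved result
    Output out = Functor::Run(in);              // miss: evaluate once
    cache_.emplace(in, out);
    return out;
  }
 private:
  std::map<Input, Output> cache_;  // (input -> result), valid because Run is pure
};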

During the computation of the liveness at the very beginning of the program, every node in the whole program will be visited by the functor and thus cached in the dictionary.

However, we may only need the information once the fixpoint is reached, which means we can discard the old value whenever a new (AST, PostAlive) pair is computed for an AST that was already inserted into the dictionary with a different PostAlive'.

Ultimately, we want to transform the above dictionary into another dictionary keyed by AST, whose value is the pair of live-variable sets before and after executing that AST. Of course, this dictionary has a clear specification: only when the end of the whole program has the empty set as live variables does each point in the AST have the live-variable set indicated in the dictionary.

Also note that it is possible to query the liveness information anywhere in the program, as long as the functor is properly defined for every AST node and the queried point is at the entry or exit of some AST node.

Proposal

We generalize the above functor into a forward/backward data-flow analysis 'framework' to remove as much boilerplate code as possible.

// forward, functional, DFA
template <class lattice,       // the lattice type, also the prestate/poststate type
          class ltProof,       // the proof that lattice is a lattice, a subtype of IsLattice<lattice>
          const ltProof& ltOp> // lattice operations, bound at compile time
class ffdfa : public ExprFunctor<lattice(const Expr&, const lattice&)>,
              public StmtFunctor<lattice(const Stmt&, const lattice&)> {
  static_assert(std::is_base_of<IsLattice<lattice>, ltProof>::value,
                "ltProof must be a class derived from IsLattice<>");
  ...
};

template <class T>
class IsLattice { // classical lattice definition: T is a lattice under the following operations
  public:
    // the least upper bound of two elements
    virtual T join(const T& a, const T& b) = 0;
    // the greatest lower bound of two elements
    virtual T meet(const T& a, const T& b) = 0;
    // the greatest element of the lattice
    virtual const T& top() = 0;
    // the least element of the lattice
    virtual const T& bottom() = 0;
    virtual bool equal(const T& a, const T& b) = 0;
    virtual ~IsLattice() = default;
};

I will indicate the usage of IsLattice in the comment section. This part (the way the lattice is defined) is actually quite fluid and more a matter of programming style.

There will also be another class, bfdfa, which propagates values backwards and is exactly a generalization of live variable analysis.

Memoization part:


template <typename R,             // return type of the functor to be memoized
          typename KType,         // key type of the dictionary
          MemorizingPolicy m,     // currently only the policy 'override'
          typename ParentFunctor, // the functor to be memoized
          typename ...Args,
          typename ...InitArgs>   // initialization arguments used when initializing the dictionary
class RecordingExprFunctor<KType(const Expr& n, Args... args), // expects a function that maps the input to the key type
                           ParentFunctor,
                           m> : public ParentFunctor {
  ...
  using ThisFunctionType = R(const Expr& n, Args... args);
  static_assert(std::is_base_of<ExprFunctor<ThisFunctionType>, ParentFunctor>::value,
                "ParentFunctor must be of ExprFunctor type");

  // We propose to hold a unique_ptr to the memoizing dictionary (named "A"; it records the
  // evaluations of the functor), so that reuse of MemorizedVersion does not act on an
  // already-filled dictionary (after each memoization, std::move it out).

  // We will have another dictionary (called "B" here) that maps the key type to the input type.

  R VisitExpr(const Expr& n, Args... args) override;

  // Every time VisitExpr is called with input 'x', it first checks whether dict B has the key
  //     and, if it does and the key maps to the "same" input 'x', simply queries dict A.
  // Otherwise we evaluate and, following the MemorizingPolicy 'override',
  //     insert (or replace) the newly evaluated value in dict A.

  // It is designed in this twisted way so that if KType = AST and the input type = <AST x Lattice>,
  //     we can capture the final value of the fixpoint computation.

  // Of course, if KType equals the input type, this becomes trivial memoization and dict B is unneeded.
};

// Similar for StmtFunctor

// Then IRFunctor
template <typename R,
          typename KType,
          MemorizingPolicy m,
          typename ParentFunctor,
          typename ...Args,
          typename ...InitArgs>
class RecordingIRFunctor<KType(const NodeRef& n, Args... args),
                         ParentFunctor,
                         m> : RecordingExprFunctor<KType(const Expr& n, Args... args), ParentFunctor, m>,
                              RecordingStmtFunctor<KType(const Stmt& n, Args... args), ParentFunctor, m> {
  ...
};

Execution order

In the case of live variable analysis, the memoized dictionary can map each AST node, including a single variable, to a pair of live-variable sets. For example, if the variable is a node in a Provide statement, we certainly have to first propagate through the LHS of the Provide (killing the defined variable) and then propagate through the RHS. This order is exactly a reflection of how the execution order of the Provide statement, or more generally of TVM IR, is defined.

That means, to some extent, that if a very detailed formal semantics of TVM IR is specified in the future, this analysis framework will need to change correspondingly.

Related Discussion Post: https://discuss.tvm.ai/t/traditional-program-analysis-on-tvm-ir/4960

wweic commented 4 years ago

Hey, there is a similar RFC on this topic (https://github.com/apache/incubator-tvm/issues/4449).

cc @gussmith23 @zhiics @jroesch @MarisaKirisame @icemelon9 @slyubomirsky

DKXXXL commented 4 years ago

Other things: IsLattice

The basic usage is just:

template <typename T>
struct STD_SET_IS_LATTICE : public IsLattice<std::set<T>> {
  using SET = std::set<T>;
  public:
     SET join(const SET& a, const SET& b) override;
     SET meet(const SET& a, const SET& b) override;
     const SET& top() override;
     const SET& bottom() override;
     bool equal(const SET& a, const SET& b) override;
     static const STD_SET_IS_LATTICE<T>& __ops() {
       static const STD_SET_IS_LATTICE<T> op;
       return op;
     }
     inline static const STD_SET_IS_LATTICE<T>& ops = STD_SET_IS_LATTICE<T>::__ops();
};

and when making things concrete, you can declare something like class X : public ffdfa<std::set<T>, STD_SET_IS_LATTICE<T>, STD_SET_IS_LATTICE<T>::ops>.
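
For instance, the two interesting operations of this set lattice are plain set union and intersection; a minimal self-contained sketch using the standard algorithms:

#include <algorithm>
#include <iterator>
#include <set>

// join: the least upper bound of two sets is their union.
template <typename T>
std::set<T> SetJoin(const std::set<T>& a, const std::set<T>& b) {
  std::set<T> out;
  std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                 std::inserter(out, out.begin()));
  return out;
}

// meet: the greatest lower bound of two sets is their intersection.
template <typename T>
std::set<T> SetMeet(const std::set<T>& a, const std::set<T>& b) {
  std::set<T> out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.begin()));
  return out;
}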

Other things: A Concrete computing example for Live Variable Analysis

To make the proposal more understandable, we compute on a concrete example, the program given below:

Block1
    a = b
    Block2
        If (iszero b)
            then x = 1
            else x = 2
        c = x + 2
 # {} as Alive, at the exit point of ASTNode Block1

Note that we will calculate the live variables at the entry/exit point of each AST node (except for variables, so as not to disrupt the formatting).

We know that at the end there are no live variables, and due to the recursive calls, this empty set will be propagated up to the exit points of Block2 and the statement c = x+2 (a leaf of the AST):

Block1
    a = b
    Block2
        If (iszero b)
            then x = 1
            else x = 2
          # {x} as Alive -- exit point of If Statement
          # {x} as Alive -- entry point of 'c=x+2'
        c = x + 2
          # {}  as Alive -- exit point of 'c=x+2'
      # {}  as Alive -- exit point of Block2 
  # {}  as Alive -- exit point of Block1 

A single assignment statement changes the live variables accordingly, and because of how Block is handled, the exit point of the IfStatement has the same value as the entry point of the following statement.
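
The transfer function for an assignment is the usual gen/kill rule, matching the order described earlier (kill the LHS first, then generate from the RHS). A sketch, with UsedVars as a hypothetical helper that collects the variables an expression reads:

// live_before = (live_after \ {lhs}) U vars(rhs)
VarSet AssignTransfer(const Variable* lhs, const Expr& rhs, VarSet liveAfter) {
  liveAfter.erase(lhs);                      // kill: lhs is overwritten here
  for (const Variable* v : UsedVars(rhs)) {  // gen: everything the rhs reads
    liveAfter.insert(v);
  }
  return liveAfter;
}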

The IfStatement is handled according to the pseudo-code; note that the conditional has its own information as well.

Reflecting the pseudo-code: because LiveAnalysisFunctor cond afterConditional is evaluated, and MemorizedVersion records the input and output of every such call once evaluated, this conditional will be a key in the cache dictionary.

In the end, the result is the following:

  # {b} as Alive, entry of Block1, since recursion directly propagate it up
Block1
      # {b} as Alive, entry of 'a = b'
    a = b
      # {b} as Alive, exit of 'a = b'
      # {b} as Alive, entry of Block2, since recursion directly propagate it up
    Block2
          # {b} as Alive, entry of If Statement
              # {b} as Alive, entry of '(iszero b)'
        If (iszero b)
              # {} as Alive, exit of '(iszero b)'
                 # {} as Alive, entry of 'x=1'
            then x = 1
                 # {x} as Alive -- exit of 'x=1'
                   # {} as Alive, entry of 'x=2'
            else x = 2
                   # {x} as Alive -- exit of 'x=2'
          # {x} as Alive -- exit point of If Statement
          # {x} as Alive -- entry point of 'c=x+2'
        c = x + 2
          # {}  as Alive -- exit point of 'c=x+2'
      # {}  as Alive -- exit point of Block2 
  # {}  as Alive -- exit point of Block1 

junrushao commented 4 years ago

Hi,

Thank you for your detailed proposal! I think the design of the data flow analysis works for general CFGs.

However, I am not knowledgeable enough to understand the motivation and use cases of liveness analysis on TVM IR. Could you kindly give a real-world example of how it may help analyze TVM IR?

Thank you so much!

DKXXXL commented 4 years ago

Hi @junrushao1994 , Thanks for the comments! :) I will get back to you about a real world example as soon as possible.

But generally speaking, a dead code elimination pass is required in some workloads; and once TVM outputs a program with intrinsics, not even targeting CPU or GPU, LLVM or other later processing stages cannot help.

What's more, applying transformations at the stage when TVM is processing is much easier and less error-prone.

MarisaKirisame commented 4 years ago

Aah, the split of IR. It seems like both sides need their own analysis, and thus have to reinvent the wheel twice. Looking forward to the merge of the IRs.

DKXXXL commented 4 years ago

Hi @junrushao1994 , an over-simplified example from the industrial context is the following:

...
B0 = tvm.compute((m,n), lambda i,j: A0[i,j] + 2*A1[i,j], name = "B0")
C0 = tvm.compute((m,n), lambda i,j: A0[i,j] + 2*A1[i,j], name = "C0")
D0 = tvm.compute((m,n), lambda i,j: B0[i,j] + 3*C0[i,j], name = "D0")
...

The customized TVM will schedule and use compute_at to the extreme, transforming it into something like

...
for (i, 0, m) {
  for (j, 0, n) {
    B0[0] = (A0[((i*stride) + (j*stride))] + (2f*A1[((i*stride) + (j*stride))]))
    C0[0] = (A0[((i*stride) + (j*stride))] + (2f*A1[((i*stride) + (j*stride))]))
    D0[((i*stride) + (j*stride))] = (B0[0] + (3f*C0[0]))
  }
}
...

This gives our 'incomplete' CSE and copy propagation a chance to assign C0 from B0, replace C0's occurrences in D0 with B0, and make C0 dead (or not, depending on what follows).

...
for (i, 0, m) {
  for (j, 0, n) {
    B0[0] = (A0[((i*stride) + (j*stride))] + (2f*A1[((i*stride) + (j*stride))]))
    C0[0] = B0[0]
    D0[((i*stride) + (j*stride))] = (B0[0] + (3f*B0[0]))
  }
}
...

The above 'incomplete' CSE and copy propagation passes can operate safely on straight-line code in a small range (without data-flow analysis), but the same does not hold for dead code elimination: if we don't know any liveness information outside this for loop, we cannot simply eliminate the assignment to C0[0].

Generally speaking, dead code can arise after copy propagation, and how dead code arises in TVM is similar to how it arises in LLVM and traditional compiler passes.
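
To make the dependence explicit, a dead-store check in a DCE pass would be gated on exactly this liveness query. A hedged sketch, where LivenessTable and LiveAfter are hypothetical names for the memoized dictionary and its lookup:

// A store can be removed only if the written buffer is dead afterwards.
bool CanEliminateStore(const Stmt& store, const Variable* buffer,
                       const LivenessTable& liveness) {
  const VarSet& liveAfter = liveness.LiveAfter(store);  // hypothetical query
  return liveAfter.count(buffer) == 0;  // no later read can observe the store
}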

comaniac commented 4 years ago

Hey @DKXXXL, thanks for the example! Just curious: do you think copy-propagation-induced dead code happens in current workloads? Or is this more of a concern with the TVM programming model, as in your example?

Another question: the name "data-flow" analysis confuses me a bit, because it seems to me that the proposed framework is not limited to data-flow analysis but covers general IR analysis or program analysis. Could you clarify it a little more?

Thanks.

DKXXXL commented 4 years ago

Hi @comaniac , Thanks for commenting. :)

Yes. This is a real problem happening in an industrial context. The current solution is either over-conservative or unsound.

About the name "Data-flow Analysis", I think it is more a terminology question. For example, CFA is also a kind of program analysis, but I don't think it is expressible in this framework or required for TVM IR (since there are not even first-class functions in TVM IR). Also, I am not sure whether this is expressive enough for all program analyses; program analysis is a really broad field.

In my opinion, this framework can express most of the data-flow analyses phrased as fixpoint computations in the first chapter of Principles of Program Analysis.

comaniac commented 4 years ago

@DKXXXL, thanks for the clarification; that seems fair enough to me :) Then it seems like #4449, #3895, and this RFC should be unified and designed together.

DKXXXL commented 4 years ago

Hi @comaniac, thanks for your suggestion! :) Because of your suggestion and the Unified IR post, I think I need to dig into the design and infrastructure of Relay for a while to see how to design a DFA infrastructure that fits both IRs.

xqdan commented 4 years ago

@tqchen, what's your suggestion? IMO, the low-level IR has been there for a while, and we have experience with and understanding of it. The Unified IR post, to me, is just a high-level proposal; details need to be discussed further, such as:

The most valuable thing to me is that with a unified IR we can make ops white boxes, that is to say, we can analyze into the ops. This is totally different from the current way, where we use ops as black boxes in the graph framework, or a separate IR. With white-box ops, we don't need to care about an op's name or the formula it uses; we can optimize ops in a general way. See what XLA is doing.

The abstraction level of the high-level IR matters. You don't want to lower an op's body too far, from tensor expressions into nested loops, since analyzing IR with that much context is a big cost.

So I suggest we keep this work moving on this thread while we discuss how to reuse the solution.

tqchen commented 4 years ago

We will still keep variations of functions in the unified IR, which means tensor-level loop nesting might need its own analysis.

Let us focus on discussing what a good design for the related tensor-level dataflow analysis would be. Note that we do want to make sure we reuse the common infrastructure, such as the function structure and passes, but perhaps not at the level of IR passes at the moment.

kaitingwang commented 4 years ago

@MarisaKirisame: I have some thoughts regarding your suggestion, between issue #3895 and this issue, about the possibility of a merged program analysis framework for RelayIR and HalideIR. I found @DKXXXL's proposal here clean and intuitive in that it does not require building a CFG. It leverages the restricted control flow inherent in HalideIR (i.e. the IR only supports for and if-then-else), although it would be good to clarify how the current proposal tracks 'location' information. If the proposal needs to be extended to handle RelayIR, which must handle functions as first-class citizens (now CFA is needed), as well as 'references' (a.k.a. pointer analysis) introduced as a result of autodiff, the framework would need to be more comprehensive. Perhaps I'm not fully understanding RelayIR, but are you thinking along the lines of https://plum-umd.github.io/abstracting-definitional-interpreters/ for a program analysis framework as a complete solution? Or are you just thinking of a global analysis framework (i.e. some global data structure that keeps up-to-date analysis information across different passes)? Comments appreciated!

kaitingwang commented 4 years ago

@gussmith23 @jroesch Comments appreciated!

MarisaKirisame commented 4 years ago

@kaitingwang I have talked to a professor doing program analysis, and he tells me AAM > ADI. So yeah, basically I am looking in the direction of a generic abstract interpretation framework. However, I do not have time to pull this off, even though I have wanted to for a long time :(

MarisaKirisame commented 4 years ago

@kaitingwang are you planning to work on it, btw?

DKXXXL commented 4 years ago

Hi @MarisaKirisame, thanks for commenting! So which one are you looking at, ADI or AAM? I am not familiar with either, but I remember AAM requires a state-machine transition definition of the semantics, whereas ADI requires a definitional interpreter to define the semantics. My other question is whether the one you have in mind could generate the current framework I propose, or just a data-flow analysis framework with forward/backward/must/may analysis, because I am concerned about the possible mental burden on programmers from a non-functional background of directly deriving the desired data-flow analysis with the AAM/ADI you propose. It is at least a bit hard for me. Thanks :)

MarisaKirisame commented 4 years ago

I am thinking of Abstracting Abstract Machines. Relay has a very simple semantics, so that won't be the problem (the problem is Abstracting Abstract Machines itself).

If it is ever done, we can expose a dataflow-analysis-like interface. A user might get away with only defining the base lattice to contain the abstractions of interest.

DKXXXL commented 4 years ago

Hi @MarisaKirisame, I currently have a question about representing "location inside the program", because when users want to query the Dataflow-Info, they will need to use a "location".

TLDR: pointers to AST nodes are not in one-to-one correspondence with "locations in the program". Can the usage/development of AAM/ADI lead to a data representation of "location" good enough for compiler-pass developers to easily query Dataflow-Info or other information?


First EDIT: in the following, add(x,x) is not a good enough example. We can give a better example in HalideIR:

Block {
  x = a  # Assignment A
  Block {
    ...
    Block {
      x = a # Assignment B
      ...
    }
  }
}

In the above code, Assignment A and Assignment B are actually represented by the same ASTNode in memory. Thus, when a pointer to "x = a" is queried, there is an ambiguity as to whether the Dataflow-Info right before the first occurrence or the second is being asked for.


Detailed version: I notice that in both HalideIR and RelayIR, we don't explicitly require the AST to really be a tree; it may be a DAG. That means a pointer to an AST node cannot easily represent a precise location in the program. For example, in HalideIR, add(x,x) might be a node with two pointers pointing to one node; however, using the pointer to x leads to positional ambiguity: is the user asking for the Dataflow-Info right before the first x or the second x?

The HalideIR part is still fixable; at least one brute-force way in mind is simply to turn the DAG into a real tree. I also have another idea to fix it, but the modification is not as simple as the current proposal, and I am currently concerned about how well that other solution adapts to this context of IRFunctor.

Moreover, I think the first brute-force solution cannot be adapted to RelayIR, because it seems (from the documentation) that the semantics of a DAG and of a tree differ in RelayIR?

Basically, when a compiler-pass developer wants to get Dataflow-Info in the current context, the signatures of IRMutator/IRFunctor/IRVisitor only provide "NodeRef" by default. Given the ambiguity of using NodeRef to represent location suggested above, I tend to believe this signature will have to be extended if compiler-pass developers want to get Dataflow-Info.

MarisaKirisame commented 4 years ago

@DKXXXL The DAG idea only makes sense for purely functional code. This is why we provide ANF to transform from DAG to tree, and the Feature Manager to detect whether it is a DAG. A pointer is a good representation of a location in the program and is used that way.

DKXXXL commented 4 years ago

@MarisaKirisame: I might still be confused about the semantics and internal memory representation of RelayIR. I would appreciate it if you could enlighten me: in the official documentation, the picture for ANF is still a DAG (a child can have several different parents) with respect to the internal memory representation.

The above %v1 appears in multiple (two) locations (or three, if the definition part of the LetNode can be queried), and when the user uses the pointer to the ASTNode (of variable %v1), the Dataflow-Info of which location will be returned?


But anyway, do you suggest that in HalideIR it is also recommended to just use a pointer to represent a location (so as to align the designs)? Thank you for the explanation! :)

MarisaKirisame commented 4 years ago

@DKXXXL The variables/global variables/type variables are free to be shared; variables are compared by pointer equality, so each new variable is distinct. However, nothing else in ANF mode is sharable. I only know that using the pointer is the canonical way in RelayIR. I am not sure about the lower IR, though.

DKXXXL commented 4 years ago

First Edit : rephrase the second question.


Hi @MarisaKirisame, I have several questions from learning AAM, about the advantages of implementing AAM over a normal data-flow analysis framework (like the one proposed above) in the context of Relay. Because I have no experience with Relay at all, I would appreciate it if you could resolve these questions:

  1. While it is true that AAM has the power to unify CFA with a wide range of DFA, which is very tempting, consider the current application area of Relay. I see Relay more as a "glue" language able to perform graph-level optimization, which means there won't be any computationally intensive piece of code written in Relay. That raises the question: what is the benefit of designing a general analysis infrastructure if it won't affect performance (since the computationally heavy part is not written in Relay)? You already have several analyses scattered around. Other than refactoring and providing a better, cleaner, and more maintainable development environment, is there an urgent or short-term benefit to doing this?

  2. Is there a benefit to adopting CFA in the context of Relay? In this post (#3895) I see you considering an example that combines pointer analysis with CFA; is that just for demonstration, or is it a real requirement that can help the performance of Relay? I was thinking CFA would be very helpful in the context of Scheme, because each function is very small, so CFA can hugely increase the precision of the data-flow analysis. However, in the official tutorial the functions in Relay seem really big. In that context CFA doesn't seem to help the precision of DFA much, and my limited functional programming experience cannot come up with an example where CFA itself helps (something like devirtualization in C++ or inlining the callee). Could you give an example where CFA can help Relay code?

  3. I am still learning AAM. If I understand correctly, even using AAM requires defining the abstract domain and the abstract transfer function separately for different analyses. What's more, I currently have doubts about expressing backward analyses such as live variable analysis using AAM. I can only find the related post (mentioned at the very end) that uses abstract interpretation to define live variable analysis. This makes me wonder whether AAM can subsume the general monotone dataflow framework. Could you give some references on using AAM to implement all kinds of analyses?

Great Thanks, and Happy New Year! :)

MarisaKirisame commented 4 years ago

@DKXXXL

  1. While it is true that tensors are the bottleneck, the high-level code dictates what tensors get computed. For example, you can do dead code elimination to remove unnecessary tensor operations (this is a real need right now), or deforestation (which requires an effect analysis) to fuse tensor operations into one graph, which can then be further optimized by all the graph-level passes. Right now we have dead code elimination, but it is simply incorrect - it assumes there are no effects, and @jroesch needs it fixed.
  2. I don't think there is much value in CFA. However, we do need pointer analysis, because the AD pass creates references (including references to closures that modify references) everywhere, and AAM brings pointer analysis for free.
  3. Again, AAM brings pointer analysis for free, because AAM maps every variable to a location (which is what pointer analysis does). Also, dataflow frameworks do not handle closures, or we would stick with one. For backward analysis, I don't see a problem: the idea of AAM is to map every variable to a finite set of locations, and you can do your backward transitions on those locations similarly. Happy new year!

zackcquic commented 3 years ago

Hi:

I am wondering if there is any update on this proposal (after the IR was unified)?

masahi commented 2 years ago

I assume this is no longer active.

kaitingwang commented 2 years ago

No longer active. Please close. Thanks.