bzcheeseman / Hobbit

Learning about and using LLVM/Polly for DL
Apache License 2.0

[Suggestion] Would you consider accepting a TensorFlow model and then emitting LLVM IR? #1

Open - FrozenGene opened this issue 6 years ago

FrozenGene commented 6 years ago

I see that your tests just call your C++ API. Have you considered having your project accept a trained TensorFlow model, emit LLVM IR from it, and then JIT and execute it?

bzcheeseman commented 6 years ago

That's one of the ways I saw this project going originally. I wanted to avoid simply rewriting kernels in IR and focus more on optimizations that aren't already in the LLVM toolkit, like replacing FP math with int math, chunking loops, fusing kernels, and moving loops to the GPU with the nvptx and amdgpu backends.

My plan for how this library might be useful at first was more or less as follows:

  1. Provide basic C++ implementations of various useful kernels.
  2. Later on, TF model importing & ONNX model importing.

I'm definitely open to changes in direction, especially if you want to help out!

FrozenGene commented 6 years ago

In fact, I am quite familiar with LLVM, because I am a compiler developer. :-) I commented because your idea for a DL framework is very similar to my own.

I am not suggesting that you write kernels in LLVM IR directly, as in your current implementation using the LLVM C++ API (if I haven't misunderstood your comment).

My thought is TF model / ONNX ----> LLVM IR ----> CPU JIT / GPU. That is, we parse the TF / ONNX protobuf, use the LLVM C++ API to emit LLVM IR, and also write the LLVM optimization passes you mentioned above.

Regarding "Provide basic C++ implementations of various useful kernels": could you explain this more? Do you want to implement them as a runtime to be called, or use the LLVM C++ API to emit LLVM IR directly?

bzcheeseman commented 6 years ago

I am not suggesting that you write kernels in LLVM IR directly, as in your current implementation using the LLVM C++ API (if I haven't misunderstood your comment). ... Regarding "Provide basic C++ implementations of various useful kernels": could you explain this more? Do you want to implement them as a runtime to be called, or use the LLVM C++ API to emit LLVM IR directly?

What I meant by this is that implementing the kernel for (say) gemm in IR (or using the C++ API) is costly and verbose, and there's no reason to do so when we can just provide a C++ implementation and then write passes that optimize the code generated by clang (and this also allows others to contribute more easily).
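For example, the "plain C++ kernel" approach could be as simple as the following sketch (the kernel choice and the clang invocation are illustrative):

```cpp
// Write the operator naively, compile with e.g.
//   clang++ -O1 -emit-llvm -c relu.cpp -o relu.bc
// and let our custom passes optimize the emitted IR.
#include <cstdint>

void relu(const float *in, float *out, std::uint64_t n) {
  for (std::uint64_t i = 0; i < n; ++i)
    out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}
```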

My thought is TF model / ONNX ----> LLVM IR ----> CPU JIT / GPU. That is, we parse the TF / ONNX protobuf, use the LLVM C++ API to emit LLVM IR, and also write the LLVM optimization passes you mentioned above.

I agree completely - keeping in mind what I said above ;) I think the easiest way to do the first part will be to translate TF/ONNX into a series of functions in C++ and then run the optimizing passes over them - essentially build a Module from the model.

FrozenGene commented 6 years ago

If I understand correctly, your thought is that for Conv / tanh and so on, you want to implement them in C++ and package them as a runtime, and then other code calls them. Of course, you could build the runtime as LLVM bitcode files and link them into other modules, right?

If so, I think this approach cuts both ways. The advantage, as you just mentioned, is that we can implement them and others can contribute easily. However, we cannot fully control the IR, because we cannot control Clang's IR generation.

bzcheeseman commented 6 years ago

If I understand correctly, your thought is that for Conv / tanh and so on, you want to implement them in C++ and package them as a runtime, and then other code calls them.

Yes exactly.

Of course, you could build the runtime as LLVM bitcode files and link them into other modules, right?

Also true, yes.

If so, I think this approach cuts both ways. The advantage, as you just mentioned, is that we can implement them and others can contribute easily. However, we cannot fully control the IR, because we cannot control Clang's IR generation.

I agree with you completely - there are drawbacks to both methods. I will say, however, that it might be nice not to have to write the IR by hand, given how much more verbose it is. We could also write C++ code, run codegen/opt, and then hand-tune the IR; this strikes me as the best of both worlds: we don't have to hand-write IR and can use clang to generate large blocks of code, but we still have full control of the IR and its optimizations.

FrozenGene commented 6 years ago

If so, packaging them as LLVM bitcode files and linking them into other modules is better. Then, when we deploy, we avoid needing an additional runtime library.

My concern with using Clang's IR generation is that, for large code, the emitted IR may differ between Clang versions, and hand-tuning the IR to suit our needs may be very hard too. As far as I know, Intel's nGraph uses a runtime this way. But TVM doesn't use a runtime this way (I am not sure about TVM; if that's not true, please correct me).

FrozenGene commented 6 years ago

Additionally, if you decide to go this way, I think you should start parsing a TF / ONNX model and emitting LLVM IR now, even if it is just a + b, because this can prove the approach works. :-) Next comes filling in the missing parts: operators, optimization passes, and so on.
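For concreteness, emitting a + b with the LLVM C++ API really is only a handful of lines. A minimal sketch (the function name "add" and the float element type are illustrative choices):

```cpp
// Build a module containing: float add(float a, float b) { return a + b; }
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module mod("toy_model", ctx);

  auto *f32 = llvm::Type::getFloatTy(ctx);
  auto *fnTy = llvm::FunctionType::get(f32, {f32, f32}, /*isVarArg=*/false);
  auto *fn = llvm::Function::Create(fnTy, llvm::Function::ExternalLinkage,
                                    "add", &mod);

  llvm::IRBuilder<> builder(llvm::BasicBlock::Create(ctx, "entry", fn));
  auto args = fn->arg_begin();
  llvm::Value *a = &*args++;
  llvm::Value *b = &*args;
  builder.CreateRet(builder.CreateFAdd(a, b, "sum"));

  llvm::verifyModule(mod, &llvm::errs()); // sanity-check the generated IR
  mod.print(llvm::outs(), nullptr);       // dump the textual IR
}
```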

bzcheeseman commented 6 years ago

If so, packaging them as LLVM bitcode files and linking them into other modules is better. Then, when we deploy, we avoid needing an additional runtime library.

Agreed.

But TVM doesn't use a runtime this way (I am not sure about TVM; if that's not true, please correct me).

TVM does generate IR on the fly as far as I know, but they also do not (afaik) run the LLVM optimizing passes, which is one thing I was hoping to take advantage of.

Additionally, if you decide to go this way, I think you should start parsing a TF / ONNX model and emitting LLVM IR now, even if it is just a + b, because this can prove the approach works. :-) Next comes filling in the missing parts: operators, optimization passes, and so on.

So what you're suggesting is:

  1. An operator registry of .bc or .ll files (I'd rather write .ll files so I can read them)
  2. A set of passes that perform optimizations
  3. A model converter and AOT/JIT engine to bring in the execution phase

Does that summary sound about right?

FrozenGene commented 6 years ago

Agreed. I know TVM doesn't leverage LLVM optimization passes or Polly, so I think we could do that. However, TVM introduced Halide IR to do graph optimization; in my opinion, Polly and the LLVM optimizations can also achieve this (if that's not right, please correct me).

My suggestion for operators: implement them in C++ and use Clang to compile them into one runtime LLVM bitcode file. You could also generate separate LLVM bitcode files, but we should link them into a single bitcode file for use. Whether it's .ll or .bc doesn't matter; they convert to each other easily.

For 2, yes.

For 3, yes. We need a tool to parse the protobuf model and generate LLVM IR, then leverage a CPU JIT / NVPTX to execute it.

bzcheeseman commented 6 years ago

Agreed. I know TVM doesn't leverage LLVM optimization passes or Polly, so I think we could do that. However, TVM introduced Halide IR to do graph optimization; in my opinion, Polly and the LLVM optimizations can also achieve this (if that's not right, please correct me).

We would have to add a few optimizations, namely operator fusion, input tiling (loop splitting), and low bitwidth fp->int math, but otherwise yes exactly.
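To illustrate the fp->int piece, here is a rough sketch of symmetric int8 quantization (the scheme and the scale handling are illustrative, not a settled design):

```cpp
// Quantize floats to int8, multiply in integer math, rescale the result.
#include <algorithm>
#include <cmath>
#include <cstdint>

std::int8_t quantize(float x, float scale) {
  long q = std::lround(x / scale);
  return static_cast<std::int8_t>(std::clamp(q, -128L, 127L));
}

// a * b in int math: the int32 product carries the combined scale sa * sb.
float int_mul(std::int8_t qa, std::int8_t qb, float sa, float sb) {
  std::int32_t acc = static_cast<std::int32_t>(qa) * static_cast<std::int32_t>(qb);
  return static_cast<float>(acc) * (sa * sb);
}
```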

My suggestion for operators: implement them in C++ and use Clang to compile them into one runtime LLVM bitcode file. You could also generate separate LLVM bitcode files, but we should link them into a single bitcode file for use. Whether it's .ll or .bc doesn't matter; they convert to each other easily.

I see - just save the output from a single run from clang so that it doesn't get regenerated - I like it.
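For reference, that workflow would be roughly `clang++ -O1 -emit-llvm -c gemm.cpp -o gemm.bc` per operator, `llvm-link gemm.bc tanh.bc -o runtime.bc` to merge everything, and `llvm-dis` / `llvm-as` to convert between .ll and .bc (the file names here are just for illustration).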

We need a tool to parse the protobuf model and generate LLVM IR, then leverage a CPU JIT / NVPTX to execute it.

Agreed 100% - I would add that we should target a new enough LLVM to use the AMDGPU backend as well.

FrozenGene commented 6 years ago

For additional optimizations, I think introducing Polly into the project can help us a lot with the loop parts. However, we should also use profiling to tell us where the bottlenecks are.

AMDGPU should not be a problem. As far as I know, the AMDGPU backend is very mature in recent versions of LLVM.

bzcheeseman commented 6 years ago

For additional optimizations, I think introducing Polly into the project can help us a lot with the loop parts. However, we should also use profiling to tell us where the bottlenecks are.

Operator fusion isn't really something that people talk too much about outside of ML - Polly may have a pass that combines loops or something that will function effectively the same way.

AMDGPU should not be a problem. As far as I know, the AMDGPU backend is very mature in recent versions of LLVM.

That's my understanding as well.

FrozenGene commented 6 years ago

Besides Polly, I find that recent GCC / LLVM implement it (loop fusion). See this link: https://bugs.llvm.org/show_bug.cgi?id=34364 There are two patches for LLVM to handle it. Does this satisfy your idea, or do DL models have special requirements that we should implement ourselves? If so, could you write one example?

The profiling I mentioned is not only about loops but about whole-project performance. My point is that we should have relevant benchmark ML models to help us find the bottlenecks.

Also: Operator fusion isn't really something that people talk too much about outside of ML - Polly may have a pass that combines loops or something that will function effectively the same way.

Yes, but we should investigate Polly more deeply to verify.

bzcheeseman commented 6 years ago

Besides Polly, I find that recent GCC / LLVM implement it (loop fusion). See this link: https://bugs.llvm.org/show_bug.cgi?id=34364 There are two patches for LLVM to handle it. Does this satisfy your idea, or do DL models have special requirements that we should implement ourselves? If so, could you write one example?

I did find the polly tiling passes (finally), so my main concern for performance now shifts to kernel fusion - which, as you say, we should verify by looking deeper into Polly.

The profiling I mentioned is not only about loops but about whole-project performance. My point is that we should have relevant benchmark ML models to help us find the bottlenecks.

Absolutely.

FrozenGene commented 6 years ago

If your concern is mainly GPU kernel fusion, Polly-ACC may have done it already: http://spcl.inf.ethz.ch/Research/Parallel_Programming/Polly-ACC/. However, from the Polly docs:

We currently perform classical loop transformations, especially tiling and loop fusion to improve data-locality. Polly can also exploit OpenMP level parallelism, expose SIMDization opportunities. Work has also been done in the area of automatic GPU code generation.

The Polly mainline should have incorporated the Polly-ACC work.

bzcheeseman commented 6 years ago

What I meant is more that if we have separate loops (like Conv -> BN -> ReLU), all of that can be done inside one kernel call without launching multiple kernels. For example, I would like:

```cpp
for (uint64_t k = 0; k < K; k++) {
  for (uint64_t m = 0; m < M; m++) {
    for (uint64_t n = 0; n < N; n++) {
      C[m][n] += ReLU(scaledBN(alpha * A[m][k] * B[k][n]));
    }
  }
}
```
FrozenGene commented 6 years ago

In my opinion, Polly does not include this. If we use Polly, we may need to modify Polly's detection to suit our needs.

bzcheeseman commented 6 years ago

I see - then this is probably the most important optimization we can perform. Looking at the TVM benchmarks, this is how they got such impressive performance numbers.

Then the problem becomes:

  1. Read in TF/ONNX proto file
  2. Perform graph optimizations by fusing kernels
  3. Emit IR and send through Polly

OR

  1. Read in TF/ONNX proto file
  2. Emit IR and send through modified version of Polly

I'm leaning towards modifying Polly (or adding a pass based on Polly) to do the kernel fusion.

FrozenGene commented 6 years ago

Yes. TVM does step 2 by introducing Halide IR, but I think we can also do it on LLVM IR, i.e. write an LLVM transformation pass. Right? If you have any concerns, I am very happy to hear them.

For step 3, Polly can run before or after LLVM's standard optimization passes. According to this doc: https://polly.llvm.org/docs/Architecture.html#polly-in-the-llvm-pass-pipeline both positions have advantages and disadvantages, but I personally prefer running Polly before the standard optimization passes (Polly's default), because then Polly sees IR that is close to the original IR we give it.
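(If I remember Polly's options correctly, this placement is controlled by opt's `-polly-position` flag, with values like `early` and `before-vectorizer`; worth double-checking against the doc above.)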

bzcheeseman commented 6 years ago

Yes. TVM does step 2 by introducing Halide IR, but I think we can also do it on LLVM IR, i.e. write an LLVM transformation pass. Right? If you have any concerns, I am very happy to hear them.

I agree with you in principle, but I think that will be quite difficult because of the analysis required. I've been examining the polly ScopPass and I think this is what we will want to implement actually. Of course this will require learning a whole new IR, but I think much of the analysis we would need to perform has already been done and the transformation will be simpler.

FrozenGene commented 6 years ago

I see your modified comment now.

Yes, modifying Polly is also a good choice, but then we would have to maintain a modified Polly tree. Your original choice can be done without modifying Polly or LLVM; we just add our own optimization passes.

In fact, I just saw a video: https://www.youtube.com/watch?v=uq67__tfdtQ He chooses to modify Polly. :-)

bzcheeseman commented 6 years ago

Sorry - I didn't mean modifying Polly directly - rather adding a pass to Polly. Since it's just an LLVM pass I believe that we should be able to simply add a new one and maintain that pass rather than having to maintain a modified version of Polly. I think we agree entirely on the high-level plan :)

FrozenGene commented 6 years ago

You mean we add a new pass (for example, DLScopPass) into Polly's source directory, then declare DLScopPass : public ScopPass and override the related methods, right?

Yes, but it is in fact a huge project. :-)

bzcheeseman commented 6 years ago

You mean we add a new pass (for example, DLScopPass) into Polly's source directory, then declare DLScopPass : public ScopPass and override the related methods, right?

Yes, but I don't think you have to add it into their source code directory, you can link it against the polly shared library and keep it outside the source (or at least this is my understanding). I'm going to try this with a pass that simply prints something just to see if I can even compile it. (as soon as polly finishes compiling)
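Something like this minimal skeleton is what I have in mind - a sketch against Polly's legacy ScopPass interface, where the class and pass names are made up and `Scop::dump()` is only available in debug builds:

```cpp
// An out-of-tree "print something" Scop pass (legacy pass manager).
#include "llvm/Pass.h"
#include "polly/ScopInfo.h"
#include "polly/ScopPass.h"

namespace {
class KernelFusionPass : public polly::ScopPass {
public:
  static char ID;
  KernelFusionPass() : polly::ScopPass(ID) {}

  bool runOnScop(polly::Scop &S) override {
    // A real fusion pass would rewrite the isl schedule tree here to
    // merge adjacent loop nests; for now, just dump the SCoP.
    S.dump();
    return false; // IR not modified
  }
};
} // namespace

char KernelFusionPass::ID = 0;

static llvm::RegisterPass<KernelFusionPass>
    X("dl-kernel-fusion", "DL kernel fusion (skeleton)");
```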

Yes, but it is in fact a huge project. :-)

oh yes it is :P

FrozenGene commented 6 years ago

In principle, that should not be a problem. If we go this way, we can stay separate from Polly and LLVM and just write our own LLVM / Polly optimization passes.

Though it is huge, we can start from a simple model (like a + b). :-)

bzcheeseman commented 6 years ago

In principle, that should not be a problem. If we go this way, we can stay separate from Polly and LLVM and just write our own LLVM / Polly optimization passes.

I agree, but from my experience with this stuff just getting it to compile is the first challenge :P

Though it is huge, we can start from a simple model (like a + b). :-)

Absolutely - that's why I was starting with dot product or just simple gemm - it's a common benchmark target that we can use to show this approach works well

FrozenGene commented 6 years ago

I think maybe we can start parsing the TF / ONNX model and go through the whole pipeline, like my original suggestion. :-) hahahaha...

bzcheeseman commented 6 years ago

LOL fair enough. On the dev branch I think we can specialize the AST to read in the TF/ONNX graph and build a fairly simple operator registry before we worry too much about the polly SCOP optimizations

Also, not sure if you saw, I added a roadmap to the README - feel free to add more details just for coordination's sake.

FrozenGene commented 6 years ago

I see the roadmap in the dev branch now. I would prefer to put importing the TF / ONNX model before the operator registry, because parsing the model should be the starting point from the user's / developer's view. It's just like a compiler, which reads a file first and then does lexical analysis, parsing, semantic analysis, code generation, and so on.

bzcheeseman commented 6 years ago

Sorry, I guess that is slightly unclear - I was going to do a couple basic operators (gemm, maybe elementwise add) so that we could test the codegen after getting the parsing working. I figured both would be going on at the same time as we increase the complexity of supported models from the TF graph defs

FrozenGene commented 6 years ago

I tried parsing a TF model in C++ during the day today. I can load the TF model successfully, and the code is very short, so I think the two pieces of work can go on at the same time just fine.

Additionally, I see that you added a pass inheriting from ScopPass successfully - great! It proves our idea can work.

Currently, the dev branch's code isn't very clear to me; you are refactoring the code structure. I hope to see the refactoring finished soon, and maybe I can join in and do some work at that point.

One more thing: I am in China, so I may not be able to reply to messages promptly. :-)

bzcheeseman commented 6 years ago

I tried parsing a TF model in C++ during the day today. I can load the TF model successfully, and the code is very short, so I think the two pieces of work can go on at the same time just fine.

Great! I expected it would be fairly short because it's just protobuf, but it's good to hear for certain!
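For anyone following along, loading a frozen GraphDef really is just protobuf. A sketch, assuming the generated TF proto headers are available (the file name is illustrative):

```cpp
#include <fstream>
#include <iostream>
#include "tensorflow/core/framework/graph.pb.h"

int main() {
  tensorflow::GraphDef graph;
  std::ifstream in("frozen_model.pb", std::ios::binary);
  if (!graph.ParseFromIstream(&in)) {
    std::cerr << "failed to parse model\n";
    return 1;
  }
  // Walk the nodes we would later turn into AST nodes and then IR.
  for (const auto &node : graph.node())
    std::cout << node.name() << " : " << node.op() << "\n";
}
```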

Additionally, I see that you added a pass inheriting from ScopPass successfully - great! It proves our idea can work.

Yes - I'm exploring the capabilities right now.

Currently, the dev branch's code isn't very clear to me; you are refactoring the code structure. I hope to see the refactoring finished soon, and maybe I can join in and do some work at that point.

Yeah sorry - I'm refactoring quite a few things. My plan is to use something very similar to LLVM's Value* for nodes and then do codegen by visiting them (like Halide/TVM). I'll be afk for a lot of today, but I hope to get some work done tomorrow and have at least a clean Node base that can be built on.

One more thing: I am in China, so I may not be able to reply to messages promptly. :-)

Of course - I'm just excited - but I do realize we both have other things to do too :)

FrozenGene commented 6 years ago

I suddenly thought of something about the operator registry as one LLVM bitcode runtime. For the operators, we will accept various types, so we will implement them with C++ templates, right? If we do that, C++ will do name mangling, and in the LLVM IR we cannot get the function names directly. But if we use LLVM IR directly to generate these operators, we will not have this problem. Have you considered this issue?

bzcheeseman commented 6 years ago

Actually yes, I thought about this today :P

I've started cataloguing some useful helper functions - I think you were pretty much right originally - we could parse the model files into an AST with some RTTI and then just visit all the nodes and do codegen from an operator registry. Then it'll be easy to run optimizing passes.

FrozenGene commented 6 years ago

"AST with some RTTI" - what does RTTI mean here, and can you give more detail?

So, you also agree we should emit LLVM IR directly, right?

In fact, if we use C++ to implement the operator runtime, we can still find the right function, but we would have to understand the name mangling rules and bring them into the project. For example, _Z marks a mangled name, the number that follows is the length of the function name, and so on. But I think generating LLVM IR directly is probably the right way.
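For completeness, the usual workaround on the C++ side is to keep the templates internal and export unmangled entry points - a sketch, with illustrative names and types:

```cpp
// Instantiate templates internally but expose plain-C symbols, so the
// registry can look up functions like "gemm_f32" without demangling.
#include <cstdint>

template <typename T>
void gemm_impl(const T *A, const T *B, T *C,
               std::int64_t M, std::int64_t N, std::int64_t K) {
  for (std::int64_t m = 0; m < M; ++m)
    for (std::int64_t n = 0; n < N; ++n) {
      T acc = 0;
      for (std::int64_t k = 0; k < K; ++k)
        acc += A[m * K + k] * B[k * N + n];
      C[m * N + n] = acc;
    }
}

// Unmangled entry point, one per supported element type.
extern "C" void gemm_f32(const float *A, const float *B, float *C,
                         std::int64_t M, std::int64_t N, std::int64_t K) {
  gemm_impl(A, B, C, M, N, K);
}
```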

bzcheeseman commented 6 years ago

Sorry - RTTI = "Run Time Type Information" - in my mind being able to use LLVM's dyn_cast basically :P
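Concretely, the LLVM-style pattern is a kind tag plus a classof() per subclass, which is all llvm::dyn_cast needs - a sketch with illustrative node names:

```cpp
#include "llvm/Support/Casting.h"

class Node {
public:
  enum NodeKind { NK_Gemm, NK_Add };
  explicit Node(NodeKind K) : Kind(K) {}
  NodeKind getKind() const { return Kind; }
  virtual ~Node() = default;

private:
  const NodeKind Kind; // the tag dyn_cast dispatches on via classof()
};

class GemmNode : public Node {
public:
  GemmNode() : Node(NK_Gemm) {}
  static bool classof(const Node *N) { return N->getKind() == NK_Gemm; }
};

// During codegen: if (auto *G = llvm::dyn_cast<GemmNode>(node)) { ... }
```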

So, you also agree we should emit LLVM IR directly, right?

Yes - just had to come around to it :)

FrozenGene commented 6 years ago

OK, I see. You want to match LLVM's -fno-rtti and use LLVM's style of RTTI.

bzcheeseman commented 6 years ago

Yeah exactly.

FrozenGene commented 6 years ago

I mentioned that we should benchmark; I want to compare the performance of TensorFlow against our implementation. Currently, I want to test VGG16, ResNet18, and MobileNet, but I cannot find inference benchmarks for these three models in TensorFlow, for example, the time cost per image. Do you know where I can find them? I want to start investigating and get the numbers.

bzcheeseman commented 6 years ago

I agree with you - I was planning on using this once we had enough operators implemented and the AST & codegen were more complete:

https://github.com/baidu-research/DeepBench

FrozenGene commented 6 years ago

This is a good benchmark!

However, as mentioned in my last comment, we should also test VGG16 / ResNet18 / MobileNet and so on. Since we are doing compiler optimization, we should not care only about single cases but also their combinations, so something like ResNet18 is a good candidate. But I don't find these in your link (I only skimmed it, so I may have missed them).

So, I want to test models like ResNet18 using TensorFlow now, so we can get the data. But I can't find existing numbers (like the time cost per image). I assume somebody has done this before.

bzcheeseman commented 6 years ago

I'm sorry - I misunderstood your comment. I do also agree with you, and we should definitely benchmark both single operations and the combinations :)

I'd start here: https://www.tensorflow.org/performance/benchmarks (which I'm sure you've seen)

Also take a look at SqueezeNet - primarily developed for size/speed so we may see some impressive gains from that.

FrozenGene commented 6 years ago

I have used the benchmark_model tool (export_inference / freeze_graph / benchmark_model) to test TensorFlow models on the Intel i7 CPU in my Mac, and compared against TVM. TVM gets better results on MobileNet (about 13ms) than TensorFlow (about 56ms), and on VGG16 (TVM about 260ms, TensorFlow about 290ms). On ResNet, TVM is about 900ms while TensorFlow is about 148ms. But when we enable loop unrolling in TVM, TVM improves a lot: on ResNet the time is only about 21ms (I will confirm the number tomorrow). Overall TVM performs much better than TensorFlow.

bzcheeseman commented 6 years ago

TVM is definitely the standard to beat (or at least meet). Since TVM allows you to manually schedule computations I wouldn't expect we could get better than TVM/Halide, but if we can get comparable performance with much less user tuning effort I think that's a significant win.

FrozenGene commented 6 years ago

I compared the ResNets incorrectly. I find that TVM's ResNet is ResNet18, but TensorFlow's is ResNet50. However, I find that TVM's ResNet50 is slower than TensorFlow's, while MobileNet / VGG16 beat TensorFlow by a lot, as in my last comment.

FrozenGene commented 6 years ago

I find that for x86, TVM only provides a schedule for resnet-18, not resnet-50, so the latter is slower than TensorFlow. This is the situation your last comment's "manually schedule" refers to, right? If we introduce Polly, I hope we can schedule automatically.

bzcheeseman commented 6 years ago

Exactly yeah. Polly will free us from having to manually schedule the computations and should find a good schedule for us - so we won't have this issue of having the wrong schedule for a given DNN architecture.

We will likely have to modify the polly tile sizes (for example) based on the target machine though.
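(For what it's worth, I believe Polly exposes knobs for this - something like `-polly-tiling` and `-polly-default-tile-size` in opt - but the exact flag names should be verified against the Polly docs.)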

FrozenGene commented 6 years ago

Yes, I found this problem. For example, the resnet-18 schedule in TVM cares about the convolution sizes / padding / network structure and so on, and then does AVX instruction generation. If I give it ResNet50, recognition fails.

However, if we use Polly, how do we handle this situation? For example, on an x86 CPU with AVX2 features we should use AVX instructions, but we would also like something like TVM's strategy, for example: https://github.com/dmlc/tvm/blob/8ef4fdef229fcce37f9c71bd4b916811471fea9d/topi/python/topi/x86/conv2d.py#L35

bzcheeseman commented 6 years ago

That's why I wanted to use polly/llvm in the first place - in theory it should know enough about the host computer that it can generate fast code. I know on my machine running opt -O3 -polly <file> outputs vectorized IR and vector ASM after the backend so I would assume it's similar on most other machines? Clang must perform this check at some point.

FrozenGene commented 6 years ago

Clang itself does nothing here. Using Polly from Clang is clang -O3 -mllvm -polly; Clang passes the argument straight through to LLVM, and Polly also consumes LLVM IR directly.

You mean that Polly outputs vectorized IR and will do the same work as a schedule in TVM. But how can Polly match or beat TVM? TVM can control the vectorized IR according to the specific DNN model.