Jittor / jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
https://cg.cs.tsinghua.edu.cn/jittor/
Apache License 2.0

How to dump the generated C++ code #22

Open ykdu opened 4 years ago

ykdu commented 4 years ago

Is there any flag, e.g. save-temps, to dump the generated C++ code?

Gword commented 4 years ago

Thanks for your feedback. Jittor automatically dumps each fused op to a C++ file. You can view the file path of each fused op through jt.profiler, e.g.

import jittor as jt

jt.profiler.start(0, 0)      # start the profiler
a = jt.float32([1, 2, 3])
b = (a + a).data             # .data triggers execution of the fused op
jt.profiler.stop()           # stop the profiler
print(jt.profiler.report())  # print the collected profile, incl. file paths

The output is as follows:

[i 0323 14:41:52.916771 40 profiler.cc:212]
Profile result, sorted by TotalTime
('it/s' represent number of iterations per sec)
      Name  FileName     Count TotalTime   AvgTime   MinTime   MaxTime     Input    Output   Compute
[opkey0:array][opkey1:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][JIT:1][JIT_cpu:1][index_t:int32]][opkey2:broadcast_to[Tx:int32][DIM=1][BCAST=1][JIT:1][JIT_cpu:1][index_t:int32]][opkey3:binary[Tx:float32][Ty:int32][Tz:float32][OP:multiply][alloc_o22][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:000020,010030,020031,][var_info:1101111121]
          /home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc
                             1     877ns     877ns     877ns     877ns  26.1MB/s    13MB/s 3.42Mit/s

[['Name', 'FileName', 'Count', 'TotalTime', 'AvgTime', 'MinTime', 'MaxTime', 'Input', 'Output', 'Compute'], ['[opkey0:array][opkey1:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][JIT:1][JIT_cpu:1][index_t:int32]][opkey2:broadcast_to[Tx:int32][DIM=1][BCAST=1][JIT:1][JIT_cpu:1][index_t:int32]][opkey3:binary[Tx:float32][Ty:int32][Tz:float32][OP:multiply][alloc_o22][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:000020,010030,020031,][var_info:1101111121]', '/home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc', '1', '877', '877', '877', '877', '2.7366e+07', '1.3683e+07', '3.42075e+06']]
[i 0323 14:41:52.916797 40 profiler.cc:293]
Memory profile result, sorted by CheckTimes
           Name       FileName     CheckTimes    TLBMissRate

It shows the running time, calculation speed, file path, etc. of each fused op, where /home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc is the file path of the op a+a.
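Building on the snippet above, the dumped sources can also be located programmatically by reading the FileName column of the report. A minimal sketch, assuming the report layout shown above (a header row followed by one row per fused op); note that very long op names appear abbreviated with "..." in the sample output, in which case the file can instead be found in the cache directory by its hash suffix:

import jittor as jt

jt.profiler.start(0, 0)
a = jt.float32([1, 2, 3])
b = (a + a).data
jt.profiler.stop()

# report() returns a list of rows; the first row is the header
# ['Name', 'FileName', 'Count', ...], as shown in the output above.
report = jt.profiler.report()
header, rows = report[0], report[1:]
fn_idx = header.index('FileName')
for row in rows:
    path = row[fn_idx]
    print("fused op source:", path)
    # Print the first lines of the generated C++ file. If the path is
    # abbreviated with '...', search the cache dir for the hash instead.
    with open(path) as f:
        print("".join(f.readlines()[:10]))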

Note that these op files may depend on Jittor's environment in order to run, and these features are currently intended only for debugging; there is no official external support.

ykdu commented 4 years ago

Hi @Gword,

Thanks for the quick response. I was able to reproduce the process you described, but the generated C++ code doesn't contain SIMD intrinsics. Since Jittor performs vectorization, is there any other configuration I should set to enable SIMD and multithreading?

Another question: when I try a matmul case,

import jittor as jt

jt.profiler.start(0, 0)
c = jt.float32([[1, 1, 1], [2, 2, 2]])                      # shape 2x3
d = jt.float32([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])  # shape 3x4
e = c.matmul(d).data                                        # shape 2x4
jt.profiler.stop()
print(jt.profiler.report())

three temporary .cc files are generated (they look like broadcast + mkl_matmul + cast).

         /home/duyunkai/.cache/jittor/default/clang-8/jit/_opkey0:broadcast_to_Tx:float32__DIM=3__BCAST=1__alloc_i02__JIT:1__JIT_cpu:1__index_t:int3...hash:f9e2e96c458b4748_op.cc
                             1    26.2ms    26.2ms    26.2ms    26.2ms  2.68KB/s  1.19KB/s  916 it/s
mkl_matmul[T:float32][Trans_a:N][Trans_b:N][JIT:1][JIT_cpu:1][index_t:int32]
          /home/duyunkai/.cache/jittor/default/clang-8/jit/mkl_matmul_T:float32__Trans_a:N__Trans_b:N__JIT:1__JIT_cpu:1__index_t:int32__hash:f4de438e440df78b_op.cc
                             1    19.4ms    19.4ms    19.4ms    19.4ms     0 B/s  1.61KB/s  413 it/s
[opkey0:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][alloc_o12][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:][var_info:0222]
          /home/duyunkai/.cache/jittor/default/clang-8/jit/_opkey0:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__alloc_o12__JIT:1__JIT_cpu:1__index...hash:1435f2cc66833d3d_op.cc
                             2     852ns     426ns     376ns     476ns   161MB/s  80.6MB/s 21.1Mit/s

I'm confused about what these three files mean (this differs from broadcast+mul+reduce). Also, is there a single fused C++ file? My understanding was that Jittor "fuses" meta-ops and produces only one fused C++ source.

Gword commented 4 years ago

There are two ways to do vectorization in Jittor. The first is to vectorize manually using the vectorize pass; the second is to let the underlying compiler vectorize automatically. In your example, Jittor turned on the optimization options that let the underlying compiler vectorize automatically.
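One way to check whether the underlying compiler actually auto-vectorized a fused op is to disassemble the compiled shared object and look for vector registers. A minimal sketch, assuming objdump is installed and that the compiled .so sits next to the dumped .cc in the cache directory (the exact layout is version-dependent, and the path below is a placeholder):

import subprocess

# Placeholder path: substitute a real *_op.so from your own Jittor cache.
so_path = "/home/user/.cache/jittor/default/clang-8/jit/some_hash_op.so"

# Disassemble and count instructions touching SSE/AVX vector registers
# (xmm/ymm/zmm); their presence suggests the op was auto-vectorized.
asm = subprocess.run(["objdump", "-d", so_path],
                     capture_output=True, text=True).stdout
vec = [l for l in asm.splitlines()
       if "%xmm" in l or "%ymm" in l or "%zmm" in l]
print(len(vec), "instructions use vector registers")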

For your second question: Jittor's ops are all composed of meta ops, and Jittor fuses meta ops into several fused ops according to its fusion rules. In your matmul case, Jittor generates 6 meta ops; for the specific implementation, refer to the class Linear in nn.py. Four of those meta ops are fused into a single fused op, which is the first file in your output. The other two meta ops are of the same type and share one generated file, the third file in your output (note its Count of 2 in the profile).
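For intuition, here is a minimal sketch of a matmul written directly in meta ops (broadcast + elementwise multiply + reduce), in the spirit of the meta-op decomposition described above; exact method signatures may vary across Jittor versions:

import jittor as jt

def matmul_meta(a, b):
    # a: [n, m], b: [m, k] -> out: [n, k]
    (n, m), k = a.shape, b.shape[-1]
    # broadcast both operands to the common shape [n, m, k]
    a = a.broadcast([n, m, k], dims=[2])   # broadcast meta-op
    b = b.broadcast([n, m, k], dims=[0])   # broadcast meta-op
    # elementwise multiply (binary meta-op), then reduce over m
    return (a * b).sum(dim=1)              # reduce meta-op

c = jt.float32([[1, 1, 1], [2, 2, 2]])                      # 2x3
d = jt.float32([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])  # 3x4
print(matmul_meta(c, d).data)                               # 2x4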

Since conv, gemm, and other algorithms are already very well optimized by hardware vendor libraries, we forward such computations to them. In your case, the first fused op was identified as a matmul and forwarded to mkl_matmul for calculation, which is the second file in your output.
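As a quick sanity check, the built-in matmul (forwarded to mkl_matmul on CPU) should agree numerically with the meta-op version; this reuses matmul_meta from the sketch above:

import numpy as np

c = jt.float32([[1, 1, 1], [2, 2, 2]])
d = jt.float32([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
# Both paths compute the same 2x4 result.
assert np.allclose(c.matmul(d).data, matmul_meta(c, d).data)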