ykdu opened this issue 4 years ago
Thanks for your feedback. Jittor automatically dumps each fused op to a C++ file. You can view the file path of each fused op through jt.profiler, e.g.:
import jittor as jt
jt.profiler.start(0,0)
a=jt.float32([1,2,3])
b=(a+a).data
jt.profiler.stop()
print(jt.profiler.report())
The output is as follows:
[i 0323 14:41:52.916771 40 profiler.cc:212]
Profile result, sorted by TotalTime
('it/s' represent number of iterations per sec)
Name FileName Count TotalTime AvgTime MinTime MaxTime Input Output Compute
[opkey0:array][opkey1:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][JIT:1][JIT_cpu:1][index_t:int32]][opkey2:broadcast_to[Tx:int32][DIM=1][BCAST=1][JIT:1][JIT_cpu:1][index_t:int32]][opkey3:binary[Tx:float32][Ty:int32][Tz:float32][OP:multiply][alloc_o22][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:000020,010030,020031,][var_info:1101111121]
/home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc
1 877ns 877ns 877ns 877ns 26.1MB/s 13MB/s 3.42Mit/s
[['Name', 'FileName', 'Count', 'TotalTime', 'AvgTime', 'MinTime', 'MaxTime', 'Input', 'Output', 'Compute'], ['[opkey0:array][opkey1:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][JIT:1][JIT_cpu:1][index_t:int32]][opkey2:broadcast_to[Tx:int32][DIM=1][BCAST=1][JIT:1][JIT_cpu:1][index_t:int32]][opkey3:binary[Tx:float32][Ty:int32][Tz:float32][OP:multiply][alloc_o22][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:000020,010030,020031,][var_info:1101111121]', '/home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc', '1', '877', '877', '877', '877', '2.7366e+07', '1.3683e+07', '3.42075e+06']]
[i 0323 14:41:52.916797 40 profiler.cc:293]
Memory profile result, sorted by CheckTimes
Name FileName CheckTimes TLBMissRate
It shows the running time, computation speed, file path, etc. of each fused op, where /home/g_word/.cache/jittor/master/g++/jit/_opkey0:array__opkey1:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__JIT:1__JIT_cpu:1__in...hash:5cd3fc01961b9762_op.cc
is the file path of the op a+a.
Note that these op files may depend on Jittor's environment to run. These features are currently intended only for debugging; there is no official external support.
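As the output above shows, jt.profiler.report() returns a list of rows: a header row followed by one row per fused op. The dumped .cc file paths can therefore be pulled out with plain Python. A small sketch — the row below is a shortened, hypothetical placeholder, not real profiler output:

```python
# jt.profiler.report() returns a header row plus one row per fused op
# (see the printed list above). The data row here is a placeholder.
report = [
    ['Name', 'FileName', 'Count', 'TotalTime', 'AvgTime',
     'MinTime', 'MaxTime', 'Input', 'Output', 'Compute'],
    ['[opkey0:array][opkey1:unary...]', '/path/to/jit/fused_op.cc',
     '1', '877', '877', '877', '877',
     '2.7366e+07', '1.3683e+07', '3.42075e+06'],
]

header, rows = report[0], report[1:]
file_col = header.index('FileName')   # column holding each fused op's .cc path
paths = [row[file_col] for row in rows]
print(paths)                          # ['/path/to/jit/fused_op.cc']
```

This only relies on the report shape shown in the output above; the actual paths point into Jittor's cache directory.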
Hi @Gword,
Thanks for the quick response. The process you described can be successfully reproduced, but the generated C++ code doesn't contain SIMD intrinsics. Since Jittor does vectorization, is there any other configuration I should set to enable SIMD and multithreading?
Another question: when I try a matmul case
jt.profiler.start(0,0)
c = jt.float32([[1,1,1], [2,2,2]]) # 2*3
d = jt.float32([[1,1,1,1], [2,2,2,2], [3,3,3,3]]) # 3*4
e = c.matmul(d).data
jt.profiler.stop()
print(jt.profiler.report())
three temp .cc files are generated (it looks like broadcast + mkl-matmul + cast):
/home/duyunkai/.cache/jittor/default/clang-8/jit/_opkey0:broadcast_to_Tx:float32__DIM=3__BCAST=1__alloc_i02__JIT:1__JIT_cpu:1__index_t:int3...hash:f9e2e96c458b4748_op.cc
1 26.2ms 26.2ms 26.2ms 26.2ms 2.68KB/s 1.19KB/s 916 it/s
mkl_matmul[T:float32][Trans_a:N][Trans_b:N][JIT:1][JIT_cpu:1][index_t:int32]
/home/duyunkai/.cache/jittor/default/clang-8/jit/mkl_matmul_T:float32__Trans_a:N__Trans_b:N__JIT:1__JIT_cpu:1__index_t:int32__hash:f4de438e440df78b_op.cc
1 19.4ms 19.4ms 19.4ms 19.4ms 0 B/s 1.61KB/s 413 it/s
[opkey0:unary[Tx:int64][Ty:float32][OP:cast][alloc_i02][alloc_o12][JIT:1][JIT_cpu:1][index_t:int32]][JIT:1][JIT_cpu:1][graph:][var_info:0222]
/home/duyunkai/.cache/jittor/default/clang-8/jit/_opkey0:unary_Tx:int64__Ty:float32__OP:cast__alloc_i02__alloc_o12__JIT:1__JIT_cpu:1__index...hash:1435f2cc66833d3d_op.cc
2 852ns 426ns 376ns 476ns 161MB/s 80.6MB/s 21.1Mit/s
I'm confused about what these three files mean (different from broadcast + multiply + reduce), and is there a single fused C++ file? (In my mind, Jittor "fuses" meta-ops and produces only one piece of fused C++ code.)
There are two ways to do vectorization in Jittor. The first is to vectorize manually using the vectorize pass; the second is to let the underlying compiler vectorize automatically. In your example, Jittor turned on the optimization option that lets the underlying compiler do the vectorization automatically.
For your second question: Jittor's ops are all composed of meta ops, and Jittor fuses meta ops into several fused ops according to its fusion rules. In your matmul case, Jittor generates 6 meta ops; for the concrete implementation, refer to the class Linear in nn.py. Four of those meta ops are fused into one fused op, which is the first file in your output. The other two meta ops are of the same type, which is the third file in your output.
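The broadcast + multiply + reduce decomposition described above can be sketched in NumPy. This is only an analogue of the meta-op structure, not Jittor's actual implementation (for that, see Linear in nn.py):

```python
import numpy as np

c = np.float32([[1, 1, 1], [2, 2, 2]])                 # 2x3
d = np.float32([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])  # 3x4

# Broadcast both operands to shape (2, 3, 4), multiply elementwise,
# then reduce over the shared dimension -- the same three meta-op
# pattern (broadcast, binary multiply, reduce) that a fused matmul uses.
prod = c[:, :, None] * d[None, :, :]   # (2, 3, 4)
e = prod.sum(axis=1)                   # (2, 4)

assert np.allclose(e, c @ d)           # matches a direct matmul
```

When such a pattern is recognized, forwarding it to an optimized GEMM (as described below for mkl_matmul) is usually much faster than executing the broadcast and reduce explicitly.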
Since conv, gemm, and other algorithms are already highly optimized by hardware libraries, Jittor forwards such computations to them. In your case, the first fused op was identified as a matmul and forwarded to mkl_matmul for computation, which is the second file in your output.
Is there any flag, e.g. save-temps, to dump the C++ code?