Update 20220819:
For the `SparseToDense` op, we are able to lower it to mhlo operations. After lowering, we found that DISC does not support codegen for `mhlo::ScatterOp`. Adding full support for scatter will take a considerable amount of time, so we will put this on hold and continue with supporting `tf.embedding_column`.
For the `Unique` op in `tf.embedding_column`, we have noticed that grappler's `ArithmeticOptimizer` simplifies the `unique+gather+sparsesegmentxxx` pattern to `sparsesegmentxxx` (see the equivalence sketch below). We need to confirm whether we still need to support `Unique` for `tf.embedding_column`.
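That simplification is sound because gathering the unique ids and reducing with the remapped positions gives the same result as reducing with the original ids directly; a minimal check with public TF APIs on made-up data:

```python
import tensorflow as tf

# Made-up data: an embedding table and per-example lookup ids.
params = tf.random.uniform([10, 4])
ids = tf.constant([3, 3, 7, 1, 7])
segment_ids = tf.constant([0, 0, 0, 1, 1])

# unique + gather + sparse_segment_sum ...
y, idx = tf.unique(ids)  # y[idx[i]] == ids[i]
a = tf.sparse.segment_sum(tf.gather(params, y), idx, segment_ids)

# ... is equivalent to sparse_segment_sum on the original ids.
b = tf.sparse.segment_sum(params, ids, segment_ids)

tf.debugging.assert_near(a, b)
```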
Update 20220830:
We have decided to support lowering for `tf.embedding_column` related ops by directly emitting code for the following ops (a usage sketch of these ops follows below), so that DISC supports `tf.embedding_column`:
- `tf.SparseReshape`
- `tf.SparseFillEmptyRows`
- `tf.SparseSegmentMean`
- `tf.Where`

Following pengzhan's original PoC implementation, we will lower these 4 ops directly to the corresponding ops in the `mhlo_disc` and `lmhlo_disc` dialects, and do codegen for them in the `DiscLhloLegalizeRootsToParallelLoops` pass.

Latest TODO list for supporting `tf.embedding_column` on X86 devices:
- [x] codegen support for `tf.SparseReshape` (still needs a patch to support `-1` in `new_shape`)
- [x] codegen support for `tf.SparseFillEmptyRows`
- [x] codegen support for `tf.SparseSegmentMean` (needs some perf improvement)
- [x] codegen support for `tf.Where`
- [x] Support clustering for these 4 ops
- [ ] Benchmark popular models with feature columns
- [ ] Support multi-threading code generation for these ops
- [ ] Support fusion for these ops
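For reference, a small sketch of what these four ops compute, via their public TF wrappers; the data is made up, and this is roughly the pipeline `tf.embedding_column` builds internally:

```python
import tensorflow as tf

# Made-up ragged lookup ids as a sparse tensor of dense shape [3, 2].
sp = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [2, 0]],
                            values=[3, 7, 1], dense_shape=[3, 2])

# tf.SparseReshape: -1 infers the remaining dimension (here back to [3, 2]).
reshaped = tf.sparse.reshape(sp, [-1, 2])

# tf.SparseFillEmptyRows: row 1 has no values, so fill it with a default id.
filled, empty_mask = tf.sparse.fill_empty_rows(reshaped, default_value=0)

# tf.Where: recover the positions of the rows that were filled.
filled_rows = tf.where(empty_mask)

# tf.SparseSegmentMean: average the embedding rows per example.
table = tf.random.uniform([10, 4])
pooled = tf.sparse.segment_mean(table, filled.values, filled.indices[:, 0])
print(pooled.shape)  # (3, 4)
```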
Update 20221205:
After initial tests on EasyRec models, we outperform the TensorFlow baseline; however, comparing codegen perf with hand-written kernels, a gap remains. Thus, we update the ongoing items as follows:
- Replace `scf.for` with `scf.parallel` in the current codegen impl
- Fusion (including output fusion) for `mhlo_disc.where`
- `mhlo.sparse_reshape`, which may work as a normal `reshape` op (see the index-remapping sketch below)
- `avx256` or `avx512` to speed up kernel perf
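A minimal numpy sketch of that "normal reshape" view of sparse reshape, assuming row-major layout; `sparse_reshape_2d` is a hypothetical helper, not a DISC API:

```python
import numpy as np

# Reshape a 2-D sparse tensor by linearizing its coordinates with the old
# shape and delinearizing with the new one: a plain index transform, no
# data movement.
def sparse_reshape_2d(indices, old_shape, new_shape):
    new_shape = np.array(new_shape, dtype=np.int64)
    total = int(np.prod(old_shape))
    if (new_shape == -1).any():  # infer the -1 dim, as tf.sparse.reshape does
        known = int(np.prod(new_shape[new_shape != -1]))
        new_shape[new_shape == -1] = total // known
    linear = indices[:, 0] * old_shape[1] + indices[:, 1]
    new_idx = np.stack([linear // new_shape[1], linear % new_shape[1]], axis=1)
    return new_idx, new_shape

new_idx, shape = sparse_reshape_2d(np.array([[0, 0], [2, 1]]), (3, 2), (-1, 3))
print(new_idx.tolist(), shape.tolist())  # [[0, 0], [1, 2]] [2, 3]
```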
Update 20221223:
Recently, we have done a PoC of output fusion for the `lmhlo_disc.where` op. The entire PoC consists of several parts of code change that we intend to split into the following pieces for committing to the main branch:
- `mhlo.sparse_reshape` for 2-D sparse tensor reshape
- Rewrite of `mhlo.real_dynamic_slice + mhlo.gather` to `mhlo.gather + mhlo.real_dynamic_slice` for the `mhlo.real_dynamic_slice` coming from the `where` op (see the equivalence check below)
- `SparseOpCpuFusionStrategy` to support output fusion
- `lmhlo.dynamic_reshape` / `lmhlo.dynamic_gather` / `lmhlo_disc.where` op's output fusion
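The reorder relies on slicing along axis 0 commuting with a gather over the indices; a minimal check with the TF-level equivalents on made-up data:

```python
import tensorflow as tf

params = tf.random.uniform([10, 4])
idx = tf.constant([3, 7, 1, 7, 2])

# Slicing after the gather ...
a = tf.gather(params, idx)[1:4]
# ... equals gathering with the sliced indices, so the two ops can be
# swapped without changing the result.
b = tf.gather(params, idx[1:4])

tf.debugging.assert_near(a, b)
```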
Update 20230302:
Output fusion for `lmhlo_disc.where` did not bring enough perf enhancement on the EasyRec model. After profiling and detailed analysis, we found that `lmhlo_disc.sparse_segment_reduction` carries a lot of the computation for `embedding_column`. We have done a series of optimizations since then.

Perf is tested on a bare-metal server used only by myself, with 128 * Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz:

| opt | latency (ms) | speed-up |
|---|---|---|
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |

We have achieved a 1.21x speed-up over TF's baseline; however, we still have an 18.5% gap with hand-write-fusion-opt.
The optimization items (a fused-semantics sketch follows the list):
- `sparse_segment_reduction` + `sparse_fill_empty_rows` rewritten to `sparse_segment_reduction_with_empty_rows` (only works for the inference case)
- `sparse_segment_reduction_with_empty_rows` with fusion type `kSparseReduction`
- `sparse_segment_reduction` for possible output fusions in `kSparseReduction`
- `mhlo_disc.sparse_segment_reduction` to support both `tf.sparse_segment_mean` and `tf.sparse_segment_sum`
- `sparse_segment_reduction` + `sparse_fill_empty_rows` rewrite to `sparse_segment_reduction_with_empty_rows` using pdll
- `lmhlo_disc.sparse_segment_reduction_with_empty_rows` as root node of a `kSparseReduction` fusion, covering `lmhlo.dynamic_reshape`, `lmhlo.dynamic_broadcast_in_dim`, and `lmhlo.select`
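A minimal numpy sketch of the fused inference-time semantics assumed here for the mean case (the helper name mirrors the op, but its exact signature is a guess): reduce the rows that exist and emit the fill value for empty segments, without first materializing the output of `sparse_fill_empty_rows`:

```python
import numpy as np

# Hypothetical fused semantics: segment-mean the rows that exist, then write
# the fill value into segments that had no rows.
def sparse_segment_mean_with_empty_rows(data, indices, segment_ids,
                                        num_segments, fill_value=0.0):
    acc = np.zeros((num_segments, data.shape[1]), dtype=data.dtype)
    counts = np.zeros(num_segments, dtype=np.int64)
    for row, seg in zip(indices, segment_ids):
        acc[seg] += data[row]
        counts[seg] += 1
    out = np.full_like(acc, fill_value)
    present = counts > 0
    out[present] = acc[present] / counts[present][:, None]
    return out

table = np.arange(8.0).reshape(4, 2)
# Segments 1 and 2 are empty and get the fill value row.
print(sparse_segment_mean_with_empty_rows(table, [0, 2], [0, 0], 3))
```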
Why we want to support `tf.feature_column`s
We have found plenty of `tf.feature_column` ops in many industrial models, such as CTR models. Normally the `tf.feature_column` ops are not *computation intensive* but are *memory intensive*, and they also consume a lot of CPU resources due to the runtime costs these kinds of ops bring. `tf.feature_column`s are observed to take a large portion of e2e time and CPU resources. Currently we have a pattern-match based impl with custom passes and ops to optimize `tf.feature_column` ops. However, these patterns need to be updated frequently because users all have their own custom impls when combining `tf.feature_column` ops. Thus, supporting `tf.feature_column`s on the compiler side will give us the ability to handle a large variety of users' customized `tf.feature_column` combinations, for example the one sketched below.
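A minimal illustration of such a user combination (column names and sizes are made up); the embedding lookup in this layer is roughly what emits the `SparseReshape` / `SparseFillEmptyRows` / `SparseSegmentMean` / `Where` ops discussed above:

```python
import tensorflow as tf

# Made-up feature column combination of the kind users compose freely.
cat = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=1000)
emb = tf.feature_column.embedding_column(cat, dimension=8)
layer = tf.keras.layers.DenseFeatures([emb])

features = {"user_id": tf.constant([["u1"], ["u2"], ["u3"]])}
print(layer(features).shape)  # (3, 8)
```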
Long-term Road Map
- Support lowering and codegen for `tf.feature_column` related ops
- `tf.feature_column` related kernels: do some specific optimization with `avx512` or so on

Short-term Road Map
Support the following 2 feature columns first, then we … (20220830: since we have major changes to the original plans, some tasks are disabled for now.)