Update 20220819:
For the `SparseToDense` op, we are able to lower it to mhlo operations. After lowering, we found that DISC does not support codegen for `mhlo::ScatterOp`. Adding full support for scatter will take a considerable amount of time, so we will put this on hold and continue with supporting `tf.embedding_column`.
For the `Unique` op in `tf.embedding_column`, we have noticed that grappler's `ArithmeticOptimizer` simplifies the `unique+gather+sparsesegmentxxx` pattern to `sparsesegmentxxx` (see the equivalence sketch below). We need to confirm whether we still need to support `Unique` for `tf.embedding_column`.
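That simplification is sound because gathering the unique ids and reducing with the remapped positions gives the same result as reducing with the original ids directly; a minimal check with public TF APIs on made-up data:

```python
import tensorflow as tf

# Made-up data: an embedding table and per-example lookup ids.
params = tf.random.uniform([10, 4])
ids = tf.constant([3, 3, 7, 1, 7])
segment_ids = tf.constant([0, 0, 0, 1, 1])

# unique + gather + sparse_segment_sum ...
y, idx = tf.unique(ids)  # y[idx[i]] == ids[i]
a = tf.sparse.segment_sum(tf.gather(params, y), idx, segment_ids)

# ... is equivalent to sparse_segment_sum on the original ids.
b = tf.sparse.segment_sum(params, ids, segment_ids)

tf.debugging.assert_near(a, b)
```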
Update 20220830:
We have decided to support lowering for `tf.embedding_column` related ops by directly emitting code for the following ops (a usage sketch of these ops follows below), so that DISC supports `tf.embedding_column`:
- `tf.SparseReshape`
- `tf.SparseFillEmptyRows`
- `tf.SparseSegmentMean`
- `tf.Where`

Following pengzhan's original PoC implementation, we will lower these 4 ops directly to the corresponding ops in the `mhlo_disc` and `lmhlo_disc` dialects, and do codegen for them in the `DiscLhloLegalizeRootsToParallelLoops` pass.

Latest TODO list for supporting `tf.embedding_column` on X86 devices:
- [x] codegen support for `tf.SparseReshape` (still needs a patch to support `-1` in `new_shape`)
- [x] codegen support for `tf.SparseFillEmptyRows`
- [x] codegen support for `tf.SparseSegmentMean` (needs some perf improvement)
- [x] codegen support for `tf.Where`
- [x] Support clustering for these 4 ops
- [ ] Benchmark popular models with feature columns
- [ ] Support multi-threading code generation for these ops
- [ ] Support fusion for these ops
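For reference, a small sketch of what these four ops compute, via their public TF wrappers; the data is made up, and this is roughly the pipeline `tf.embedding_column` builds internally:

```python
import tensorflow as tf

# Made-up ragged lookup ids as a sparse tensor of dense shape [3, 2].
sp = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [2, 0]],
                            values=[3, 7, 1], dense_shape=[3, 2])

# tf.SparseReshape: -1 infers the remaining dimension (here back to [3, 2]).
reshaped = tf.sparse.reshape(sp, [-1, 2])

# tf.SparseFillEmptyRows: row 1 has no values, so fill it with a default id.
filled, empty_mask = tf.sparse.fill_empty_rows(reshaped, default_value=0)

# tf.Where: recover the positions of the rows that were filled.
filled_rows = tf.where(empty_mask)

# tf.SparseSegmentMean: average the embedding rows per example.
table = tf.random.uniform([10, 4])
pooled = tf.sparse.segment_mean(table, filled.values, filled.indices[:, 0])
print(pooled.shape)  # (3, 4)
```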
Update 20221205:
After initial tests on EasyRec models, we outperform the TensorFlow baseline; however, comparing codegen perf with hand-written kernels, a gap remains. Thus, we update the ongoing items as follows:
- Replace `scf.for` with `scf.parallel` in the current codegen impl
- Fusion (including output fusion) for `mhlo_disc.where`
- `mhlo.sparse_reshape`, which may work as a normal `reshape` op (see the index-remapping sketch below)
- `avx256` or `avx512` to speed up kernel perf
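A minimal numpy sketch of that "normal reshape" view of sparse reshape, assuming row-major layout; `sparse_reshape_2d` is a hypothetical helper, not a DISC API:

```python
import numpy as np

# Reshape a 2-D sparse tensor by linearizing its coordinates with the old
# shape and delinearizing with the new one: a plain index transform, no
# data movement.
def sparse_reshape_2d(indices, old_shape, new_shape):
    new_shape = np.array(new_shape, dtype=np.int64)
    total = int(np.prod(old_shape))
    if (new_shape == -1).any():  # infer the -1 dim, as tf.sparse.reshape does
        known = int(np.prod(new_shape[new_shape != -1]))
        new_shape[new_shape == -1] = total // known
    linear = indices[:, 0] * old_shape[1] + indices[:, 1]
    new_idx = np.stack([linear // new_shape[1], linear % new_shape[1]], axis=1)
    return new_idx, new_shape

new_idx, shape = sparse_reshape_2d(np.array([[0, 0], [2, 1]]), (3, 2), (-1, 3))
print(new_idx.tolist(), shape.tolist())  # [[0, 0], [1, 2]] [2, 3]
```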
Update 20221223:
Recently, we have done a PoC of output fusion for the `lmhlo_disc.where` op. The entire PoC consists of several parts of code change that we intend to split into the following pieces for committing to the main branch:
- `mhlo.sparse_reshape` for 2-D sparse tensor reshape
- Rewrite of `mhlo.real_dynamic_slice + mhlo.gather` to `mhlo.gather + mhlo.real_dynamic_slice` for the `mhlo.real_dynamic_slice` coming from the `where` op (see the equivalence check below)
- `SparseOpCpuFusionStrategy` to support output fusion
- `lmhlo.dynamic_reshape` / `lmhlo.dynamic_gather` / `lmhlo_disc.where` op's output fusion
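The reorder relies on slicing along axis 0 commuting with a gather over the indices; a minimal check with the TF-level equivalents on made-up data:

```python
import tensorflow as tf

params = tf.random.uniform([10, 4])
idx = tf.constant([3, 7, 1, 7, 2])

# Slicing after the gather ...
a = tf.gather(params, idx)[1:4]
# ... equals gathering with the sliced indices, so the two ops can be
# swapped without changing the result.
b = tf.gather(params, idx[1:4])

tf.debugging.assert_near(a, b)
```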
Update 20230302:
Output fusion for `lmhlo_disc.where` did not bring enough perf enhancement on the EasyRec model. After profiling and detailed analysis, we found that `lmhlo_disc.sparse_segment_reduction` carries a lot of the computation for `embedding_column`. We have done a series of optimizations since then.

Perf is tested on a bare-metal server used only by myself, with 128 * Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz:

| opt | latency (ms) | speed-up |
|---|---|---|
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |

We have achieved a 1.21x speed-up over TF's baseline; however, we still have an 18.5% gap with hand-write-fusion-opt.
The optimization items (a fused-semantics sketch follows the list):
- `sparse_segment_reduction` + `sparse_fill_empty_rows` rewritten to `sparse_segment_reduction_with_empty_rows` (only works for the inference case)
- `sparse_segment_reduction_with_empty_rows` with fusion type `kSparseReduction`
- `sparse_segment_reduction` for possible output fusions in `kSparseReduction`
- `mhlo_disc.sparse_segment_reduction` to support both `tf.sparse_segment_mean` and `tf.sparse_segment_sum`
- `sparse_segment_reduction` + `sparse_fill_empty_rows` rewrite to `sparse_segment_reduction_with_empty_rows` using pdll
- `lmhlo_disc.sparse_segment_reduction_with_empty_rows` as root node of a `kSparseReduction` fusion, covering `lmhlo.dynamic_reshape`, `lmhlo.dynamic_broadcast_in_dim`, and `lmhlo.select`
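A minimal numpy sketch of the fused inference-time semantics assumed here for the mean case (the helper name mirrors the op, but its exact signature is a guess): reduce the rows that exist and emit the fill value for empty segments, without first materializing the output of `sparse_fill_empty_rows`:

```python
import numpy as np

# Hypothetical fused semantics: segment-mean the rows that exist, then write
# the fill value into segments that had no rows.
def sparse_segment_mean_with_empty_rows(data, indices, segment_ids,
                                        num_segments, fill_value=0.0):
    acc = np.zeros((num_segments, data.shape[1]), dtype=data.dtype)
    counts = np.zeros(num_segments, dtype=np.int64)
    for row, seg in zip(indices, segment_ids):
        acc[seg] += data[row]
        counts[seg] += 1
    out = np.full_like(acc, fill_value)
    present = counts > 0
    out[present] = acc[present] / counts[present][:, None]
    return out

table = np.arange(8.0).reshape(4, 2)
# Segments 1 and 2 are empty and get the fill value row.
print(sparse_segment_mean_with_empty_rows(table, [0, 2], [0, 0], 3))
```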
Why we want to support `tf.feature_column`s
We have found plenty of `tf.feature_column` ops in many industrial models, such as CTR models. Normally the `tf.feature_column` ops are not *computation intensive* but are *memory intensive*, and they also consume a lot of CPU resources due to the runtime costs these kinds of ops bring. `tf.feature_column`s are observed to take a large portion of e2e time and CPU resources. Currently we have a pattern-match based impl with custom passes and ops to optimize `tf.feature_column` ops. However, these patterns need to be updated frequently because users all have their own custom impls when combining `tf.feature_column` ops. Thus, supporting `tf.feature_column`s on the compiler side will give us the ability to handle a large variety of users' customized `tf.feature_column` combinations, for example the one sketched below.
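A minimal illustration of such a user combination (column names and sizes are made up); the embedding lookup in this layer is roughly what emits the `SparseReshape` / `SparseFillEmptyRows` / `SparseSegmentMean` / `Where` ops discussed above:

```python
import tensorflow as tf

# Made-up feature column combination of the kind users compose freely.
cat = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=1000)
emb = tf.feature_column.embedding_column(cat, dimension=8)
layer = tf.keras.layers.DenseFeatures([emb])

features = {"user_id": tf.constant([["u1"], ["u2"], ["u3"]])}
print(layer(features).shape)  # (3, 8)
```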
Long-term Road Map
- Support lowering and codegen for `tf.feature_column` related ops
- `tf.feature_column` related kernels: do some specific optimization with `avx512` or so on

Short-term Road Map
Support the following 2 feature columns first, then we … (20220830: since we have major changes to the original plans, some tasks are disabled for now.)