alibaba / BladeDISC

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Apache License 2.0

Support for `tf.feature_column`s in disc #510

Open Orion34-lanbo opened 2 years ago

Orion34-lanbo commented 2 years ago

Why we want to support tf.feature_columns

We have found tf.feature_column ops in many industrial models, such as CTR models. Normally tf.feature_column ops are not compute-intensive, but they are memory-intensive and also consume a lot of CPU resources due to the runtime overhead these kinds of ops bring.

tf.feature_columns are observed to take a large portion of e2e time and CPU resources. Currently we have a pattern-match based implementation with custom passes and ops to optimize tf.feature_column ops. However, these patterns need to be updated frequently because each user has their own custom implementation when combining tf.feature_column ops. Supporting tf.feature_columns on the compiler side will therefore give us the ability to handle a large variety of users' customized tf.feature_column combinations.

Long-term Road Map

Short-term Road Map

Support the following 2 feature columns first (a usage sketch follows this list). Update 20220830: since there have been major changes to the original plan, some tasks are disabled for now.

- [ ] Support `bucketized_column`
- [ ] Support [indicator_column](https://www.tensorflow.org/api_docs/python/tf/feature_column/indicator_column)
- [ ] Support clustering for the ops related to these 2 types of feature columns
- [ ] Profiling to find potential perf issues
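
For concreteness, below is a minimal sketch (assuming the TF 2.x Python API; feature names, boundaries, and shapes are made up for illustration) of how users typically combine these two column types:

```python
import tensorflow as tf

# numeric feature bucketized into 4 one-hot buckets (emits Bucketize-style ops)
price = tf.feature_column.numeric_column("price")
price_bucket = tf.feature_column.bucketized_column(
    price, boundaries=[10.0, 50.0, 100.0])

# categorical feature densified into a multi-hot vector by indicator_column
category = tf.feature_column.categorical_column_with_identity(
    "category_id", num_buckets=1000)
category_indicator = tf.feature_column.indicator_column(category)

inputs = {
    "price": tf.constant([[5.0], [75.0]]),
    "category_id": tf.constant([[3], [42]], dtype=tf.int64),
}
dense = tf.keras.layers.DenseFeatures([price_bucket, category_indicator])(inputs)
print(dense.shape)  # (2, 1004): 4 bucket slots + 1000 indicator slots
```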
Orion34-lanbo commented 2 years ago

Update 20220819: For the SparseToDense op, we are able to lower it to mhlo operations. After lowering, we found that DISC does not support codegen for mhlo::ScatterOp. Adding full support for scatter will take a considerable amount of time, so we will put this on hold and continue with supporting tf.embedding_column.
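
For reference, here is a small sketch (TF 2.x eager mode; values are illustrative) of the scatter semantics behind SparseToDense, which is why the lowering ends up needing mhlo::ScatterOp codegen:

```python
import tensorflow as tf

indices = tf.constant([[0, 1], [2, 3]], dtype=tf.int64)
values = tf.constant([1.0, 2.0])
dense_shape = tf.constant([3, 4], dtype=tf.int64)

# SparseToDense with a zero default value...
a = tf.raw_ops.SparseToDense(
    sparse_indices=indices,
    output_shape=dense_shape,
    sparse_values=values,
    default_value=0.0)

# ...is equivalent to a scatter into a zero-initialized dense output.
b = tf.scatter_nd(indices, values, dense_shape)

print(tf.reduce_all(tf.equal(a, b)).numpy())  # True
```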

Orion34-lanbo commented 2 years ago

For the Unique op in tf.embedding_column, we have noticed that grappler's ArithmeticOptimizer simplifies the unique + gather + sparse_segment_xxx pattern into a plain sparse_segment_xxx. We need to confirm whether we still need to support Unique for tf.embedding_column.
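
To make the question concrete, here is a small sketch (TF 2.x; ids and shapes are illustrative) of the original pattern and the simplified form grappler's rewrite produces, which give the same result:

```python
import tensorflow as tf

params = tf.random.normal([100, 8])          # embedding table
ids = tf.constant([5, 7, 5, 9])              # lookup ids (with duplicates)
segment_ids = tf.constant([0, 0, 1, 1])      # which output row each id belongs to

# original pattern: unique + gather + sparse_segment_mean
unique_ids, unique_idx = tf.unique(ids)
with_unique = tf.sparse.segment_mean(
    tf.gather(params, unique_ids), unique_idx, segment_ids)

# simplified form: sparse_segment_mean reads the table directly
without_unique = tf.sparse.segment_mean(params, ids, segment_ids)

print(tf.reduce_max(tf.abs(with_unique - without_unique)).numpy())  # ~0.0
```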

Orion34-lanbo commented 1 year ago

Update 20220830: We have decided to support lowering for tf.embedding_column related ops by directly emitting code for the following ops, so that DISC supports tf.embedding_column:

- `tf.SparseReshape`
- `tf.SparseFillEmptyRows`
- `tf.SparseSegmentMean`
- `tf.Where`

Following pengzhan's original PoC implementation, we will lower these 4 ops directly to the corresponding ops in the mhlo_disc and lmhlo_disc dialects, and do codegen for them in the DiscLhloLegalizeRootsToParallelLoops pass.
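
For context, an illustrative sketch (TF 2.x Python API; this is not the exact graph tf.feature_column emits, just the public-API equivalents of the ops above) of the kind of sparse pipeline these four ops come from:

```python
import tensorflow as tf

table = tf.random.normal([1000, 16])                       # embedding table
sp = tf.SparseTensor(indices=[[0, 0], [0, 1], [2, 0]],
                     values=tf.constant([3, 42, 7], tf.int64),
                     dense_shape=[3, 2])

sp = tf.sparse.reshape(sp, [-1, 2])                        # -> tf.SparseReshape
sp, is_row_empty = tf.sparse.fill_empty_rows(sp, 0)        # -> tf.SparseFillEmptyRows
empty_rows = tf.where(is_row_empty)                        # -> tf.Where (dynamic output shape)

segment_ids = tf.cast(sp.indices[:, 0], tf.int32)
embeddings = tf.sparse.segment_mean(                       # -> tf.SparseSegmentMean
    tf.gather(table, sp.values),
    tf.range(tf.size(sp.values)),
    segment_ids)
print(embeddings.shape, empty_rows.shape)                  # (3, 16) (1, 1)
```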

Latest TODO list for supporting tf.embedding_column on X86 device:

- [x] Codegen support for `tf.SparseReshape` (still needs a patch to support -1 in new_shape; see the sketch after this list)
- [x] Codegen support for `tf.SparseFillEmptyRows`
- [x] Codegen support for `tf.SparseSegmentMean` (needs some perf improvement)
- [x] Codegen support for `tf.Where`
- [x] Support clustering for these 4 ops
- [ ] Benchmark popular models with feature columns
- [ ] Support multi-threading code generation for these ops
- [ ] Support fusion for these ops
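
For reference on the remaining SparseReshape patch, a tiny sketch (TF 2.x; values are illustrative) of the -1 semantics the codegen needs to reproduce:

```python
import tensorflow as tf

sp = tf.SparseTensor(indices=[[0, 0], [1, 2], [2, 3]],
                     values=[1.0, 2.0, 3.0],
                     dense_shape=[3, 4])

# -1 is inferred as total_elements / product(known dims) = 3*4 / 2 = 6,
# the same rule as dense tf.reshape
reshaped = tf.sparse.reshape(sp, [-1, 2])
print(reshaped.dense_shape.numpy())  # [6 2]
```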

Update 20221205: After initial tests on EasyRec models, we outperform the TensorFlow baseline; however, comparing codegen perf with hand-written kernels, a gap remains. Thus, we need to update the ongoing items as follows:

Orion34-lanbo commented 1 year ago

Update 20221223: Recently, we have done a PoC of output fusion for the lmhlo_disc.where op. The entire PoC consists of several code changes that we intend to split into the following pieces for commit to the main branch.

Orion34-lanbo commented 1 year ago

Update 20230302: Output fusion for lmhlo_disc.where did not bring enough perf improvement on the EasyRec model. After profiling and detailed analysis, we found that lmhlo_disc.sparse_segment_reduction carries a lot of the computation for embedding_column. We have done a series of optimizations since then.

Latest perf results

Perf is tested on a bare-metal server used only by myself, with 128 * Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz.

| opt | latency (ms) | speed-up |
| --- | --- | --- |
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |

We have achieved a 1.21x speed-up over the TF baseline; however, we still have an 18.5% gap compared with hand-write-fusion-opt.

Optimization PoCs

Items for code merge PRs