perf(dipu): faster aten::mul in cuda & muxi

DeepLink-org / deeplink.framework

BSD 3-Clause "New" or "Revised" License


perf(dipu): faster aten::mul in cuda & muxi #855

Closed Wrench-Git closed 2 months ago

Wrench-Git commented 2 months ago

This change removes a redundant transformation, mul_tensor -> mul_scalar -> mul_tensor, and adds a faster BinaryOpInferrer. With both changes, aten::mul is almost as fast as torch in average CPU time.
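To illustrate why dropping the round trip helps, here is a toy model in plain Python. It is not DIPU or PyTorch code: the function names, the list-based "tensors", and the assumption that the redirect is triggered by a one-element operand are all hypothetical, chosen only to reproduce the mul_tensor -> mul_scalar -> mul_tensor call chain described above.

```python
# Toy model only: hypothetical stand-ins, not the actual DIPU/PyTorch dispatch code.
trace = []  # records each dispatch hop so the two paths can be compared


def mul_tensor_old(a, b):
    """Old path: a one-element operand is redirected to the scalar overload."""
    trace.append("mul_tensor")
    if len(b) == 1 and len(a) > 1:
        return mul_scalar_old(a, b[0])
    return [x * y for x, y in zip(a, b)]


def mul_scalar_old(a, s):
    """Scalar overload that wraps the scalar again and re-enters mul_tensor."""
    trace.append("mul_scalar")
    return mul_tensor_old(a, [s] * len(a))


def mul_tensor_new(a, b):
    """New path: broadcast in place and compute directly, with no extra hops."""
    trace.append("mul_tensor")
    if len(b) == 1:
        b = b * len(a)
    return [x * y for x, y in zip(a, b)]


trace.clear()
old_result = mul_tensor_old([1, 2, 3], [2])
old_trace = list(trace)

trace.clear()
new_result = mul_tensor_new([1, 2, 3], [2])
new_trace = list(trace)

print(old_trace)  # ['mul_tensor', 'mul_scalar', 'mul_tensor']
print(new_trace)  # ['mul_tensor']
```

Both paths return the same result. The difference is the two extra hops, which in a real dispatcher would each add per-call CPU overhead.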