Closed liqiangxl closed 1 year ago
Actually, do we want to do the welford translation if the initial scheduler is not the persistent scheduler? If it's originally just the reduction scheduler, making it to persistent may result in lower performance. Can you do some benchmarking?
Actually, do we want to do the welford translation if the initial scheduler is not the persistent scheduler? If it's originally just the reduction scheduler, making it to persistent may result in lower performance. Can you do some benchmarking?
I tested this var_mean and there is no performance change whether we translate or not. In either case, we only need to read the data from gmem once.
Actually, do we want to do the welford translation if the initial scheduler is not the persistent scheduler? If it's originally just the reduction scheduler, making it to persistent may result in lower performance. Can you do some benchmarking?
I tested this var_mean and there is no performance change whether we translate or not. In either case, we only need to read the data from gmem once.
Thanks for the testing. Then let's not do the translation. It's just an additional lowering overhead.
Fixes #2486 reduction heuristic was used while the correct one should be persistent.