Grid optimization - Chunk_Size optimization.

aspanday commented 1 year ago

This change only affects the optimizers, and only when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.

Updating chunk_size to 256 * 32 (8K), which was previously 2048 * 32 (64K). In addition, updating depth_to_max_blocks to 2560 (8x the previous value of 320).
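To make the arithmetic concrete, here is a rough sketch of how the smaller chunk size changes the number of chunks a tensor is split into. It assumes each tensor is covered by ceil(numel / chunk_size) fixed-size chunks; the helper name and the example tensor sizes are hypothetical and not taken from the apex sources. It does not model depth_to_max_blocks or kernel launch behavior.

```python
# Chunk sizes quoted in this PR description.
OLD_CHUNK_SIZE = 2048 * 32  # 65536 elements (64K)
NEW_CHUNK_SIZE = 256 * 32   # 8192 elements (8K)


def chunks_per_tensor(numel: int, chunk_size: int) -> int:
    """Chunks needed to cover `numel` elements (assumed ceiling division)."""
    return (numel + chunk_size - 1) // chunk_size


# Hypothetical tensor sizes: small, moderate, large.
example_sizes = [1_000, 100_000, 10_000_000]

for numel in example_sizes:
    old = chunks_per_tensor(numel, OLD_CHUNK_SIZE)
    new = chunks_per_tensor(numel, NEW_CHUNK_SIZE)
    print(f"numel={numel:>10}  old chunks={old:>5}  new chunks={new:>5}")
```

Under this assumption, a smaller chunk size splits the same tensor into more chunks, which gives more units of work to spread across the GPU.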

The observed performance improvement is up to 1.4x for large numbers of elements, up to 5.2x for moderate numbers of elements, and up to 1.44x for small numbers of elements.

The full set of performance results, along with a comparison against Torch, is captured here: https://amdcloud.sharepoint.com/:x:/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8

See sheet "chunk_opt".

All tests in test_fused_optimizers.py passed (6 skipped).

hubertlu-tw commented 1 year ago

jenkins: retest this please

hubertlu-tw commented 1 year ago

@aspanday It seems that test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed due to this PR. Do you know how to resolve it?

aspanday commented 1 year ago

The new updates resolve the test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) issue. I've also tested it against

python run_tests.py --include run_amp
python run_tests.py --include run_optimizers

All tests passed.

ROCm / apex

Grid optimization - Chunk_Size optimization. #104