Closed jlamypoirier closed 4 weeks ago
✨ Description
Testing time is a problem. This PR makes the tests faster and adds a --skip-slow option.
It also fixes tests in test_mb_seq_first.py that were incorrectly skipped (counter-productive for test time, but needed).
Total test time went from >15 minutes to ~7 minutes, or 30 seconds with --skip-slow.
Known issues:
- test_triton_cross_entropy fails often.
- test_checkpoint_and_eval is suspiciously slow.
- test_model_dp2_sp2_pp2s1 isn't working.
🔍 Type of change
Select all that apply:
Details
Fewer processes and smaller models help a lot but leave some offending tests, mainly the Megatron and multi-GPU tests.
Breakdown of slower tests:
41.13s call tests/test_mb.py::test_model_dp2_tp2_pp2s2_bf4
23.27s call tests/test_mb_seq_first.py::test_model_dp2_sp2_df4
22.37s call tests/test_match_megatron.py::test_mistral_meg
22.36s call tests/test_mb.py::test_model_pp2s1_bf4
21.66s call tests/test_mb.py::test_model_pp2s2_bf4
20.45s call tests/test_match_megatron.py::test_gpt2_meg
20.04s call tests/test_simple.py::test_model_dp2
19.66s call tests/test_match_megatron.py::test_mixtral_meg
19.57s call tests/test_mb.py::test_model_df4_z3
19.38s call tests/test_ms.py::test_model_pp2s2_ms256
17.75s call tests/test_seq_first.py::test_model_sp2
17.35s call tests/test_simple.py::test_model_tp2
17.33s call tests/test_simple.py::test_model_dp2_z2
17.04s call tests/test_seq_first.py::test_model_sp2_ce4
17.01s call tests/test_checkpoint.py::test_load_pretrained_distributed_in_dp2
16.96s call tests/test_checkpoint.py::test_load_pretrained_state_dict_in_dp2
16.84s call tests/test_checkpoint.py::test_load_pretrained_huggingface_in_dp2
16.65s call tests/test_simple.py::test_model_dp2_z3
8.06s call tests/test_checkpoint.py::test_checkpoint_and_eval
6.33s call tests/test_functional.py::test_dropless_mlp
5.62s call tests/test_match_megatron.py::test_mixtral_match_meg
2.80s call tests/test_simple.py::test_model
1.67s call tests/test_checkpoint.py::test_resume
[...]
1 failed, 94 passed, 6 skipped, 7 warnings in 405.31s (0:06:45)
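For readers unfamiliar with the pattern, here is a minimal sketch of how a --skip-slow flag can be wired into pytest through a conftest.py. It is not taken from this PR: the marker name `slow`, the help text, and the skip reason are assumptions chosen for illustration, and the actual implementation may differ.

```python
# conftest.py (hypothetical sketch, not the project's actual implementation)
import pytest


def pytest_addoption(parser):
    # Register the flag; it is off by default, so a plain `pytest` run is unchanged.
    parser.addoption(
        "--skip-slow",
        action="store_true",
        default=False,
        help="skip tests marked as slow",
    )


def pytest_configure(config):
    # Declare the (assumed) `slow` marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: mark a test as slow to run")


def pytest_collection_modifyitems(config, items):
    if not config.getoption("--skip-slow"):
        return  # Flag not given: run everything.
    skip_slow = pytest.mark.skip(reason="skipped by --skip-slow")
    for item in items:
        # Only tests carrying the `slow` marker are skipped.
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

Under these assumptions, a slow test would be decorated with `@pytest.mark.slow` and the quick run invoked as `pytest --skip-slow`. pytest's built-in `--durations=N` option reports the N slowest tests, which can help decide which ones deserve the marker.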