ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Faster tests #23

Closed jlamypoirier closed 4 weeks ago

jlamypoirier commented 4 weeks ago

✨ Description

Testing time is a problem. This makes the tests faster.

Also fixes the tests in test_mb_seq_first.py that were incorrectly skipped (counterproductive for test time, but needed).

Total test time went from more than 15 minutes to ~7 minutes, or 30 seconds with --skip-slow.
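
For readers unfamiliar with this kind of flag, here is a minimal sketch of how a `--skip-slow` option could be wired up in a `conftest.py`. It is not taken from this PR: the `slow` marker name and the choice to deselect (rather than skip) marked tests are assumptions made for illustration, and the actual Fast-LLM implementation may differ.

```python
# conftest.py (illustrative sketch, not the code from this PR)


def pytest_addoption(parser):
    # Register the command-line flag; it is off by default.
    parser.addoption(
        "--skip-slow",
        action="store_true",
        default=False,
        help="Do not run tests marked as slow.",
    )


def pytest_configure(config):
    # Declare the (assumed) "slow" marker so pytest does not warn about it.
    config.addinivalue_line("markers", "slow: test takes a long time to run")


def pytest_collection_modifyitems(config, items):
    # Without the flag, leave the collected tests untouched.
    if not config.getoption("--skip-slow"):
        return
    selected, deselected = [], []
    for item in items:
        if item.get_closest_marker("slow") is not None:
            deselected.append(item)
        else:
            selected.append(item)
    if deselected:
        # Report the removed tests to pytest, then shrink the list in place.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

With this sketch, a test opts in through `@pytest.mark.slow`, and `pytest --skip-slow` reports those tests as deselected instead of running them.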

Details

Fewer processes and smaller models help a lot, but some slow tests remain, mainly the Megatron and multi-GPU tests.

Breakdown of slower tests:

41.13s call     tests/test_mb.py::test_model_dp2_tp2_pp2s2_bf4
23.27s call     tests/test_mb_seq_first.py::test_model_dp2_sp2_df4
22.37s call     tests/test_match_megatron.py::test_mistral_meg
22.36s call     tests/test_mb.py::test_model_pp2s1_bf4
21.66s call     tests/test_mb.py::test_model_pp2s2_bf4
20.45s call     tests/test_match_megatron.py::test_gpt2_meg
20.04s call     tests/test_simple.py::test_model_dp2
19.66s call     tests/test_match_megatron.py::test_mixtral_meg
19.57s call     tests/test_mb.py::test_model_df4_z3
19.38s call     tests/test_ms.py::test_model_pp2s2_ms256
17.75s call     tests/test_seq_first.py::test_model_sp2
17.35s call     tests/test_simple.py::test_model_tp2
17.33s call     tests/test_simple.py::test_model_dp2_z2
17.04s call     tests/test_seq_first.py::test_model_sp2_ce4
17.01s call     tests/test_checkpoint.py::test_load_pretrained_distributed_in_dp2
16.96s call     tests/test_checkpoint.py::test_load_pretrained_state_dict_in_dp2
16.84s call     tests/test_checkpoint.py::test_load_pretrained_huggingface_in_dp2
16.65s call     tests/test_simple.py::test_model_dp2_z3
8.06s call     tests/test_checkpoint.py::test_checkpoint_and_eval
6.33s call     tests/test_functional.py::test_dropless_mlp
5.62s call     tests/test_match_megatron.py::test_mixtral_match_meg
2.80s call     tests/test_simple.py::test_model
1.67s call     tests/test_checkpoint.py::test_resume
[...]
1 failed, 94 passed, 6 skipped, 7 warnings in 405.31s (0:06:45)
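
The listing above has the shape of what pytest prints with its `--durations` option. To see which test files dominate the remaining time, the lines can be summed per file. The following is a small sketch of that; the function name `total_by_file` and the regular expression are illustrative assumptions, and the sample only reuses three lines from the breakdown.

```python
import re
from collections import defaultdict

# Matches lines such as "41.13s call     tests/test_mb.py::test_name".
DURATION_LINE = re.compile(
    r"^\s*(?P<seconds>\d+(?:\.\d+)?)s\s+(?P<phase>setup|call|teardown)\s+(?P<test_id>\S+)"
)


def total_by_file(report: str) -> dict[str, float]:
    """Sum the reported durations per test file, slowest file first."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = DURATION_LINE.match(line)
        if match is None:
            continue  # Ignore summary lines and elisions such as "[...]".
        path = match["test_id"].split("::", 1)[0]
        totals[path] += float(match["seconds"])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


sample = """\
41.13s call     tests/test_mb.py::test_model_dp2_tp2_pp2s2_bf4
22.37s call     tests/test_match_megatron.py::test_mistral_meg
22.36s call     tests/test_mb.py::test_model_pp2s1_bf4
[...]
"""

for path, seconds in total_by_file(sample).items():
    print(f"{seconds:8.2f}s  {path}")
```

Grouping by file rather than by test makes it easier to spot whole suites, such as the multi-GPU ones, that are worth marking as slow or shrinking further.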