✨ Description

Testing time is a problem. This PR makes the tests faster.

Use 0 data-loader workers, because spawning processes is slow (except in one test, so we still test with a non-trivial data loader).
Run the single-GPU tests directly in the pytest process. They go from >10 s to <500 ms.
Make the test model smaller (~130 M -> 6 M parameters).
Skip some less important Megatron tests (sc1, sc2).
Mark some tests as slow (Megatron, multi-GPU, test_dropless_mlp) and add a --skip-slow option.
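For readers unfamiliar with the pattern, marking tests as slow and skipping them with a command-line flag is commonly done in `conftest.py` with standard pytest hooks. The following is a minimal sketch of that general pattern, not necessarily how this PR implements it: the `slow` marker name and the help and skip-reason strings are assumptions made for illustration.

```python
# conftest.py (illustrative sketch; the marker name "slow" and the
# messages below are assumptions, not taken from this PR's diff)
import pytest


def pytest_addoption(parser):
    # Register the --skip-slow command-line flag.
    parser.addoption(
        "--skip-slow",
        action="store_true",
        default=False,
        help="Skip tests marked as slow.",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "slow: mark a test as slow to run")


def pytest_collection_modifyitems(config, items):
    # Without the flag, run everything as usual.
    if not config.getoption("--skip-slow"):
        return
    skip_slow = pytest.mark.skip(reason="skipped by --skip-slow")
    for item in items:
        # item.keywords contains the names of all marks applied to the test.
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With something like this in place, a test is tagged with `@pytest.mark.slow`, and `pytest --skip-slow` reports the tagged tests as skipped instead of running them.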

Also fix tests in test_mb_seq_first.py that were incorrectly skipped (counter-productive for test time, but needed).

Total test time went from more than 15 minutes to ~7 minutes, or 30 seconds with --skip-slow.

Known issues:

test_triton_cross_entropy fails often.
test_checkpoint_and_eval is suspiciously slow.
test_model_dp2_sp2_pp2s1 isn't working.

🔍 Type of change

Select all that apply:

[ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
[ ] 🚀 New feature (non-breaking change that adds functionality)
[ ] ⚠️ Breaking change (a change that could affect existing functionality)
[x] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
[ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
[ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
[ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
[ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Details

Fewer processes and smaller models help a lot, but still leave some slow tests, mainly the Megatron and multi-GPU tests.

Breakdown of slower tests:

41.13s call     tests/test_mb.py::test_model_dp2_tp2_pp2s2_bf4
23.27s call     tests/test_mb_seq_first.py::test_model_dp2_sp2_df4
22.37s call     tests/test_match_megatron.py::test_mistral_meg
22.36s call     tests/test_mb.py::test_model_pp2s1_bf4
21.66s call     tests/test_mb.py::test_model_pp2s2_bf4
20.45s call     tests/test_match_megatron.py::test_gpt2_meg
20.04s call     tests/test_simple.py::test_model_dp2
19.66s call     tests/test_match_megatron.py::test_mixtral_meg
19.57s call     tests/test_mb.py::test_model_df4_z3
19.38s call     tests/test_ms.py::test_model_pp2s2_ms256
17.75s call     tests/test_seq_first.py::test_model_sp2
17.35s call     tests/test_simple.py::test_model_tp2
17.33s call     tests/test_simple.py::test_model_dp2_z2
17.04s call     tests/test_seq_first.py::test_model_sp2_ce4
17.01s call     tests/test_checkpoint.py::test_load_pretrained_distributed_in_dp2
16.96s call     tests/test_checkpoint.py::test_load_pretrained_state_dict_in_dp2
16.84s call     tests/test_checkpoint.py::test_load_pretrained_huggingface_in_dp2
16.65s call     tests/test_simple.py::test_model_dp2_z3
8.06s call     tests/test_checkpoint.py::test_checkpoint_and_eval
6.33s call     tests/test_functional.py::test_dropless_mlp
5.62s call     tests/test_match_megatron.py::test_mixtral_match_meg
2.80s call     tests/test_simple.py::test_model
1.67s call     tests/test_checkpoint.py::test_resume
[...]
1 failed, 94 passed, 6 skipped, 7 warnings in 405.31s (0:06:45)

ServiceNow / Fast-LLM

Faster tests #23
