Mlperf/4.1 grain

AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!

Apache License 2.0

1.47k stars, 275 forks

Use a custom grain wheel. This requires rebuilding the dependency image; see the changes in requirements.txt and setup.sh.
Add grain support for per_device_batch_size < 1 while eval_per_device_batch_size > 1.
Ran on v5p-128 using xpk (per_device_batch_size=0.5, eval_per_device_batch_size=1) with the script in this PR.
Command: bash MaxText/configs/v5e/mlperf-grain-tpu.sh PER_DEVICE_BATCH_SIZE=0.5 GRAIN_WORKER_COUNT=1 ICI_TENSOR=16 PLATFORM=gke
Results: https://cloudlogging.app.goo.gl/5BpmD9iCEm7uVGNx5
Padding batches for eval do not work well yet. I suggest testing first with eval off (eval_interval=-1), then with eval on. On some slice sizes, eval may hang instead of terminating. I'll continue working on that.
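
For reviewers, here is a rough sketch of the two ideas above: how a fractional per_device_batch_size maps to a global batch size, and what padding a short eval batch means. It is not the code in this PR and does not use the grain API. The helper names (global_batch_size, pad_eval_batch), the pad_id default, and the device count of 8 are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def global_batch_size(per_device_batch_size: float, num_devices: int) -> int:
    """Global batch size implied by a (possibly fractional) per-device batch size.

    Assumption: a value below 1 means one example is shared by several devices,
    so the product must still be a whole number of examples.
    """
    total = per_device_batch_size * num_devices
    if total < 1 or total != int(total):
        raise ValueError(
            f"per_device_batch_size={per_device_batch_size} on {num_devices} "
            "devices does not give a whole number of examples"
        )
    return int(total)


def pad_eval_batch(
    rows: List[List[int]], target_rows: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad a short eval batch with dummy rows up to target_rows.

    Returns the padded batch and a mask (1 = real row, 0 = padding) so that
    padded rows can be excluded from the eval metrics.
    """
    if not rows:
        raise ValueError("cannot infer the row width from an empty batch")
    if len(rows) > target_rows:
        raise ValueError("batch is already larger than target_rows")
    width = len(rows[0])
    n_pad = target_rows - len(rows)
    padded = rows + [[pad_id] * width for _ in range(n_pad)]
    mask = [1] * len(rows) + [0] * n_pad
    return padded, mask


# Hypothetical setup: 8 devices, training at 0.5 and eval at 1 example per device.
train_rows = global_batch_size(0.5, 8)
eval_rows = global_batch_size(1, 8)

# The last eval batch has only 2 real rows, so it is padded to the full eval size.
padded, mask = pad_eval_batch([[5, 6, 7], [8, 9, 10]], eval_rows)
```

The mask lets every host run the same number of eval steps without the padding rows affecting the loss. Whether that alone avoids the hang on some slice sizes is not established here.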
