Closed by patemotter 2 months ago
Adds the Ragged Attention Pallas kernel as an option when performing autoregressive attention. By default this is disabled and does not interfere with the existing AR attention; it can be enabled with the CLI argument `use_ragged_attention=true`.

Improvements are most noticeable when `quantize_kvcache=false`, `ar_cache_axis_order = prefill_cache_axis_order = "0,2,1,3"`, and as `max_prefill_predict_length` and `max_target_length` increase.
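A minimal sketch of an invocation that enables the kernel under the conditions listed above. The flag names are taken from this description; the entry point (`MaxText/decode.py`), the base config path, and the specific length values are assumptions about a typical MaxText setup, not part of this PR.

```bash
# Hypothetical MaxText decode invocation; adjust paths and length values
# for your checkout. Flag names match this PR's description.
python3 MaxText/decode.py MaxText/configs/base.yml \
  use_ragged_attention=true \
  quantize_kvcache=false \
  ar_cache_axis_order="0,2,1,3" \
  prefill_cache_axis_order="0,2,1,3" \
  max_prefill_predict_length=1024 \
  max_target_length=2048
```

Larger `max_prefill_predict_length` and `max_target_length` values should make the ragged kernel's advantage over the default AR attention more visible, presumably because a ragged kernel can skip attention work beyond each sequence's true length rather than computing over the full padded cache.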