elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Add Gemma attention head size #364

Closed · cmeon closed this 3 months ago

cmeon commented 3 months ago

This fix adds `attention_head_size`, which for Gemma is 256 rather than the usual `hidden_size / num_attention_heads`.

Without this, the runtime ignores the parameters with mismatched shapes during model building, which leads to unintelligible output:

00:15:21.191 [debug] the following parameters were ignored, because of non-matching shape:

  * decoder.blocks.24.self_attention.value.kernel (expected {3072, 3072}, got: {3072, 4096})
  * decoder.blocks.18.self_attention.value.kernel (expected {3072, 3072}, got: {3072, 4096})
  * decoder.blocks.18.self_attention.output.kernel (expected {3072, 3072}, got: {4096, 3072})
  ...
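For context, here is a minimal sketch of the shape arithmetic behind those messages (the concrete numbers assume Gemma 7B's 3072 hidden size and 16 attention heads; the variable names are illustrative, not Bumblebee internals):

```elixir
# Illustrative only: shows why the projection shapes differ when the head
# size is decoupled from the hidden size.

hidden_size = 3072
num_attention_heads = 16

# Default assumption: head size derived from the hidden size.
derived_head_size = div(hidden_size, num_attention_heads)
# => 192, so the value projection is built as {3072, 16 * 192} = {3072, 3072}

# Gemma instead fixes the head size at 256.
attention_head_size = 256
inner_size = num_attention_heads * attention_head_size
# => 4096, so the checkpoint's value kernel is {3072, 4096}, hence the
#    "expected {3072, 3072}, got: {3072, 4096}" messages above.

IO.inspect({hidden_size, inner_size}, label: "value kernel shape")
```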

See: