Closed: EricLBuehler closed this issue 4 months ago
Need to modify Gemma model implementation with:
```python
if self.config.attn_logit_softcapping is not None:
    attn_weights = attn_weights / self.config.attn_logit_softcapping
    attn_weights = torch.tanh(attn_weights)
    attn_weights = attn_weights * self.config.attn_logit_softcapping
```
```python
if self.config.final_logit_softcapping is not None:
    logits = logits / self.config.final_logit_softcapping
    logits = torch.tanh(logits)
    logits = logits * self.config.final_logit_softcapping
```
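For intuition, the soft-capping above just squashes values smoothly into the range `(-cap, cap)`. A minimal standalone sketch (the `soft_cap` helper and the example values are illustrative, not taken from the model code):

```python
import torch

def soft_cap(x: torch.Tensor, cap: float) -> torch.Tensor:
    # Divide by the cap, squash into (-1, 1) with tanh, then scale back up,
    # so outputs are smoothly bounded by +/- cap.
    return torch.tanh(x / cap) * cap

scores = torch.tensor([-120.0, -5.0, 0.0, 5.0, 120.0])
print(soft_cap(scores, 50.0))  # approximately [-49.2, -5.0, 0.0, 5.0, 49.2]
```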
- Attention scaling by `query_pre_attn_scalar` instead of `1/sqrt(head_dim)` (see the sketch below)
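A rough sketch of what that scaling change amounts to, under the assumption that the scores are multiplied by `query_pre_attn_scalar ** -0.5` from the config rather than `head_dim ** -0.5` (the function name and the example value 256 are assumptions for illustration, not quoted from the issue):

```python
import torch

# Hypothetical illustration of the changed attention scaling.
# query_pre_attn_scalar is read from the model config; 256 here is only an example value.
def attn_scores(q: torch.Tensor, k: torch.Tensor, query_pre_attn_scalar: float) -> torch.Tensor:
    scaling = query_pre_attn_scalar ** -0.5        # replaces head_dim ** -0.5
    return torch.matmul(q, k.transpose(-2, -1)) * scaling

q = torch.randn(1, 8, 16, 256)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 16, 256)
scores = attn_scores(q, k, 256.0)
```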
Implemented in #490.
Changelist over original Gemma and status:
- `query_pre_attn_scalar` instead of `1/sqrt(head_dim)`