This PR adds the StableAdamW optimizer as an option. It requires installing optimi (`pip install torch-optimi`). StableAdamW removes the need for gradient clipping, and I've found it to be a Pareto improvement over AdamW.
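For reference, optimi's `StableAdamW` is a drop-in replacement for `torch.optim.AdamW`. A minimal usage sketch (the model and hyperparameters below are placeholders, not the values used in this PR):

```python
import torch.nn as nn
from optimi import StableAdamW

model = nn.Linear(128, 64)  # stand-in for the real model
optimizer = StableAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```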
This PR also adds the `filter_bias_and_bn` option, which prevents weight decay from being applied to linear bias terms and normalization layers. I left it as `false` to match the current defaults (except in a test), but given that it's a best practice, we should use it for all of our training.
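The idea behind the option is the standard two-group pattern: biases and normalization parameters go into a no-decay group, everything else keeps weight decay. A sketch of that pattern, not necessarily the exact implementation in this PR:

```python
import torch.nn as nn

def split_decay_groups(model: nn.Module, weight_decay: float):
    """Split params into decay / no-decay groups: biases and norm-layer
    parameters get no weight decay; all other parameters do."""
    norm_types = (nn.LayerNorm, nn.BatchNorm1d, nn.BatchNorm2d,
                  nn.BatchNorm3d, nn.GroupNorm)
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            if name.endswith("bias") or isinstance(module, norm_types):
                no_decay.append(param)
            else:
                decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

The resulting parameter groups can be passed straight to `StableAdamW` (or any other optimizer) in place of `model.parameters()`.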