What does this PR do?

This PR adds support for pretrained LongT5 encoder-decoder models from HuggingFace.

General Changes

Add HuggingFacePretrainedEncoderDecoderModel class for loading LongT5
Add span masking collator ported from official Google implementation
Add test for span masking collator
Add example training configs for LongT5
Add logic to distinguish between decoder-only and encoder-decoder models when passing inputs

[x] My PR is minimal and addresses one issue in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[ ] I have checked that all tests run through (python tests/tests.py) (apparently some tests were failing already in upstream? also didn't test multi-GPU)
[ ] I have updated the internal changelog (CHANGELOG_DEV.md)