What does this PR do?

This PR updates the CoCa model so that it can be trained jointly on text-aligned images, audio and video. The webdataset-based dataset and loader are also included.

General Changes

add AudioTransformer model
update the VisionTransformer model for video
add the MultimodalWebDataset dataset for loading audio-text, image-text and video-text pairs in the webdataset format
add a multi-loss function for specifying a weighted sum of different losses
update the CoCa model to include encoders for video and audio
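
For reviewers unfamiliar with the multi-loss idea, here is a minimal sketch of a weighted sum of losses. It is not the implementation in this PR: the class name `WeightedSumLoss`, the `LossFn` alias and the dict-shaped batch are assumptions made for illustration, and plain floats stand in for tensors.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type alias: a loss takes a batch of predictions/targets
# (here a plain dict) and returns a scalar.
LossFn = Callable[[Dict[str, float]], float]


class WeightedSumLoss:
    """Illustrative only: combines several losses into one weighted sum."""

    def __init__(self, losses: Sequence[LossFn], weights: Sequence[float]) -> None:
        if len(losses) != len(weights):
            raise ValueError("each loss needs exactly one weight")
        self.losses = list(losses)
        self.weights = list(weights)

    def __call__(self, batch: Dict[str, float]) -> float:
        # total = sum_i weight_i * loss_i(batch)
        return sum(w * loss(batch) for loss, w in zip(self.losses, self.weights))


# Two stand-in losses (hypothetical names) reading values from the batch.
def contrastive_loss(batch: Dict[str, float]) -> float:
    return batch["contrastive"]


def captioning_loss(batch: Dict[str, float]) -> float:
    return batch["captioning"]


multi_loss = WeightedSumLoss([contrastive_loss, captioning_loss], [0.5, 0.25])
total = multi_loss({"contrastive": 2.0, "captioning": 4.0})
```

With real tensors the same pattern applies unchanged, since scalar multiplication and addition keep the result differentiable.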

Breaking Changes

the LLMDataLoader now holds a PyTorch DataLoader object as a member variable instead of inheriting from it.
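
The practical effect of this change can be sketched as follows. This is a simplified stand-in, not the actual LLMDataLoader: the class name `WrappedLoader`, its constructor arguments and the `dataloader_tag` attribute are assumptions for illustration, and a plain list takes the place of a PyTorch DataLoader so the snippet runs without torch.

```python
from typing import Any, Iterator


class WrappedLoader:
    """Illustrative wrapper: holds a loader as a member (composition)
    instead of subclassing it (inheritance)."""

    def __init__(self, dataloader_tag: str, dataloader: Any) -> None:
        self.dataloader_tag = dataloader_tag
        # In real use this would be a torch.utils.data.DataLoader instance;
        # any object supporting iter() and len() works for this sketch.
        self._dataloader = dataloader

    def __iter__(self) -> Iterator[Any]:
        # Delegate iteration to the wrapped loader.
        return iter(self._dataloader)

    def __len__(self) -> int:
        # Delegate the batch count to the wrapped loader.
        return len(self._dataloader)


inner = [[1, 2], [3, 4]]  # stand-in for a loader yielding two batches
loader = WrappedLoader("train", inner)
batches = list(loader)
```

Code that only iterates over the loader or calls `len()` on it keeps working. Code that relied on `isinstance` checks against the DataLoader class, or on attributes inherited from it, would need to go through the wrapped member instead.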

Checklist before submitting final PR

[ ] My PR is minimal and addresses one issue in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[x] I have checked that all tests run through (python tests/tests.py); some tests related to MFU calculation were failing, but I think those are unrelated to this PR
[x] I have updated the internal changelog (CHANGELOG_DEV.md)

Feat/coca #263
