This PR updates the CoCa model so that it can be trained jointly on text-aligned images, audio and video. The webdataset-based dataset and loader are also included.
General Changes
add AudioTransformer model
update the VisionTransformer model for video
add the MultimodalWebDataset dataset for loading audio-text, image-text and video-text pairs in the webdataset format
add a multi-loss function for specifying a weighted sum of different losses
update the CoCa model to include encoders for video and audio
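To illustrate the weighted-sum idea behind the multi-loss function, here is a minimal sketch. The names `weighted_loss_sum`, `losses` and `weights` are hypothetical and not taken from this PR, and the numbers are made up; the actual implementation may look different.

```python
from typing import Mapping


def weighted_loss_sum(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the sum of weights[name] * losses[name] over all loss terms.

    Hypothetical helper for illustration only. Values are floats here, but
    any objects supporting * and + (for example tensors) would work.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for loss terms: {sorted(missing)}")
    total = 0.0
    for name, value in losses.items():
        total = total + weights[name] * value
    return total


if __name__ == "__main__":
    # Illustrative values only, e.g. one contrastive and one captioning term.
    example_losses = {"contrastive": 0.5, "captioning": 2.0}
    example_weights = {"contrastive": 1.0, "captioning": 2.0}
    print(weighted_loss_sum(example_losses, example_weights))
```

Raising on a missing weight, rather than silently defaulting to 1.0, is a design choice of this sketch so that a misconfigured loss term is caught early.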
Breaking Changes
the LLMDataLoader now contains a PyTorch DataLoader object as a member variable instead of inheriting from it.
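For callers, the switch from inheritance to composition roughly looks like the sketch below. `WrappedLoader`, `dataloader_tag` and `inner` are hypothetical names, and a plain list of batches stands in for the PyTorch DataLoader so that the example runs without extra dependencies; the real class in this PR may expose its member differently.

```python
from typing import Iterable, Iterator, List


class WrappedLoader:
    """Hypothetical loader that holds an inner loader instead of subclassing it."""

    def __init__(self, dataloader_tag: str, inner: Iterable[List[int]]) -> None:
        self.dataloader_tag = dataloader_tag
        # Stand-in for the wrapped PyTorch DataLoader member.
        self.inner = inner

    def __iter__(self) -> Iterator[List[int]]:
        # Iteration is delegated to the wrapped loader.
        return iter(self.inner)


batches = [[0, 1], [2, 3]]
loader = WrappedLoader("train", batches)

# Iterating the wrapper still yields the inner loader's batches.
collected = list(loader)

# The wrapper is no longer an instance of the inner loader's type,
# so isinstance checks against that type stop matching.
is_inner_type = isinstance(loader, list)
```

Code that previously relied on the loader being a DataLoader subclass (for example `isinstance` checks or direct access to inherited attributes) would need to go through the member instead.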
Checklist before submitting final PR
[ ] My PR is minimal and addresses one issue in isolation
[x] I have merged the latest version of the target branch into this feature branch
[x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[x] I have run a sample config for model training
[x] I have checked that all tests run through (python tests/tests.py). Some tests related to MFU calculation were failing, but I think those are unrelated to this PR.
[x] I have updated the internal changelog (CHANGELOG_DEV.md)