For now, the audio embedding has shape (batch_size, 8). The 8 is arbitrary; we can revisit it once we start on the transformer fusion and see how things fit together.
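
In case it helps, here is a minimal sketch of keeping the arbitrary 8 in a single place so it is easy to change once the fusion work starts. `AudioConfig`, `placeholder_audio_embedding`, and `shape` are hypothetical names for illustration, not code from this PR, and the sketch uses plain Python lists instead of tensors to stay dependency-free.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioConfig:
    # Hypothetical config object; 8 mirrors the arbitrary value used for now.
    embed_dim: int = 8


def placeholder_audio_embedding(batch_size, config=AudioConfig(), seed=0):
    """Return a (batch_size, embed_dim) nested list standing in for the encoder output."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(config.embed_dim)]
        for _ in range(batch_size)
    ]


def shape(batch):
    """Return (rows, cols) of a rectangular nested list."""
    return (len(batch), len(batch[0]) if batch else 0)


if __name__ == "__main__":
    # Default dimension, then a wider one to show the change is a one-line edit.
    print(shape(placeholder_audio_embedding(4)))
    print(shape(placeholder_audio_embedding(4, AudioConfig(embed_dim=16))))
```

With something along these lines, the fusion code could read the dimension from the config rather than hard-coding 8.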
@aiden200, I recommend rebasing your dev branch onto this one to resolve potential merge conflicts. That way you will also pick up the audio changes proposed in this PR.