Training the model with different data format

Hi, I'm relatively new to ML and want to try training an audio to co-speech gesture model that is better suited to game engines like Unreal Engine 5.

My question: what if I feed a different type of data into the body VQ-VAE, using joint positions in 3D space instead of joint angles, and then pass the resulting vector-quantized codes to the body transformer? The goal is essentially to train a similar model that outputs joint positions instead of joint angles.
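To make the idea concrete, here is a rough sketch of the data path I have in mind. It is not code from this repo: the function names, the `(T, J, 3)` array layout, the root-relative normalization, and the random codebook are all my own assumptions for illustration, and a real VQ-VAE would run a learned encoder before the codebook lookup.

```python
import numpy as np

def to_root_relative(positions, root_index=0):
    """Hypothetical preprocessing: (T, J, 3) joint positions -> (T, J*3) features.

    Subtracts the root joint per frame so the features do not depend on
    where the character stands in the world.
    """
    rel = positions - positions[:, root_index:root_index + 1, :]
    return rel.reshape(positions.shape[0], -1)

def quantize(features, codebook):
    """Nearest-neighbour codebook lookup (the VQ step only, no encoder).

    features: (T, D), codebook: (K, D) -> codes (T,), quantized (T, D)
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]

# Random stand-in data: 8 frames, 5 joints, 16 codebook entries.
rng = np.random.default_rng(0)
positions = rng.normal(size=(8, 5, 3))
features = to_root_relative(positions)
codebook = rng.normal(size=(16, features.shape[1]))
codes, quantized = quantize(features, codebook)
print(codes.shape, quantized.shape)
```

The integer `codes` are what I would then hand to the body transformer as training targets, in place of the codes computed from joint angles.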

How practical is this, and would it work? If so, what considerations would I have to make?

Any help would be greatly appreciated, thanks!

facebookresearch / audio2photoreal

Training the model with different data format #49