Open · dengyuanjie opened this issue 2 years ago
Thank you very much for your excellent work. One thing I am confused about is the definition of the crossmodal loss function and the co-separation loss function. In `train.py`, why are random numbers and `opt.gt_percentage` used to select which audio feature (`audio_embedding_A1_pred` or `audio_embedding_A1_gt`) is used? According to the method in the paper, shouldn't the predicted features be used?
Because the separated sounds can be of low quality, especially in the early stage of training, their embeddings can be unreliable. The ground-truth audios, and therefore their embeddings, are always reliable. So this step just makes sure the crossmodal loss and co-separation loss learn from meaningful feature embeddings that lead to meaningful distance metrics in the embedding space. The loss computed on the predicted embeddings is then the part that actually drives the separation learning.
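A minimal sketch of what such a selection could look like, assuming PyTorch; `gt_percentage` and the embedding names come from the question above, while the MSE distance term is only a placeholder for the actual crossmodal/co-separation loss:

```python
import random
import torch.nn.functional as F

def embedding_loss(audio_embedding_A1_pred, audio_embedding_A1_gt,
                   visual_embedding_A1, gt_percentage=0.5):
    """Compute a distance loss on either the GT or the predicted embedding.

    Early in training the separated audio is poor, so with probability
    `gt_percentage` the reliable ground-truth embedding is used instead,
    keeping the distance metric in the embedding space meaningful.
    """
    if random.random() < gt_percentage:
        audio_embedding = audio_embedding_A1_gt    # reliable supervision
    else:
        audio_embedding = audio_embedding_A1_pred  # drives separation learning

    # Placeholder distance term; the real losses would compare matched and
    # mismatched audio/visual pairs rather than use a plain MSE.
    return F.mse_loss(audio_embedding, visual_embedding_A1)
```

The design point is that the ground-truth branch only stabilizes the metric; the gradients from the predicted branch are what improve the separator.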
Thanks for your quick reply!