facebookresearch / VisualVoice

Audio-Visual Speech Separation with Cross-Modal Consistency

Why the SDR performance in the paper cannot be reproduced #1

Closed junwenxiong closed 3 years ago

junwenxiong commented 3 years ago

I've trained the model for up to 200 thousand iterations, but the SDR is only 7.4 on the unseen_unheard_test set. I'd like to know how long the model was trained for the paper. Due to the limited number of GPUs, I can't use the paper's configuration. Any advice would help.

My training config is below:

  --gpu_ids 0,1 \
  --batchSize 10 \
  --nThreads 16 \
  --decay_factor 0.5 \
  --num_batch 400000 \
  --lr_steps 40000 80000 120000 160000 200000 \
  --coseparation_loss_weight 0.01 \
  --mixandseparate_loss_weight 1 \
  --crossmodal_loss_weight 0.01 \
  --lr_lipreading 0.0001 \
  --lr_facial_attributes 0.00001 \
  --lr_unet 0.0001 \
  --lr_vocal_attributes 0.00001 \
  --

rhgao commented 3 years ago

Hi, the training was done using a batch size of 128 for 50,000 batches for the separation model, as specified in the README. The model was trained on an 8-GPU machine with V100 GPUs. I have shared some pre-trained models that you can directly use/test without re-training. Hope that helps.
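
A rough sense of scale (not from the paper, just arithmetic on the numbers above): that setting sees 128 × 50,000 = 6.4M mixture pairs during training, so a smaller batch size needs proportionally more batches, and probably rescaled --lr_steps, just to cover the same amount of data. A back-of-the-envelope sketch, using the batch sizes mentioned elsewhere in this thread (not settings from the paper):

  # Back-of-the-envelope: how many batches a smaller batch size needs in order to
  # see the same number of mixture pairs as the paper's setting (128 x 50,000).
  PAPER_BATCH_SIZE = 128
  PAPER_NUM_BATCHES = 50_000
  total_pairs = PAPER_BATCH_SIZE * PAPER_NUM_BATCHES  # 6,400,000 mixture pairs

  for batch_size in (10, 32):  # batch sizes mentioned in this thread, not paper settings
      num_batches = total_pairs // batch_size
      print(f"batch size {batch_size}: ~{num_batches:,} batches to match the paper's data exposure")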

junwenxiong commented 3 years ago

Thank you so much! Because I only have 2 GPUs, the training will take longer. I'll just let it train for a few days!!!

MessyPaste commented 3 years ago

> Thank you so much! Because I only have 2 GPUs, the training will take longer. I'll just let it train for a few days!!!

Hello, were you able to reproduce the training results? I have 8 2080 Ti GPUs with a batch size of 32, and I still have a 2 dB gap from the author's numbers.

wxystudio commented 3 years ago

> Thank you so much! Because I only have 2 GPUs, the training will take longer. I'll just let it train for a few days!!!

> Hello, were you able to reproduce the training results? I have 8 2080 Ti GPUs with a batch size of 32, and I still have a 2 dB gap from the author's numbers.

I can reproduce a result nearly the same as the released model with 8 V100 GPUs. But I don't know how to create the test dataset; I just randomly chose some samples from the test dir and obtained SDR = 7.8, and SDR = 7.3 with the released model.
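
For reference, the SDR figures in this thread are presumably the standard BSS-eval SDR; a minimal sketch of computing it for one pair with mir_eval (the wav paths are placeholders, and the repo's own evaluateSeparation.py is the authoritative implementation):

  import numpy as np
  import soundfile as sf
  from mir_eval.separation import bss_eval_sources

  # Placeholder paths: ground-truth and separated signals for one mixture pair.
  gt1, sr = sf.read('speaker1_gt.wav')
  gt2, _ = sf.read('speaker2_gt.wav')
  est1, _ = sf.read('speaker1_separated.wav')
  est2, _ = sf.read('speaker2_separated.wav')

  # Trim to a common length and stack as (n_sources, n_samples).
  n = min(len(gt1), len(gt2), len(est1), len(est2))
  reference = np.stack([gt1[:n], gt2[:n]])
  estimated = np.stack([est1[:n], est2[:n]])

  # bss_eval_sources resolves the source permutation and returns per-source SDR/SIR/SAR.
  sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
  print('mean SDR for this pair:', sdr.mean())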

rhgao commented 3 years ago

For the results in the paper, I used the version with context shared here: https://github.com/facebookresearch/VisualVoice/tree/master/av-separation-with-context. I have shared the pre-trained models that you may use. We randomly sampled 2000 pairs for testing, and I dug out a script that I used for evaluation (posted in the next comment). Hope that helps!

rhgao commented 3 years ago
  #!/usr/bin/env python
  import os
  import subprocess
  import json
  import random
  import glob
  import h5py
  import argparse
  import pickle

  # Split sequence a into n contiguous chunks of roughly equal size.
  def split(a, n):
      k, m = divmod(len(a), n)
      return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

  #python testBunchSubmit.py --model_root checkpoints/exp6/ --data_root /private/home/rhgao/datasets/VoxCeleb2/seen_heard_test/ --output_dir results/exp6_seen_heard_test --num_of_examples_to_test 2000 --num_of_jobs_to_submit 10

  def main():
      parser = argparse.ArgumentParser()
      parser.add_argument('--model_root', type=str, required=True)
      parser.add_argument('--data_root', type=str, required=True)
      parser.add_argument('--output_dir', type=str, required=True)
      parser.add_argument('--num_of_examples_to_test', type=int, default=10, help="number of mixture pairs to test")
      parser.add_argument('--num_of_jobs_to_submit', type=int, default=1, help="number of slurm jobs to submit")
      parser.add_argument('--job_index_start', type=int, default=0, help="starting index for naming the generated job scripts")

      args = parser.parse_args()
      if not os.path.isdir(args.output_dir):
          os.makedirs(args.output_dir)
      # The script writes per-job shell scripts and slurm logs into these directories.
      os.makedirs('slurm_script', exist_ok=True)
      os.makedirs('slurm_output', exist_ok=True)

      IDs = sorted(os.listdir(os.path.join(args.data_root, 'mp4')))
      random.seed(0)

      # Build all ordered pairs of distinct identities, then randomly sample
      # args.num_of_examples_to_test of them (2000 for the results in the paper).
      id_pairs = []
      for i in range(len(IDs)):
          for j in range(len(IDs)):
              if i == j:
                  continue
              id_pairs.append((i, j))

      print(len(id_pairs))

      selected_id_pairs = random.sample(id_pairs, args.num_of_examples_to_test)

      # For each sampled pair, pick one random clip per identity and build a
      # test.py command followed by evaluateSeparation.py on its output directory.
      cmds2execute = []
      for id_pair in selected_id_pairs:
          i = id_pair[0]
          j = id_pair[1]
          video1_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[i])))
          clip1_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[i], video1_name)))
          video1_path = os.path.join(args.data_root, 'mp4', IDs[i], video1_name, clip1_name[:-4] + '.mp4')
          audio1_path = os.path.join(args.data_root, 'aac', IDs[i], video1_name, clip1_name[:-4] + '.wav')
          mouthroi1_path = os.path.join(args.data_root, 'mouth_roi_hdf5', IDs[i], video1_name, clip1_name[:-4] + '.h5')
          video2_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[j])))
          clip2_name = random.choice(os.listdir(os.path.join(args.data_root, 'mp4', IDs[j], video2_name)))
          video2_path = os.path.join(args.data_root, 'mp4', IDs[j], video2_name, clip2_name[:-4] + '.mp4')
          audio2_path = os.path.join(args.data_root, 'aac', IDs[j], video2_name, clip2_name[:-4] + '.wav')
          mouthroi2_path = os.path.join(args.data_root, 'mouth_roi_hdf5', IDs[j], video2_name, clip2_name[:-4] + '.h5')
          cmd = 'python test.py' +  \
              ' --audio1_path ' + audio1_path + \
              ' --audio2_path ' + audio2_path + \
              ' --mouthroi1_path ' + mouthroi1_path + \
              ' --mouthroi2_path ' + mouthroi2_path + \
              ' --video1_path ' + video1_path + \
              ' --video2_path ' + video2_path + \
              ' --num_frames 64 --video_sampling_rate 2 ' + \
              ' --audio_length 2.55 --hop_size 160 --window_size 400 --n_fft 512 ' + \
              ' --weights_lipreadingnet ' + os.path.join(args.model_root, 'lipreading_best.pth') + \
              ' --weights_identity ' + os.path.join(args.model_root, 'identity_best.pth') + \
              ' --weights_unet ' + os.path.join(args.model_root, 'unet_best.pth') + \
              ' --weights_classifier ' + os.path.join(args.model_root, 'classifier_best.pth') + \
              ' --lipreading_config_path configs/lrw_snv1x_tcn2x.json ' + \
              ' --unet_type beta9 --unet_output_nc 2 --normalization --mask_to_use pred ' + \
              ' --visual_feature_type both --identitynet_type resnet18 --voicenet_type resnet18 --identity_feature_dim 128 --audioVisual_feature_dim 1152 --visual_pool maxpool ' + \
              ' --compression_type none --hyperbolic_compression_K 10 --hyperbolic_compression_C 0.1 ' + \
              ' --sigmoidal_compression_a 1 --sigmoidal_compression_b 0 --mask_clip_threshold 5 --hop_length 2.55 ' + \
              ' --reliable_face --lipreading_extract_feature --number_of_identity_frames 1 --output_dir_root ' + os.path.join(args.output_dir)
          cmd = cmd + ' ; python evaluateSeparation.py --results_dir ' + os.path.join(args.output_dir, IDs[i]+'_'+video1_name+'_'+clip1_name[:-4]+'VS'+IDs[j]+'_'+video2_name+'_'+clip2_name[:-4])
          cmd = cmd + '\n'
          cmds2execute.append(cmd)

      # Shard the commands into one shell script per slurm job.
      job_splits = list(split(range(len(cmds2execute)), args.num_of_jobs_to_submit))
      count = 0
      for sub_split in job_splits:
          count = count + 1
          print(count)
          script2exe = open('slurm_script/' + str(args.job_index_start + count) + '.sh', 'w')
          script2exe.write('#!/bin/bash\n')
          script2exe.write('export CUDA_VISIBLE_DEVICES=0\n')
          for i in sub_split:
              script2exe.write(cmds2execute[i])
          script2exe.close()
          cmd = 'chmod a+x \'slurm_script/' + str(args.job_index_start + count)  + '.sh\''
          subprocess.call(cmd, shell=True)

          # generate slurm submit file
          slurm_file = open('submit.slurm','w')
          slurm_file.write('#!/bin/bash\n')
          slurm_file.write('#SBATCH --job-name=' + 'test-' + str(args.job_index_start + count) + '\n')
          slurm_file.write('#SBATCH --output=' + 'slurm_output/test' + str(args.job_index_start + count) + '.out\n')
          slurm_file.write('#SBATCH --error=' + 'slurm_output/test' + str(args.job_index_start + count) + '.err\n')
          slurm_file.write('#SBATCH --nodes=1\n')
          slurm_file.write('#SBATCH --ntasks-per-node=1\n')
          slurm_file.write('#SBATCH --time 2:00:00\n')
          slurm_file.write('#SBATCH --partition=learnfair\n')  #scavenge
          slurm_file.write('#SBATCH --gres=gpu:1\n')
          slurm_file.write('#SBATCH --cpus-per-task=20\n')
          slurm_file.write('# Module init\n')
          slurm_file.write('module purge\n')
          slurm_file.write('module load anaconda3\n')
          slurm_file.write('module load cuda/10.0 NCCL cudnn\n')
          slurm_file.write('source activate video_clone\n')
          slurm_file.write('srun --label slurm_script/' + str(args.job_index_start + count) + '.sh\n')
          slurm_file.close()
          cmd = 'sbatch submit.slurm'
          subprocess.call(cmd, shell=True)

  if __name__ == '__main__':
      main()

MessyPaste commented 3 years ago

> Thank you so much! Because I only have 2 GPUs, the training will take longer. I'll just let it train for a few days!!!

> Hello, were you able to reproduce the training results? I have 8 2080 Ti GPUs with a batch size of 32, and I still have a 2 dB gap from the author's numbers.

> I can reproduce a result nearly the same as the released model with 8 V100 GPUs. But I don't know how to create the test dataset; I just randomly chose some samples from the test dir and obtained SDR = 7.8, and SDR = 7.3 with the released model.

Hello! I trained from scratch and tested on my own test set, and still have a 2-3 dB gap. I also tested the author's pre-trained models and got 9.8 dB, close to the author's 10.1 dB. So I was wondering whether the batch size is the key problem.

My test set, attached below, is created from the author's unseen-unheard test set. All pairs are fully randomized, which gives 1711 pairs: (59*59 - 59)/2.

The test split attachment is below; you can run it and report your results here. Thank you. mixturess_1711_short.txt
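
For anyone rebuilding such a split: 1711 is just the number of unordered pairs of distinct identities among the 59 test speakers, (59*59 - 59)/2 = 59*58/2, whereas the author's script above samples 2000 ordered pairs. A minimal sketch (the identity names are placeholders):

  import itertools
  import random

  # Placeholder: the 59 identity directories of the unseen_unheard_test set.
  ids = [f'id{i:05d}' for i in range(59)]

  # Unordered pairs of distinct identities: 59 * 58 / 2 = 1711.
  unordered_pairs = list(itertools.combinations(ids, 2))
  print(len(unordered_pairs))  # 1711

  # The author's script instead samples 2000 of the 59 * 58 = 3422 ordered pairs.
  random.seed(0)
  ordered_pairs = [(a, b) for a in ids for b in ids if a != b]
  sampled = random.sample(ordered_pairs, 2000)
  print(len(ordered_pairs), len(sampled))  # 3422 2000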